Parallel text processing in julia - parallel-processing

I'm trying to write a simple function that reads a series of files and performs some regex search (or just a word count) on them and then return the number of matches, and I'm trying to make this run in parallel to speed it up, but so far I have been unable to achieve this.
If I do a simple loop with a math operation I do get significant performance increases. However, a similar idea for the grep function doesn't provide speed increases:
function open_count(file)
fh = open(file)
text = readall(fh)
length(split(text))
end
tic()
total = 0
for name in files
total += open_count(string(dir,"/",name))
total
end
toc()
elapsed time: 29.474181026 seconds
tic()
total = 0
total = #parallel (+) for name in files
open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.086511895 seconds
I tried different versions but also got no significant speed increases. Am I doing something wrong?

I've had similar problems with R and Python. As others pointed out in the comment, you should start with the profiler.
If the read is taking up the majority of time then there's not much you can do. You can try moving the files to different hard drives and read them in from there.
You can also try a RAMDisk kind of solution, which basically makes your RAM look like permanent storage (reducing available ram) but then you can get very fast read and writes.
However, if the time is used to do the regex, than consider the following:
Create a function that reads in one file as whole and splits out separate lines. That should be a continuous read hence as fast as possible. Then create a parallel version of your regex which processes each line in parallel. This way the whole file is in memory and your computing cores can munge the data a faster rate. That way you might see some increase in performance.
This is a technique I used when trying to process large text files.

Related

What is the best way to do a very large and repetitive task?

I need to get the duration of 18000+ audios, using the audioread library for each audio takes about 300ms, ie at least 25~30 minutes of processing.
Using a Queue andProcess system, using all available cores of my processor, I can lower the processing average of each audio to 70ms, but it will still take 21 minutes, how can I improve this? I would like to be able to read all the audios in at least 5 minutes, remembering that I have no competition on the machine, it will only run my software, so I can consume all the resources.
Code of function read the duration:
def get_duration(q, au):
while not q.empty():
index = q.get()
with audioread.audio_open(au[index]['absolute_path']) as f:
au[index]['duration'] = f.duration * 1000
Code to create the Processes:
for i in range(os.cpu_count()):
pr = Process(target=get_duration, args=(queue, audios, ))
pr.daemon = True
pr.start()
In my code there is only one Queue with some Process, and I use Manager to edit the objects.
I would look into using a complied solution to speed up your script. Pythons threading still leaves room for improvement.
Try a compiled language.
If you still want python, separate as many functions to different threads as your system supports. use threading.thread.
Also see this wiki for info on efficient loops in python. The gist of the story is that for loops incur overhead.

Memory usage for a parallel stream from File.lines()

I am reading lines from large files (8GB+) using Files.lines(). If processing sequentially it works great, with a very low memory footprint. As soon as I add parallel() to the stream it seems to hang onto the data it is processing perpetually, eventually causing an out of memory exception. I believe this is the result of the Spliterator caching data when trying to split, but I'm not sure. My only idea left is to write a custom Spliterator with a trySplit method that peels off a small amount of data to split instead of trying to split the file in half or more. Has anyone else encountered this?
Tracing through the code my guess is the Spliterator used by Files.lines() is Spliterators.IteratorSpliterator. whose trySplit() method has this comment:
/*
* Split into arrays of arithmetically increasing batch
* sizes. This will only improve parallel performance if
* per-element Consumer actions are more costly than
* transferring them into an array. The use of an
* arithmetic progression in split sizes provides overhead
* vs parallelism bounds that do not particularly favor or
* penalize cases of lightweight vs heavyweight element
* operations, across combinations of #elements vs #cores,
* whether or not either are known. We generate
* O(sqrt(#elements)) splits, allowing O(sqrt(#cores))
* potential speedup.
*/
The code then looks like it splits into batches of multiples of 1024 records (lines). So the first split will read 1024 lines then the next one will read 2048 lines etc on and on. Each split will read larger and larger batch sizes.
If your file is really big, it will eventually hit a max batch size of 33,554,432 which is 1<<25. Remember that's lines not bytes which will probably cause an out of memory error especially when you start having multiple threads read that many.
That also explains the slow down. Those lines are read ahead of time before the thread can process those lines.
So I would either not use parallel() at all or if you must because the computations you are doing are expensive per line, write your own Spliterator that doesn't split like this. Probably just always using a batch of 1024 is fine.
As mentioned by dkatzel.
This problem is caused by the Spliterator.IteratorSplitter which will batch the elements in your stream. Where the batch size will start with 1024 elements and grow to 33,554,432 elements.
Another solution for this can be to use the FixedBatchSpliteratorBase which is proposed in the article on Faster parallel processing in Java using Streams and a spliterator.

Parallel computing and Julia

I have some performance problems with parallel computing in Julia. I am new in both, Julia and parallel calculations.
In order to learn, I parallelized a code that should benefits from parallelization, but it does not.
The program estimates the mean of the mean of the components of arrays whose elements were chosen randomly with an uniform distribution.
Serial version
tic()
function mean_estimate(N::Int)
iter = 100000*2
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
a = mean_estimate(0)
toc()
println("The mean is: ", a)
Parallelized version
addprocs(CPU_CORES - 1)
println("CPU cores ", CPU_CORES)
tic()
#everywhere function mean_estimate(N::Int)
iter = 100000
p = 5000
vec_mean = zeros(iter)
for i = 1:iter
vec_mean[i] = mean( rand(p) )
end
return mean(vec_mean)
end
the_mean = mean(vcat(pmap(mean_estimate,[1,2])...))
toc()
println("The mean is: ", the_mean)
Notes:
The factor 2 in the fourth line of the serial code is because I tried the code in a PC with two cores.
I checked the usage of the two cores with htop, and it seems to be ok.
The outputs I get are:
me#pentium-ws:~/average$ time julia serial.jl
elapsed time: 2.68671022 seconds
The mean is: 0.49999736055814215
real 0m2.961s
user 0m2.928s
sys 0m0.116s
and
me#pentium-ws:~/average$ time julia -p 2 parallel.jl
CPU cores 2
elapsed time: 2.890163089 seconds
The mean is: 0.5000104221069994
real 0m7.576s
user 0m11.744s
sys 0m0.308s
I've noticed that the serial version is slightly faster than the parallelized one for the timed part of the code. Also, that there is large difference in the total execution time.
Questions
Why is the parallelized version slower? (what I am doing wrong?)
Which is the right way to parallelize this program?
Note: I use pmap with vcat because I wish to try with the median too.
Thanks for your help
EDIT
I measured times as #HighPerformanceMark suggested. The tic()/toc() times are the following. The iteration number is 2E6 for every case.
Array Size Single thread Parallel Ratio
5000 2.69 2.89 1.07
100 000 488.77 346.00 0.71
1000 000 4776.58 4438.09 0.93
I am puzzled about why there is not clear trend with array size.
You should pay prime attention to suggestions in the comments.
As #ChrisRackauckas points out, type instability is a common stumbling block for performant Julia code. If you want highly performant code, then make sure that your functions are type-stable. Consider annotating the return type of the function pmap and/or vcat, e.g. f(pids::Vector{Int}) = mean(vcat(pmap(mean_estimate, pids))) :: Float64 or something similar, since pmap does not strongly type its output. Another strategy is to roll your own parallel scheduler. You can use pmap source code as a springboard (see code here).
Furthermore, as #AlexMorley commented, you are confounding your performance measurements by including compilation times. Normally performance of a function f() is measured in Julia by running it twice and measuring only the second run. In the first run, the JIT compiler compiles f() before running it, while the second run uses the compiled function. Compilation incurs a (unwanted) performance cost, so timing the second run avoid measuring the compilation.
If possible, preallocate all outputs. In your code, you have set each worker to allocate its own zeros(iter) and its own rand(p). This can have dramatic performance consequences. A sketch of your code:
# code mean_estimate as two functions
f(p::Int) = mean(rand(p))
function g(iter::Int, p::Int)
vec_mean = zeros(iter)
for i in eachindex(vec_mean)
vec_mean[i] = f(p)
end
return mean(vec_mean)
end
# run twice, time on second run to get compute time
g(200000, 5000)
#time g(200000, 5000)
### output on my machine
# 2.792953 seconds (600.01 k allocations: 7.470 GB, 24.65% gc time)
# 0.4999951853035917
The #time macro is alerting you that the garbage collector is cleaning up a lot of allocated memory during execution, several gigabytes in fact. This kills performance. Memory allocations may be overshadowing any distinction between your serial and parallel compute times.
Lastly, remember that parallel computing incurs overhead from scheduling and managing individual workers. Your workers are computing the mean of the means of many random vectors of length 5000. But you could succinctly compute the mean (or median) of, say, 5M entries with
x = rand(5_000_000)
mean(x)
#time mean(x) # 0.002854 seconds (5 allocations: 176 bytes)
so it is unclear how your parallel computing scheme improves upon serial performance. Parallel computing generally provides the best help when your arrays are truly beefy or your calculations are arithmetically intense, and vector means probably do not fall in that domain.
One last note: you may want to peek at SharedArrays, which distribute arrays over several workers with a common memory pool, or the experimental multithreading facilities in Julia. You may find those parallel frameworks more intuitive than pmap.

Different running times with Python

I'm writing a very simple program to calculate the factorial of a number.
Here it is:
import time
def factorial1(n):
fattoriale = 1
while (n > 0):
fattoriale = fattoriale * n
n = n - 1
return fattoriale
start_time = time.clock()
factorial1(v)
print float(time.clock() - start_time), "seconds"
The strange point (for me) are the results in term of execution time (on a value):
1° run: 0.000301 seconds
2° run: 0.000430 seconds
3° run: 0.000278 seconds
Why do you think it's so variable?
Does it has something to do with the float type approximation?
Thanks, Gianluca
On Unix based systems time.clock returns the CPU time, not the wall-clock time.
Your program is deterministic (even the print is) and on an ideal system should always run in the same amount of time. I believe that in your tests your program was interrupted and some interrupt handler was executed or the scheduler paused your process and gave the CPU to some other process. When your process is allowed to run again the CPU cache might have been filled by the other process, so the processor needs to load your code from memory into the cache again. This takes a small amount of time - which you see in your test.
For a good quantization of how fast your program is you should consider not calling factorial1 only once but thousands of times (or with greater input values). When your program runs for multiple seconds, then scheduling effects have less (relative) impact than in your test where you only tested for less than a millisecond.
It probably has a lot to do with sharing of resources. If your program runs as a separate process, it might have to contend for other processes running on your computer at the same time which are using resources like CPU and RAM. These resources are used by other processes as well so 'acquire' them in terms of concurrent terms will take variable times especially if there are some high-priority processes running parallel to it and other things like interupts may have higher priority.
As for your idea, from what I know, the approximation process should not take variable times as it runs a deterministic algorithm. However the approximation process again may have to contend for the resources.

Measuring execution time of selected loops

I want to measure the running times of selected loops in a C program so as to see what percentage of the total time for executing the program (on linux) is spent in these loops. I should be able to specify the loops for which the performance should be measured. I have tried out several tools (vtune, hpctoolkit, oprofile) in the last few days and none of them seem to do this. They all find the performance bottlenecks and just show the time for those. Thats because these tools only store the time taken that is above a threshold (~1ms). So if one loop takes lesser time than that then its execution time won't be reported.
The basic block counting feature of gprof depends on a feature in older compilers thats not supported now.
I could manually write a simple timer using gettimeofday or something like that but for some cases it won't give accurate results. For ex:
for (i = 0; i < 1000; ++i)
{
for (j = 0; j < N; ++j)
{
//do some work here
}
}
Now here I want to measure the total time spent in the inner loop and I will have to put a call to gettimeofday inside the first loop. So gettimeofday itself will get called a 1000 times which introduces its own overhead and the result will be inaccurate.
Unless you have an in circuit emulator or break-out box around your CPU, there's no such thing as timing a single-loop or single-instruction. You need to bulk up your test runs to something that takes at least several seconds each in order to reduce error due to other things going on in the CPU, OS, etc.
If you're wanting to find out exactly how much time a particular loop takes to execute, and it takes less than, say, 1 second to execute, you're going to need to artificially increase the number of iterations in order to get a number that is above the "noise floor". You can then take that number and divide it by the number of artificially inflated iterations to get a figure that represents how long one pass through your target loop will take.
If you're wanting to compare the performance of different loop styles or techniques, the same thing holds: you're going to need to increase the number of iterations or passes through your test code in order to get a measurement in which what you're interested in dominates the time slice you're measuring.
This is true whether you're measuring performance using sub-millisecond high performance counters provided by the CPU, the system date time clock, or a wall clock to measure the elapsed time of your test.
Otherwise, you're just measuring white noise.
Typically if you want to measure the time spent in the inner loop, you'll put the time get routines outside of the outer loop and then divide by the (outer) loop count. If you expect the time of the inner loop to be relatively constant for any j, that is.
Any profiling instructions incur their own overhead, but presumably the overhead will be the same regardless of where it's inserted so "it all comes out in the wash." Presumably you're looking for spots where there are considerable differences between the runtimes of two compared processes, where a pair of function calls like this won't be an issue (since you need one at the "end" too, to get the time delta) since one routine will be 2x or more costly over the other.
Most platforms offer some sort of higher resolution timer, too, although the one we use here is hidden behind an API so that the "client" code is cross-platform. I'm sure with a little looking you can turn it up. Although even here, there's little likelihood that you'll get better than 1ms accuracy, so it's preferable to run the code several times in a row and time the whole run (then divide by the loop count, natch).
I'm glad you're looking for percentage, because that's easy to get. Just get it running. If it runs quickly, put an outer loop around it so it takes a good long time. That won't affect the percentages. While it's running, get stackshots. You can do this with Ctrl-Break in gdb, or you can use pstack or lsstack. Just look to see what percentage of stackshots display the code you care about.
Suppose the loops take some fraction of time, like 0.2 (20%) and you take N=20 samples. Then the number of samples that should show them will average 20 * 0.2 = 4, and the standard deviation of the number of samples will be sqrt(20 * 0.2 * 0.8) = sqrt(3.2) = 1.8, so if you want more precision, take more samples. (I personally think precision is overrated.)

Resources