Parallel Quicksort in Python - algorithm

I would like to implement the Parallel Quicksort in Python.
I know Quicksort: you choose a pivot and partition. But how do I spawn the recursive calls as independent tasks in Python?
Here is the pseudocode for it:
QS(A[1:n])
    if n <= 1 then return A
    pivot <- any value from A (random)
    L <- A[A[:] < pivot]
    M <- A[A[:] = pivot]
    R <- A[A[:] > pivot]
    A(L) <- spawn QS(L)
    A(R) <- QS(R)
    sync
    return A(L) ++ M ++ A(R)

You can do it, but it's unlikely to speed up your code. You can use ThreadPoolExecutor to run a function in a worker thread and get its result back. Here is a simple illustration with a function that sums an array:
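from concurrent.futures import ThreadPoolExecutor
MAX_DEPTH = 3  # submit tasks only on the top levels of the recursion
# A task waiting on a queued child keeps its thread busy, so every possible task gets its own thread.
pool = ThreadPoolExecutor(max_workers=2**MAX_DEPTH)
def add(arr, depth=0):
    if len(arr) < 2:
        return sum(arr)  # cheating a little
    mid = len(arr) // 2
    if depth >= MAX_DEPTH:  # deep levels: plain recursion, no new task
        return add(arr[:mid], depth + 1) + add(arr[mid:], depth + 1)
    f = pool.submit(add, arr[:mid], depth + 1)  # runs in a worker thread
    y = add(arr[mid:], depth + 1)  # runs in the current thread
    return y + f.result()  # wait for the worker's result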
submit() takes the function itself as its first argument, followed by that function's arguments. So for your code it'll be something like f = pool.submit(QS, L).
Please remember, though, that Python threads give you concurrency but not parallelism: in standard CPython, the global interpreter lock lets only one thread execute Python bytecode at a time, so the code above will not use more than one core. You can use ProcessPoolExecutor for process parallelism, which Python supports well, but the overhead of moving data between processes will probably eat up any speed-up you gain.
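Applied to the pseudocode in the question, the same submit()/result() pattern might look like the sketch below. It is only an illustration: parallel_quicksort and SPAWN_DEPTH are made-up names, elements equal to the pivot are kept in a separate middle list so they are not lost, and tasks are spawned only on the top levels of the recursion, because a task that blocks in result() holds on to its thread and a bounded pool could otherwise run out of threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor

SPAWN_DEPTH = 3  # hypothetical cut-off: spawn tasks only on the top 3 levels
# At most 1 + 2 + 4 = 7 tasks are ever submitted, so 8 threads are enough.
pool = ThreadPoolExecutor(max_workers=2 ** SPAWN_DEPTH)


def parallel_quicksort(a, depth=0):
    """Return a sorted copy of `a`, sorting the left part in a worker thread."""
    if len(a) <= 1:
        return list(a)
    pivot = random.choice(a)
    left = [x for x in a if x < pivot]
    middle = [x for x in a if x == pivot]  # includes the pivot itself
    right = [x for x in a if x > pivot]
    if depth < SPAWN_DEPTH:
        future = pool.submit(parallel_quicksort, left, depth + 1)  # "spawn"
        sorted_right = parallel_quicksort(right, depth + 1)
        sorted_left = future.result()  # "sync"
    else:
        # Below the cut-off, recurse in the current thread only.
        sorted_left = parallel_quicksort(left, depth + 1)
        sorted_right = parallel_quicksort(right, depth + 1)
    return sorted_left + middle + sorted_right
```

Calling parallel_quicksort([3, 1, 2, 3, 0]) returns a new sorted list. Because of the interpreter lock, don't expect this to beat the built-in sorted(); it only shows how the spawn and sync steps map onto futures.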

Related

In Haskell, is it correct to measure performance using timestamps obtained at the beginning and at the end of a function execution?

I want to measure the performance of a Haskell function. This function is executed concurrently.
Is it correct to measure its performance using the timestamps that the getCurrentTime function returns? Does laziness affect the measurement?
I want to save these times in a log. I have looked at some logging libraries, but the time they return is not as precise as the timestamp that getCurrentTime returns. I use the XES format for my log.
The code I use is something like this (I did not compile it):
import Control.Concurrent (forkIO)
import Control.Monad (when)
import Data.Time.Clock
measuredFunction :: Int -> IO (Int, UTCTime, UTCTime)
measuredFunction x = do
  time' <- getCurrentTime
  -- performs some IO action that returns x'
  time'' <- getCurrentTime
  return (x', time', time'')
runTest :: Int -> Int -> IO ()
runTest init end = do
  when (init <= end) (do
    forkIO (do
      (x', time', time'') <- measuredFunction 1
      -- saves time' and time'' in a log
      )
    runTest (init+1) end )
It depends on the function. Some values have all their information immediately, whereas others can have expensive stuff going on "beyond the top layer". Here's a contrived example:
example :: (Int, Int)
example = (1+1, head [ x | x <- [1..], x == 10^6 ])
If you load this up in ghci, you will see "(2," printed first and then, after some delay, the remainder of the value, "1000000)". If you get a value like this, then the function will "return" "before" the expensive sub-value has been computed. But you can use deepseq to ensure that a value is computed all the way and doesn't have any sub-computations left.
Benchmarking is subtle, and there are a lot of ways to do it wrong (especially in Haskell). Fortunately we have a very good benchmarking library called criterion (tutorial), which I definitely recommend you use if you are trying to get reliable results.

Scala: branching statement optimisation using callback function degrading performance

It is common in C/C++ programming to use function pointers to optimize branching in the main data path. So I wrote a test program to find out if similar performance savings can be gotten in Scala using functional programming techniques. The use case is a function which is invoked millions of times and has a branching statement based on a global flag. The code using the if() statement:
val b = true
def test() = {
  if(b) // do something
  else // do something else
}
for(i <- 0 to 100000) test()
And trying to get rid of the if() I did this -
def onTrue() = { /* do something */ }
def onFalse() = { /* do something else */ }
lazy val callback: () => Unit = if(b) onTrue else onFalse
def test() = callback()
for(i <- 0 to 100000) test()
I did a comparison of both these programs by running them for large counters (in the for loop) and running them many times and using the System.nanoTime() differential to measure the time taken.
And my tests seem to suggest that the callback method is actually SLOWER than using if() in the loop. The reason for this could be that a function call requires the params and returns to be pushed on the stack and a new stack frame created etc. Given this finding, I wanted to know:
1) Is there a functional way to code this in Scala that beats the performance of the if() in the loop?
2) @inline works with the compiler. Is there a runtime equivalent to avoid the stack activity (similar to tail call optimization)?
3) Could my test or results be inaccurate/erroneous in some way?
3) It's very easy to get your methodology wrong when testing this way. Use something like JMH if you want quasi-trustable microbenchmarks!
2) The JVM does inlining at runtime.
1) You aren't measuring a difference in whether something is "functional". You're measuring the difference between using a lazy val and not. If you don't have the lazy val in there, the JVM will probably be able to optimize your code (depending on what "do something" is).
If you remove the lazy val, the second one optimizes to the same speed in my hands. (A lazy val has an extra mandatory check on every access, to see whether it has been initialized and to keep initialization safe across threads.)

julia @parallel for loop does not update array

I am new to julia and to get started I wanted to port some numpy code to julia and hoped to get some nice performance increase. So far not to my satisfaction.
This is the function I want to compute
function s(x_list, r_list)
    result_list = zeros(size(x_list,1))
    for i = 1:size(x_list,1)
        dotprods = r_list * x_list[i,:]'
        expcall = exp(im * dotprods)
        sumprod = sum(expcall) * sum(conj(expcall))
        result_list[i] = sumprod
    end
    return result_list
end
with data input that looks like
v = rand(3)
r = rand(6000,3)
x = linspace(1.0, 2.0, 300) * (v./sqrt(sumabs2(v)))'
For this function and the given input, @time s(x,r) gives me
0.110619 seconds (3.60 k allocations: 96.256 MB, 8.47% gc time)
For this case, numpy does the same job in ~70ms, so I'm not very happy! Now if I do a @parallel for loop with julia -p 2:
function s(x_list, r_list)
    result_list = SharedArray(Float64, size(x_list,1))
    @parallel for i = 1:size(x_list,1)
        dotprods = r_list * x_list[i,:]'
        expcall = exp(im * dotprods)
        sumprod = sum(expcall) * sum(conj(expcall))
        result_list[i] = sumprod
    end
    return result_list
end
the problem is that
result_list[i] = sumprod
doesn't get updated and I get the list of zeros returned from the array initialization. What am I doing wrong here?
Further attempts to increase speed also did not show any benefit, e.g.
@vectorize_2arg Array{Float64,2} s
and declaring types
function s{T<:Float64}(x_list::Array{T,2}, r_list::Array{T,2})
But now, starting the same @parallel for loop in a session with just one process (no -p 2, just julia), the array does get updated and @time s(x,r) tells me
0.000040 seconds (36 allocations: 4.047 KB)
which is actually impossible for the function and input given! Is this a bug?
Any help is very appreciated!
Julia's @parallel macro does a distributed for loop: it copies all the data to other processes and does computations on each of them, reducing over the results and returning that result. The processes do not share memory – and may even be on other machines altogether. Your original data is never touched because each worker is modifying its own copy of that data. You may be thinking of threads, which is a currently-experimental feature that Julia will be adding in the future.
One problem is that you're not waiting for the @parallel call to complete. From the docs:
...the reduction operator can be omitted if it is not needed. In that case, the loop executes asynchronously, i.e. it spawns independent tasks on all available workers and returns an array of Future immediately without waiting for completion. The caller can wait for the Future completions at a later point by calling fetch() on them, or wait for completion at the end of the loop by prefixing it with @sync, like @sync @parallel for.
Try prefixing the for loop with @sync.

Julia: Parallel for loop over partitions iterator

So I'm trying to iterate over the list of partitions of something, say 1:n for some n between 13 and 21. The code that I ideally want to run looks something like this:
valid_num = @parallel (+) for p in partitions(1:n)
    int(is_valid(p))
end
println(valid_num)
This would use the @parallel for to map-reduce my problem. For example, compare this to the example in the Julia documentation:
nheads = @parallel (+) for i=1:200000000
    Int(rand(Bool))
end
However, if I try my adaptation of the loop, I get the following error:
ERROR: `getindex` has no method matching getindex(::SetPartitions{UnitRange{Int64}}, ::Int64)
in anonymous at no file:1433
in anonymous at multi.jl:1279
in run_work_thunk at multi.jl:621
in run_work_thunk at multi.jl:630
in anonymous at task.jl:6
which I think is because I am trying to iterate over something that is not of the form 1:n (EDIT: I think it's because you cannot call p[3] if p=partitions(1:n)).
I've tried using pmap to solve this, but because the number of partitions can get really big, really quickly (there are more than 2.5 million partitions of 1:13, and when I get to 1:21 things will be huge), constructing such a large array becomes an issue. I left it running overnight and it still didn't finish.
Does anyone have any advice for how I can efficiently do this in Julia? I have access to a ~30 core computer and my task seems easily parallelizable, so I would be really grateful if anyone knows a good way to do this in Julia.
Thank you so much!
The code below gives 511, the number of partitions of size 2 of a set of 10.
using Iterators
s = [1,2,3,4,5,6,7,8,9,10]
is_valid(p) = length(p)==2
valid_num = @parallel (+) for i = 1:30
    sum(map(is_valid, takenth(chain(1:29,drop(partitions(s), i-1)), 30)))
end
This solution combines the takenth, drop, and chain iterators to get the same effect as the take_every iterator below under PREVIOUS ANSWER. Note that in this solution, every process must compute every partition. However, because each process uses a different argument to drop, no two processes will ever call is_valid on the same partition.
Unless you want to do a lot of math to figure out how to actually skip partitions, there is no way to avoid computing partitions sequentially on at least one process. I think Simon's answer does this on one process and distributes the partitions. Mine asks each worker process to compute the partitions itself, which means the computation is being duplicated. However, it is being duplicated in parallel, which (if you actually have 30 processors) will not cost you time.
Here is a resource on how iterators over partitions are actually computed: http://www.informatik.uni-ulm.de/ni/Lehre/WS03/DMM/Software/partitions.pdf.
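The striding idea itself (every worker enumerates all partitions but only tests every k-th one) is not tied to Julia. As a rough illustration, here is the same scheme in Python, where itertools.islice plays the role of drop plus takenth. The set_partitions generator, count_valid and N_WORKERS are names made up for this sketch, and the workers are simulated with a plain loop rather than real processes.

```python
from itertools import islice


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        # ... or give it a block of its own.
        yield [[first]] + smaller


def is_valid(p):
    return len(p) == 2  # same toy predicate as in the answer


def count_valid(worker, n_workers, items):
    """Count valid partitions at positions worker, worker + n_workers, ..."""
    mine = islice(set_partitions(items), worker, None, n_workers)
    return sum(1 for p in mine if is_valid(p))


N_WORKERS = 4  # hypothetical worker count
s = range(1, 11)
# The calls are independent, so each one could run in its own process.
counts = [count_valid(w, N_WORKERS, s) for w in range(N_WORKERS)]
total = sum(counts)
print(total)  # 511, matching the Julia result above
```

As in the Julia version, each worker still pays for generating every partition; only the calls to is_valid are divided up.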
PREVIOUS ANSWER (More complicated than necessary)
I noticed Simon's answer while writing mine. Our solutions seem similar to me, except mine uses iterators to avoid storing partitions in memory. I'm not sure which would actually be faster for what size sets, but I figure it's good to have both options. Assuming it takes you significantly longer to compute is_valid than to compute the partitions themselves, you can do something like this:
s = [1,2,3,4]
is_valid(p) = length(p)==2
valid_num = @parallel (+) for i = 1:30
    foldl((x,y)->(x + int(is_valid(y))), 0, take_every(partitions(s), i-1, 30))
end
which gives me 7, the number of partitions of size 2 for a set of 4. The take_every function returns an iterator that returns every 30th partition starting with the ith. Here is the code for that:
import Base: start, done, next
immutable TakeEvery{Itr}
    itr::Itr
    start::Any
    value::Any
    flag::Bool
    skip::Int64
end
function take_every(itr, offset, skip)
    value, state = Nothing, start(itr)
    for i = 1:(offset+1)
        if done(itr, state)
            return TakeEvery(itr, state, value, false, skip)
        end
        value, state = next(itr, state)
    end
    if done(itr, state)
        TakeEvery(itr, state, value, true, skip)
    else
        TakeEvery(itr, state, value, false, skip)
    end
end
function start{Itr}(itr::TakeEvery{Itr})
    itr.value, itr.start, itr.flag
end
function next{Itr}(itr::TakeEvery{Itr}, state)
    value, state_, flag = state
    for i=1:itr.skip
        if done(itr.itr, state_)
            return state[1], (value, state_, false)
        end
        value, state_ = next(itr.itr, state_)
    end
    if done(itr.itr, state_)
        state[1], (value, state_, !flag)
    else
        state[1], (value, state_, false)
    end
end
function done{Itr}(itr::TakeEvery{Itr}, state)
    done(itr.itr, state[2]) && !state[3]
end
One approach would be to divide the problem up into pieces that are not too big to realize and then process the items within each piece in parallel, e.g. as follows:
function my_take(iter,state,n)
    i = n
    arr = Array[]
    while !done(iter,state) && (i>0)
        a,state = next(iter,state)
        push!(arr,a)
        i = i-1
    end
    return arr, state
end
function get_part(npart,npar)
    valid_num = 0
    p = partitions(1:npart)
    s = start(p)
    while !done(p,s)
        arr,s = my_take(p,s,npar)
        valid_num += @parallel (+) for a in arr
            length(a)
        end
    end
    return valid_num
end
valid_num = @time get_part(10,30)
I was going to use the take() method to realize up to npar items from the iterator, but take() appears to be deprecated, so I've included my own implementation, which I've called my_take(). The get_part() function therefore uses my_take() to obtain up to npar partitions at a time and carry out a calculation on them. In this case, the calculation just adds up their lengths, because I don't have the code for the OP's is_valid() function. get_part() then returns the result.
Because the length() calculation isn't very time-consuming, this code is actually slower when run on parallel processors than it is on a single processor:
$ julia -p 1 parpart.jl
elapsed time: 10.708567515 seconds (373025568 bytes allocated, 6.79% gc time)
$ julia -p 2 parpart.jl
elapsed time: 15.70633439 seconds (548394872 bytes allocated, 9.14% gc time)
Alternatively, pmap() could be used on each piece of the problem instead of the parallel for loop.
With respect to the memory issue, realizing 30 items from partitions(1:10) took nearly 1 gigabyte of memory on my PC when I ran Julia with 4 worker processes so I expect realizing even a small subset of partitions(1:21) will require a great deal of memory. It may be desirable to estimate how much memory would be needed to see if it would be at all possible before trying such a computation.
With respect to the computation time, note that:
julia> length(partitions(1:10))
115975
julia> length(partitions(1:21))
474869816156751
... so even efficient parallel processing on 30 cores might not be enough to make the larger problem solvable in a reasonable time.
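The two counts above are Bell numbers (the number of ways to partition an n-element set), so the size of the problem can be checked without enumerating anything. A small Python sketch using the Bell triangle; the function name bell is just a label chosen for this example:

```python
def bell(n):
    """Return the n-th Bell number, computed with the Bell triangle."""
    row = [1]  # row 0 of the triangle
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry to its left and the one above-left.
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
    return row[0]


print(bell(10))  # 115975
print(bell(21))  # 474869816156751
```

For a sense of scale, assume (purely hypothetically) that each core could check one million partitions per second: 30 cores would then need about 474869816156751 / (30 * 10^6) seconds, which is roughly half a year.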

Populating a list in Scala with random doubles taking forever

I am new to Scala and am trying to get a list of random double values:
The thing is, when I try to run this, it takes way too long compared to its Java counterpart. Any ideas on why this is or a suggestion on a more efficient approach?
def random: Double = java.lang.Math.random()
var f = List(0.0)
for (i <- 1 to 200000)
  ( f = f ::: List(random*100))
f = f.tail
You can also achieve it like this:
List.fill(200000)(math.random)
the same goes for e.g. Array ...
Array.fill(200000)(math.random)
etc ...
You could construct an infinite stream of random doubles:
def randomList(): Stream[Double] = Stream.cons(math.random, randomList)
val f = randomList().take(200000)
This will leverage lazy evaluation so you won't calculate a value until you actually need it. Even evaluating all 200,000 will be fast though. As an added bonus, f no longer needs to be a var.
Another possibility is:
val it = Iterator.continually(math.random)
it.take(200000).toList
Stream also has a continually method if you prefer.
First of all, it is not taking longer than Java because there is no Java counterpart. Java does not have an immutable list. If it did, performance would be about the same.
Second, it's taking a lot of time because appending to a List takes linear time, so the whole loop takes quadratic time: 200,000 appends copy on the order of 2 × 10^10 elements in total.
Instead of appending, prepend, which takes constant time.
If you're using mutable state anyway, you should use a mutable collection like a buffer, which you can add to with += (which would then be the real counterpart to the Java code).
But why don't you use a for comprehension?
val f = for (_ <- 1 to 200000) yield (math.random * 100)
By the way: var f = List(0.0) ... f = f.tail can be replaced by var f: List[Double] = Nil in your example (no more performance, but more beauty ;).
Yet more options! Tail recursion:
def randlist(n: Int, part: List[Double] = Nil): List[Double] = {
  if (n<=0) part
  else randlist(n-1, 100*random :: part)
}
or mapped ranges:
(1 to 200000).map(_ => 100*random).toList
Looks like you want to use Vector instead of List. List has O(1) prepend, while Vector has effectively constant-time append. Since you are appending (by way of concatenation), it'll be faster to use Vector:
def random: Double = java.lang.Math.random()
var f: Vector[Double] = Vector()
for (i <- 1 to 200000)
  f = f :+ (random*100)
Got it?
