Merge sort for GPU - algorithm

I'm trying to implement a merge sort using an OpenCL wrapper.
The problem is, each pass needs a different indexing algorithm for threads' memory access.
Some info about this:
First pass(numbers indicate elements and arrows indicate sorting)
0<-->1 2<--->3 4<--->5 6<--->7
group0 group1 group2 group3 ===>1 thread per group, N/2 groups total
Second pass(all parallel)
0<------>2 4<------>6
1<------->3 5<------->7
group0 group1 ========> 2 threads per group, N/4 groups
Next pass
0<--------------->4 8<------------------>12
1<--------------->5 9<------------------->13
2<--------------->6 10<---------------->14
3<--------------->7 11<--------------->15
group0 group1 ===>4 threads per group
but N/8 groups
So, an element of one group must never be compared with an element of another group.
I cannot simply do
A[i]<---->A[i+1] or A[i]<---->A[i+4]
because these cover
A[1]<---->A[2] and A[4]<----->A[8]
which are wrong.
I need a more complex indexing scheme that can use the same number of threads for every pass.
Pass n: global id(i): 0,1, 2,3 4,5 , ... to compare id 0,1 4,5 8,9
looks like compare id1=(i/2)*4+i%2
Pass n+1: global id(i): 0,1,2,3, 4,5,6,7, ... to compare id 0,1,2,3, 8,9,10,11, 16,17
looks like compare id=(i/4)*8+i%4
Pass n+2: global id(i): 0,1,2,3,... 8,9,10,... to compare id 0,1,2,3,... 16,17,18,...
looks like compare id=(i/8)*16+i%8
compare id1 = ( i/( pow(2,passN) ) ) * pow(2,passN+1) + i%( pow(2, passN) )
compare id2 = compare id1 + pow(2,passN)
so, in the kernel string, can it be
int i = get_global_id(0);
int stride = 1 << passN; // 2^passN as an integer; pow() works on floating-point types
int compareId1 = (i / stride) * (2 * stride) + (i % stride);
int compareId2 = compareId1 + stride;
// this can happen only for the first pass
// needs a different kernel structure for that
but I'm not sure.
Question: Can you give any directions about which memory access pattern would not leak while satisfying the "no compare between different groups" condition?
I have already needed to hard-reset my computer many times while trying different algorithms (memory leaked, black screen, crash, restart). This one is the last, and I fear it can crash the entire OS.
I already tried a much simpler version with a decreasing number of threads per pass, had some bad performance.
Edit: I tried the code above. It sorts reverse-ordered arrays, but not randomized arrays.
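To check the access pattern on the CPU before letting it loose on the GPU, the index formula can be run in plain Python. This is only a minimal sketch: `compare_ids` and `run_pass` are hypothetical helper names, and passes are assumed to be numbered from 0 (so pass 0 compares neighbours).

```python
def compare_ids(i, pass_n):
    """Map global id i to the two element indices it compares in pass pass_n."""
    stride = 1 << pass_n  # 2**pass_n, kept as an integer
    id1 = (i // stride) * (2 * stride) + (i % stride)
    return id1, id1 + stride


def run_pass(data, pass_n):
    """Apply one compare-exchange per thread, with len(data) // 2 threads."""
    out = list(data)
    for i in range(len(out) // 2):
        a, b = compare_ids(i, pass_n)
        if out[a] > out[b]:
            out[a], out[b] = out[b], out[a]
    return out


if __name__ == "__main__":
    for p in range(3):
        print(p, [compare_ids(i, p) for i in range(4)])
```

With N/2 threads, every pass touches each index in 0..N-1 exactly once and both indices of a pair stay inside the same group, so the pattern itself does not read or write out of bounds as long as N is a multiple of 2^(passN+1). Note that one compare-exchange per pair is not a full merge of two sorted runs: `run_pass([1, 4, 2, 3], 1)` gives `[1, 3, 2, 4]`, which matches the observation in the Edit that reversed arrays come out sorted while random ones do not.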


Randomly selecting elements from slices produced by a map in restricted key range in golang. Is there an O(1) shortcut?

In my program to simulate many-particle evolution, I have a map that takes a key value pop (the population size) and returns a slice containing the sites that have this population: myMap[pop][]int. These slices are generically quite large.
At each evolution step I choose a random population size RandomPop. I would then like to randomly choose a site that has a population of at least RandomPop. The sitechosen is used to update my population structures and I utilize a second map to efficiently update myMap keys. My current (slow) implementation looks like
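func Evolve( ..., myMap map[int][]int, ...) {
    RandomPop := rand.Intn(rangeofpopulation) + 1
    for i := RandomPop; i < rangeofpopulation; i++ {
        // copy the sites with population i into preallocatedslice (the slow step)
        preallocatedslice = append(preallocatedslice, myMap[i]...)
    }
    randomindex := rand.Intn(len(preallocatedslice))
    sitechosen := preallocatedslice[randomindex]
    // reset preallocated slice
}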
This code (obviously) hits a huge bottleneck when copying values from the map to preallocatedslice, with runtime.memmove eating 87% of my CPU usage. I'm wondering if there is an O(1) way to randomly choose an entry contained in the union of the slices indicated by myMap with key values between RandomPop and rangeofpopulation? I am open to packages that allow you to manipulate custom hashtables if anyone is aware of them. Suggestions don't need to be safe for concurrency.
Other things tried: I previously had my maps record all sites with values of at least pop but that took up >10GB of memory and was stupid. I tried stashing pointers to the relevant slices to make a look-up slice, but go forbids this. I could sum up the lengths of each slice and generate a random number based on this and then iterate through the slices in myMap by length, but this is going to be much slower than just keeping an updated cdf of my population and doing a binary search on it. The binary search is fast, but updating the cdf, even if done manually, is O(n). I was really hoping to abuse hashtables to speed up random selection and update if possible
A vague thought I have is concocting some sort of nested structure of maps pointing to their contents and also to the map with a key one less than theirs or something.
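For the "updated cdf plus binary search" idea mentioned above, one structure that might help is a Fenwick (binary indexed) tree over the per-population counts. To be clear, this is a swapped-in technique rather than anything from the question, it is O(log n) rather than O(1), and the names `Fenwick` and `sample_site` are hypothetical. The sketch is in Python for brevity and assumes each slice length is mirrored into the tree whenever a site moves between populations.

```python
import random


class Fenwick:
    """Counts per population size with O(log n) update and prefix sum."""

    def __init__(self, size):
        self.n = size
        self.tree = [0] * (size + 1)

    def add(self, i, delta):
        """Add delta to the count stored at 0-based index i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix(self, i):
        """Sum of the counts at indices 0 .. i-1."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def find(self, target):
        """Return (index, offset) of the target-th item; needs target < total."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos, target


def sample_site(sites_by_pop, fen, min_pop, rng=random):
    """Pick uniformly among all sites whose population is at least min_pop."""
    below = fen.prefix(min_pop)        # sites with population < min_pop
    total = fen.prefix(fen.n) - below  # sites with population >= min_pop
    if total == 0:
        return None
    pop, offset = fen.find(below + rng.randrange(total))
    return sites_by_pop[pop][offset]


if __name__ == "__main__":
    sites_by_pop = {1: [10, 11], 3: [30, 31, 32]}
    fen = Fenwick(5)
    for pop, sites in sites_by_pop.items():
        fen.add(pop, len(sites))
    print(sample_site(sites_by_pop, fen, 2))
```

Nothing is copied here: moving a site from population p to p+1 would be one `add(p, -1)`, one `add(p + 1, 1)`, and a swap-with-last removal from the old slice.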
I was looking at your code and I have a question.
Why do you have to copy values from the map to the slice? I mean, I think that I am following the logic behind... but I wonder if there is a way to skip that step.
So we have:
func Evolve( ..., myMap map[int][]int, ...) {
    RandomPop := rand.Intn(rangeofpopulation) + 1
    for i := RandomPop; i < rangeofpopulation; i++ {
        // slice of preselected `sites`; one of these will be `siteChosen`
        // we expect to have `n sites` in `preAllocatedSlice`,
        // where `n` is the number of iterations,
        // i.e. n = rangeofpopulation - RandomPop
    }
    // Once we have a list of sites, we select `one`.
    // Under a uniform distribution every site has a chance of 1/n to be selected.
    randomindex := rand.Intn(len(preallocatedslice))
    sitechosen := preallocatedslice[randomindex]
}
But what if we change that to:
func Evolve( ..., myMap map[int][]int, ...) {
    if len(myMap) == 0 {
        // Nothing to do, print a log!
        return
    }
    // This variable will hold our site chosen!
    var siteChosen []int
    // Our random population size is a value from 1 to rangeOfPopulation
    randPopSize := rand.Intn(rangeOfPopulation) + 1
    for i := randPopSize; i < rangeOfPopulation; i++ {
        // We are going to pretend that the current candidate is the siteChosen
        siteChosen = myMap[i]
        // Now, instead of copying `myMap[i]` to preAllocatedSlice,
        // we will test if the current candidate is actually the `siteChosen` here.
        // We know that the chance for a specific site to be the chosen one is 1/n,
        // where n = rangeOfPopulation - randPopSize
        n := float64(rangeOfPopulation - randPopSize)
        // we roll the dice: this is true with probability 1/n
        isTheChosenOne := rand.Float64() < 1/n
        if isTheChosenOne {
            // If the candidate is the chosen site,
            // then we don't need to iterate over all the other elements.
            break
        }
    }
    // here we know that `siteChosen` is a.- a selected candidate, or
    // b.- the last element assigned in the loop
    // (in the case that `isTheChosenOne` was always false [which is a probable scenario])
}
Also, if you want, you can calculate n (or 1/n) outside the loop.
So the idea is testing inside the loop if the candidate is the siteChosen, and avoid copying the candidates to this preselection pool.

Input to different attributes values from a random.sample list

so this is what I'm trying to do, and I'm not sure how cause I'm new to python. I've searched for a few options and I'm not sure why this doesn't work.
So I have 6 different nodes in Maya, called aiSwitch. I need to generate different random numbers from 0 to 5 and input each value in an aiSwitch*.index.
In short the result should be
aiSwitch1.index = (random number from 0 to 5)
aiSwitch2.index = (another random number from 0 to 5 different than the one before)
And so on until aiSwitch6.index
I tried the following:
import maya.cmds as mc
import random
allswitch = mc.ls('aiSwitch*')
for i in allswitch:
    print i
    S = range(0,6)
    print S
    shuffle = random.sample(S, len(S))
    print shuffle
    for w in shuffle:
        print w
    mc.setAttr(i + '.index', w)
This is the result I get from the prints:
aiSwitch1 <-- from print i
[0,1,2,3,4,5] <--- from print S
[2,3,5,4,0,1] <--- from print Shuffle (random.sample results)
1 <--- from print w, every separated item in the random.sample list.
Now, this happens for every aiSwitch, cause it's in a loop of course. And the random numbers are always a different list cause it happens every time the loop runs.
So where is the problem then?
aiSwitch1.index = 1
And all the other aiSwitch*.index always take only the last item in the list by the time I get to do the setAttr. It seems that w is retaining the last value of the for loop. I don't quite understand how to:
Get a random value from 0 to 5
Input that value in aiSwitch1.index
Get another random value from 0 to 5, different from the one before
Input that value in aiSwitch2.index
Repeat until aiSwitch6.index.
I did get it to work with the following form:
allSwitch = mc.ls(type='aiSwitch')
for i in allSwitch:
    mc.setAttr(i + '.index', random.uniform(0,5))
This gave a random number from 0 to 5 to all aiSwitch*.index, but some of them repeat. I think this works cause the value is being generated every time the loop runs, hence setting the attribute with a random number. But the numbers repeat and I was trying to avoid that. I also tried a shuffle but failed to get any values from it.
My main mistake seems to be that I'm generating a list and sampling it, but I'm failing to assign every different item from that list to different aiSwitch*.index nodes. And I'm running out of ideas for this.
Any clues would be greatly appreciated.
Here is a somewhat Pythonic way: shuffle the list of indices, then iterate over it using zip (which is useful for iterating over structures in parallel, which is what you need to do here):
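import random
import maya.cmds as mc
index = list(range(6))
random.shuffle(index)  # shuffle in place so every node gets a different value
allSwitch = mc.ls('aiSwitch*')
for i, j in zip(allSwitch, index):
    mc.setAttr(i + '.index', j)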
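The same shuffle-then-zip idea can be tried outside Maya to see that no value repeats. This is a small sketch in which `assign_unique_indices` and the `set_attr` callback are hypothetical stand-ins; inside Maya you would presumably pass `mc.setAttr` as the callback.

```python
import random


def assign_unique_indices(node_names, set_attr, rng=random):
    """Give every node a different index drawn from 0 .. len(node_names) - 1."""
    indices = list(range(len(node_names)))
    rng.shuffle(indices)  # one shuffle, done once, outside the loop
    for name, idx in zip(node_names, indices):
        set_attr(name + '.index', idx)


if __name__ == "__main__":
    assigned = {}
    names = ['aiSwitch%d' % n for n in range(1, 7)]
    # A dict stands in for the scene: its __setitem__ records each assignment.
    assign_unique_indices(names, assigned.__setitem__)
    print(assigned)
```

The key difference from the original attempt is that the list is shuffled once and then consumed one item per node, instead of being regenerated inside the loop over nodes.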

Split test groups based on GUID

Users in the system are identified by GUID, and with a new feature, I want to divide users into two groups - test and control.
Is there an easy way to assign each user to one of the two groups with a 50/50 chance, based on their GUID?
e.g. If the nth character's ASCII code is odd -> test group, otherwise control group.
What about 70/30, or other ratio?
The reason I want to classify users based on GUID is that later I can easily tell which users are in which group and compare the performance between the two groups, without having to keep track of the group assignment - I simply need to calculate it again.
As Derek Li notes, the GUID's bits might be based on a timestamp, so you shouldn't use them directly.
The safest solution is to hash the GUID using a hash function like MurmurHash. This will produce a random number (but the same random number every time for any given GUID) which you can then use to do the split.
For example, you could do a 30/70 split like this:
function isInTestGroup(user) {
    // assumes murmurHash returns a non-negative integer
    var hash = murmurHash(user.guid);
    return (hash % 100) < 30;
}
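The same hash-then-bucket idea can be sketched with only the Python standard library. Note the substitution: MurmurHash is not in the standard library, so this uses SHA-256 from `hashlib` instead, which is slower but equally deterministic. The names `bucket` and `is_in_test_group` are hypothetical.

```python
import hashlib


def bucket(guid, buckets=100):
    """Map a GUID string to a stable bucket number in [0, buckets)."""
    normalized = guid.strip().lower()  # so 'ABC...' and 'abc...' agree
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def is_in_test_group(guid, test_percent=30):
    """True for roughly test_percent percent of all GUIDs."""
    return bucket(guid) < test_percent


if __name__ == "__main__":
    print(is_in_test_group("3f2504e0-4f89-11d3-9a0c-0305e82c3301"))
```

Because the bucket depends only on the GUID, the assignment can be recomputed later without storing it, and any other ratio is just a different `test_percent`.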
If some character in the GUID has a 1 in 16 chance of being each of the following characters: "0123456789ABCDEF", then perhaps you could test a scheme that determines placement by that character.
Say the last character of the guid called c has a 1/16 chance of being any hex digit:
for 50/50 distribution -> c <= 7 for group 1, c > 7 for group 2
for roughly 70/30 -> c <= A for group 1, c > A for group 2 (11 of the 16 digits, about 69%)

Reducer that groups by two values

I have a case in which Mapper emits data that belongs to a subgroup and the subgroup belongs to a group.
I need to add up all the values in the subgroup and find the minimal value between all subgroups of the group, for each of the groups.
So, I have an output from Mapper that looks like this
Group 1
Group 2
And my output should be
Group1, 1, (2+3+4)
Group1, 2, (1+2)
Group1, 3, (1+2+5)
Group1 min = min((2+3+4),(1+2),(1+2+5))
Same for Group 2.
So I practically need to group twice, first group by GROUP and then inside of it group by SUBGROUPID.
So I should emit the minimal sum from a group, in the given example my reducer should emit (2,3), since the minimal sum is 3 and it comes from element with id 2.
So, it seems that it could be solved best using reduce twice, first reduce would get elements grouped by id and that would be passed to the second Reducer grouped by Group id.
Does this make sense and how to implement it? I've seen ChainedMapper and ChainedReducer, but they don't fit for this purpose.
If all data can fit in the memory of one machine, you can simply do all this in a single job, using a single reducer (job.setNumReduceTasks(1);) and two temp variables. The output is emitted in the cleanup phase of the reducer. Here is the pseudocode for that, if you use the new Hadoop API (that supports the cleanup() method):
int tempKey;
int tempMin;
setup() {
    tempMin = Integer.MAX_VALUE;
}
reduce(key, values) {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next();
    }
    if (sum < tempMin) {
        tempMin = sum;
        tempKey = key;
    }
}
cleanup() { //only in the new API
    emit(tempKey, tempMin);
}
Your approach (summarized below), is how I would do it.
Job 1:
Mapper: Assigns an id to a subgroupid
Combiner/Reducer(same class): Finds the minimum value for
Job 2:
Mapper: Assigns a groupid to a subgroupid.
Combiner/Reducer(same class): Finds the minimum value for
This is best implemented in two jobs for the following reasons:
Simplifies the mapper and reducer significantly (you don't need to worry about finding all the groupids the first time around). Finding the (groupid, subgroupid) pairs in the mapper could be non-trivial. Writing the two mappers should be trivial.
Follows the map reduce programming guidelines given by Tom White in Hadoop: The Definitive Guide (Chapter 6).
An Oozie workflow can easily and simply accommodate the dependent jobs.
The intermediate file products (key:subgroupid, value: min value for subgroupid) should be small, limiting the use of network resources.
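The two-stage grouping discussed in this thread can be illustrated in memory, outside Hadoop. This is only a sketch of the data flow, not of a real job: `min_subgroup_per_group` is a hypothetical name, and the records are assumed to be (group, subgroup id, value) tuples built from the Group1 numbers in the question.

```python
from collections import defaultdict


def min_subgroup_per_group(records):
    """records: iterable of (group, subgroup_id, value), as a mapper might emit."""
    # Stage 1 (first reduce): sum the values of each (group, subgroup) key.
    sums = defaultdict(int)
    for group, subgroup, value in records:
        sums[(group, subgroup)] += value
    # Stage 2 (second reduce): keep the smallest subgroup sum within each group.
    best = {}
    for (group, subgroup), total in sums.items():
        if group not in best or total < best[group][1]:
            best[group] = (subgroup, total)
    return best


if __name__ == "__main__":
    records = [
        ("Group1", 1, 2), ("Group1", 1, 3), ("Group1", 1, 4),
        ("Group1", 2, 1), ("Group1", 2, 2),
        ("Group1", 3, 1), ("Group1", 3, 2), ("Group1", 3, 5),
    ]
    print(min_subgroup_per_group(records))
```

For the Group1 data this yields subgroup 2 with sum 3, the (2,3) pair the question expects. In a real pipeline each stage would correspond to one reduce keyed on (group, subgroup) and then on group.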

Algorithm to create unique random concatenation of items

I'm thinking about an algorithm that will create X most unique concatenations of Y parts, where each part can be one of several items. For example 3 parts:
part #1: 0,1,2
part #2: a,b,c
part #3: x,y,z
And the (random, one case of some possibilities) result of 5 concatenations:
0bz (note that '0by' would be "less unique " than '0bz' because 'by' already was)
2ay (note that 'a' hasn't come after '2' yet, and 'y' hasn't come after 'a' yet)
Simple BAD result for the next concatenation:
1cy ('c' hasn't come after '1' and 'y' hasn't come after 'c', BUT '1'-'y' has already appeared as a first-last pair)
Simple GOOD next result would be:
0cy ('c' hasn't come after '0', 'y' hasn't come after 'c', and '0'-'y' hasn't appeared as a first-last pair)
I know that this approach limits the possible results, but once all fully unique possibilities are gone, the algorithm should continue and try to keep the most available uniqueness (repeating as little as possible).
Consider real example:
And I want results like:
Boy get milk
Martin stole bottle
Girl bought water
Boy bought bottle (not water, because of 'bought+water' and not milk, because of 'Boy+milk')
Maybe start with a tree of all combinations, but how to select most unique trees first?
Edit: According to this sample data, we can see that creating fully unique results for 4 words * 3 possibilities gives us only 3 results:
Martin stole a bottle
Boy bought an milk
He get hard water
But more results can be requested. So, the 4th result should have the most available uniqueness, like Martin bought hard milk, not Martin stole a water
Edit: Some start for a solution ?
Imagine each part as a barrel which can be rotated: the last item becomes the first when rotating down, and the first becomes the last when rotating up. Now, set the barrels like this:
Martin|stole |a |bottle
Boy |bought|an |milk
He |get |hard|water
Now, write the sentences as we see them, then rotate the first barrel UP once, the second twice, the third three times and so on. We get these sentences (note that the third barrel did one full rotation):
Boy |get |a |milk
He |stole |an |water
Martin|bought|hard|bottle
And we get the next solutions. We can repeat the process once more to get more solutions:
He |bought|a |water
Martin|get |an |bottle
Boy |stole |hard|milk
The problem is that the first barrel stays connected with the last, because they rotate in parallel.
I'm wondering if it would be more unique if I rotated the last barrel one more time in the last solution (but then I introduce other connections like an-water, though these would be repeated only 2 times, not 3 times like now). I don't know whether "barrels" are a good way of thinking here.
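The barrel rotation described above is easy to try out in code. A minimal sketch, assuming barrel k (counting from 1) is rotated up k positions per round; `rotate_barrels` is a hypothetical name.

```python
from collections import deque


def rotate_barrels(barrels, rounds):
    """Yield the sentences read across the barrels after each rotation round."""
    barrels = [deque(b) for b in barrels]
    for _ in range(rounds):
        for k, barrel in enumerate(barrels, start=1):
            barrel.rotate(-k)  # "up" k times: the first item moves to the end
        yield [" ".join(row) for row in zip(*barrels)]


if __name__ == "__main__":
    barrels = [
        ["Martin", "Boy", "He"],
        ["stole", "bought", "get"],
        ["a", "an", "hard"],
        ["bottle", "milk", "water"],
    ]
    for sentences in rotate_barrels(barrels, 2):
        print(sentences)
```

With 3 items per barrel, the first and fourth barrels both end up shifted by 1 position per round (4 mod 3 = 1), which is exactly the "first connected with last" problem noted above. A shift that differs for every barrel modulo the barrel size would be needed to avoid it.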
I think that we should first find a definition of uniqueness.
For example, what causes uniqueness to drop? Using a word that was already used? Is repeating 2 words next to each other less unique than repeating a word with a gap of other words in between? So, this problem can be subjective.
But I think that over a lot of sequences, each word should be used a similar number of times (like selecting a word randomly and removing it from a set, and after all words have been drawn, refreshing the options so they can be drawn again) - this is easy to do.
But even if each word is used a similar number of times, we should do something to avoid repeating connections between words. I think that repeating words far from each other is more unique than repeating them next to each other.
Anytime you need a new concatenation, just generate a completely random one, calculate its fitness, and then either accept that concatenation or reject it (probabilistically, that is).
const C = 1.0
function CreateGoodConcatenation()
    for (rejectionCount = 0; ; rejectionCount++)
        candidate = CreateRandomConcatenation()
        fitness = CalculateFitness(candidate) // returns 0 < fitness <= 1
        r = GetRand(zero to one)
        adjusted_r = Math.pow(r, C * rejectionCount + 1) // bias toward acceptability as rejectionCount increases
        if (adjusted_r < fitness)
            return candidate
CalculateFitness should never return zero. If it does, you might find yourself in an infinite loop.
As you increase C, less ideal concatenations are accepted more readily.
As you decrease C, you face more iterations for each call to CreateGoodConcatenation (plus less entropy in the result).
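Here is a runnable sketch of that accept/reject loop in Python. The fitness function is purely an assumption made for illustration: it counts how often each pair of (position, item) choices in the candidate has been used before, which covers adjacent pairs as well as the first-last pair from the question. The names `pairs`, `fitness` and `create_good_concatenation` are hypothetical.

```python
import itertools
import random
from collections import Counter


def pairs(candidate):
    """All pairs of (position, item) in the candidate, adjacent or not."""
    return list(itertools.combinations(enumerate(candidate), 2))


def fitness(candidate, seen):
    """Assumed score in (0, 1]: 1.0 when no pair was used before, never 0."""
    repeats = sum(seen[p] for p in pairs(candidate))
    return 1.0 / (1.0 + repeats)


def create_good_concatenation(parts, seen, c=1.0, rng=random):
    """Draw random candidates until one passes the probabilistic fitness test."""
    rejection_count = 0
    while True:
        candidate = tuple(rng.choice(options) for options in parts)
        r = rng.random()
        # A larger exponent shrinks r, so acceptance gets easier after rejections.
        adjusted_r = r ** (c * rejection_count + 1)
        if adjusted_r < fitness(candidate, seen):
            seen.update(pairs(candidate))  # remember the pairs we just used
            return candidate
        rejection_count += 1


if __name__ == "__main__":
    parts = [["0", "1", "2"], ["a", "b", "c"], ["x", "y", "z"]]
    seen = Counter()
    for _ in range(5):
        print("".join(create_good_concatenation(parts, seen)))
```

Because this fitness never reaches zero, the loop always terminates eventually. Swapping in a different definition of uniqueness only means replacing `fitness`.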
