Find Distinct Element Counts from Array in CUDA [duplicate] - parallel-processing

I'm very new to CUDA; I've read a few chapters from books and a lot of tutorials online. I have made my own implementations of vector addition and multiplication.
I would like to move a little further, so let's say we want to implement a function that takes as input a sorted array of integers.
Our goal is to find the frequencies of each integer that is in the array.
Sequentially we could scan the array one time in order to produce the output. The time complexity would be O(n).
Since the groups are different, I guess it must be possible to take advantage of CUDA.
Suppose this is the array
1
1
1
1
2
2
3
3
5
5
6
7
In order to achieve full parallelism, each thread would have to know exactly which part of the array it has to scan in order to find its count. This can only be achieved if we use another array, int dataPosPerThread[], where dataPosPerThread[threadId] holds the starting position in the initial array for that thread id. That way, each thread would know where to start and where to finish.
However, this way we won't gain anything, because it would take us O(n) time to find those positions. Eventually the total cost would be O(n) + cost_to_transfer_the_data_to_the_gpu + O(c) + cost_to_transfer_the_results_from_the_gpu,
where O(c) is the constant time it would take the threads to find the final output, assuming of course that we have many different integers inside the initial array.
I would like to avoid the extra O(n) cost.
What I've thought of so far is this: given an array of size arraySize, we specify the total number of threads to use, say totalAmountOfThreads, which means that each thread will have to scan arraySize/totalAmountOfThreads values.
The first thread (id 0) would scan from position 0 up to position arraySize/totalAmountOfThreads.
The second thread would start from arraySize/totalAmountOfThreads + 1, and so on.
The problem, though, is that some threads may end up working with parts of different integer groups, or with a group whose values are also being processed by other threads. For instance, in the above example, if we suppose we have 6 threads, each thread will take 2 integers of the array, so we will have something like this:
1 <-------- thread 0
1
1 <-------- thread 1
1
2 <-------- thread 2
2
3 <-------- thread 3
3
5 <-------- thread 4
5
6 <-------- thread 5
7
As you can see, thread 0 only has 1 values, but there are other 1 values being processed by thread 1. In order to achieve parallelism though, these threads have to be working on unrelated data. Assuming we use this logic, each thread will compute the following results:
thread 0 => {value=1, total=2}
thread 1 => {value=1, total=2}
thread 2 => {value=2, total=2}
thread 3 => {value=3, total=2}
thread 4 => {value=5, total=2}
thread 5 => {{value=6, total=1}, {value=7, total=1}}
By having this result, what can be further achieved? Someone could suggest using an extra hash map, like std::unordered_map, to efficiently accumulate the total for each value computed by the individual threads. However:
std::unordered_map is not supported in CUDA device code.
The threads would not be able to take advantage of shared memory, because two threads from different blocks could be working on the same values, so the hash map would have to live in global memory.
Even if the above two weren't a problem, we would still have race conditions between threads updating the hash map.
What would be a good way in order to approach this problem?
Thank you in advance

As @tera has already pointed out, what you're describing is a histogram.
You may be interested in the thrust histogram sample code. If we refer to the dense_histogram() routine as an example, you'll note the first step is to sort the data.
So, yes, the fact that your data is sorted will save you a step.
In a nutshell we are:
sorting the data
marking the boundaries of different elements within the data
computing the distance between the boundaries.
As shown in the sample code, thrust can do each of the above steps in a single function. Since your data is sorted you can effectively skip the first step.
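For reference, here is a minimal sketch of what the remaining two steps can look like with Thrust when the input is already sorted. This is my own illustration in the spirit of the sample code, not the sample itself; the function name dense_histogram_sorted and the assumption that values lie in [0, num_bins) are mine. A vectorized upper_bound gives the cumulative count of elements per bin, and adjacent_difference turns those into per-bin counts.

#include <thrust/device_vector.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Histogram of already-sorted data with values in [0, num_bins).
thrust::device_vector<int> dense_histogram_sorted(
    const thrust::device_vector<int>& sorted_data, int num_bins)
{
    thrust::device_vector<int> histogram(num_bins);

    // histogram[i] = number of elements <= i (cumulative counts),
    // found with a vectorized binary search over the sorted data.
    thrust::upper_bound(sorted_data.begin(), sorted_data.end(),
                        thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>(num_bins),
                        histogram.begin());

    // Per-bin counts = differences of consecutive cumulative counts.
    thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                histogram.begin());
    return histogram;
}

For your example input {1,1,1,1,2,2,3,3,5,5,6,7} with num_bins = 8, this yields {0,4,2,2,0,2,1,1}, i.e., four 1s, two 2s, two 3s, two 5s, one 6 and one 7.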

Related

Combinatorics: variable accessed by two threads

Threads A and B have concurrent access to a single variable. Each thread performs a sequence of accesses on the variable (reads and writes). Each thread stores the results from its reads into an array. The outcome of the session is defined by the two arrays.
The accesses performed by a given thread may not be re-ordered. However, accesses from the two threads may be interleaved, so the outcome will depend on this interleaving. How can we efficiently calculate the number of possible outcomes, given the two access sequences? Assume all writes produce distinct values.
Example access sequences:
Thread A: [write(71), read()]
Thread B: [read(), write(72), write(73), read()]
Example interleaving:
[a_write(71), b_read(), b_write(72), a_read(), b_write(73), b_read()]
Example outcome:
a_results = [72]
b_results = [71, 73]
P.s. This is not homework, it's just a problem I conceived myself.
This looks like something that could be solved with dynamic programming.
I would suggest looking for a way of solving the subproblem:
How many distinct outcomes are there given that we have done x accesses from thread 1, y accesses from thread 2, and the last access was a write that was done by thread z (either 1 or 2).
The DP array will be 3 dimensional: DP[x][y][z].
There will be a total of 2 * (number of accesses in thread 1) * (number of accesses in thread 2) slots to be calculated in the DP.
To populate each entry in the array you will need to sum several previous entries of the array so I suspect the overall complexity will be around O(n^3) where n is the number of accesses.
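If you want to sanity-check such a DP, a small brute-force reference that enumerates every interleaving and collects the distinct (a_results, b_results) pairs is easy to write; it is exponential, but fine for tiny inputs like the example above. This is only a sketch of the problem statement to validate against, not the DP itself; the Access struct, the -1 sentinel for "read before any write", and the function names are my own assumptions (C++17).

#include <cstdio>
#include <optional>
#include <set>
#include <utility>
#include <vector>

// An access is either a write of a distinct value or a read.
struct Access { bool isWrite; int value; };   // value is only used for writes

using Results = std::pair<std::vector<int>, std::vector<int>>;

// Enumerate all interleavings of the two access sequences and record the
// (a_results, b_results) pair each interleaving produces.
void enumerate(const std::vector<Access>& A, const std::vector<Access>& B,
               size_t i, size_t j, std::optional<int> shared,
               std::vector<int> ra, std::vector<int> rb,
               std::set<Results>& outcomes)
{
    if (i == A.size() && j == B.size()) { outcomes.insert({ra, rb}); return; }
    if (i < A.size()) {                            // next access comes from thread A
        auto s = shared; auto r = ra;
        if (A[i].isWrite) s = A[i].value; else r.push_back(s.value_or(-1));
        enumerate(A, B, i + 1, j, s, r, rb, outcomes);
    }
    if (j < B.size()) {                            // next access comes from thread B
        auto s = shared; auto r = rb;
        if (B[j].isWrite) s = B[j].value; else r.push_back(s.value_or(-1));
        enumerate(A, B, i, j + 1, s, ra, r, outcomes);
    }
}

int main() {
    // Example from the question: A = [write(71), read()], B = [read(), write(72), write(73), read()]
    std::vector<Access> A = {{true, 71}, {false, 0}};
    std::vector<Access> B = {{false, 0}, {true, 72}, {true, 73}, {false, 0}};
    std::set<Results> outcomes;
    enumerate(A, B, 0, 0, std::nullopt, {}, {}, outcomes);
    std::printf("distinct outcomes: %zu\n", outcomes.size());
}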

algorithm: is there a map-reduce way to merge a group of sets by deleting all the subsets

The problem is: suppose we have a group of sets: Set(1,2,3), Set(1,2,3,4), Set(4,5,6), Set(1,2,3,4,6). We need to delete all the subsets and finally get the result: Set(4,5,6), Set(1,2,3,4,6). (Since both Set(1,2,3) and Set(1,2,3,4) are subsets of Set(1,2,3,4,6), both are removed.)
And suppose that the elements of the set have order, which can be Int, Char, etc.
Is it possible to do it in a map-reduce way?
The reason to do it in a map-reduce way is that the group of sets can be very large, which makes it impossible to process in the memory of a single machine. So we hope to do it in a map-reduce way; it may not be very efficient, but it just needs to work.
My problem is:
I don't know how to define a key for the key-value pairs in the map-reduce process so that the sets are grouped properly.
I don't know how to tell when the process is finished, i.e., when all the subsets have been removed.
EDIT:
The size of the data will keep growing larger in the future.
The input can be either a group of sets or multiple lines, with each line containing a group of sets. Currently the input is val data = RDD[Set]; I first do data.collect(), which gives me the overall group of sets. But I can modify how the input is generated so that it becomes an RDD[Array[Set]], which gives me multiple lines, each containing a group of sets.
The elements in each set can be sorted by modifying other parts of the program.
I doubt this can be done by a traditional map-reduce technique which is essentially a divide-and-conquer method. This is because:
in this problem, each set essentially has to be compared against all sets of larger cardinality whose min and max elements bracket the min and max of the smaller set;
unlike sorting and other problems amenable to map-reduce, we don't have a transitive relation here, i.e., if A is not a subset of B and B is not a subset of C, we cannot make any statement about A with respect to C.
Based on the above observations this problem seems to be similar to duplicate detection and there is research on duplicate detection, for example here. Similar techniques will work well for the current problem.
Since subset-of is a transitive relation (proof), you could take advantage of that and design an iterative algorithm that eliminates subsets in each iteration.
The logic is the following:
Mapper:
eliminate local subsets and emit only the supersets. Let the key be the first element of each superset.
Reducer:
eliminate local subsets and emit only the supersets.
You could also use a combiner with the same logic.
Each time, the number of reducers should decrease until, in the last iteration, a single reducer is used. This way, you can fix the number of iterations from the beginning. E.g., by starting with 8 reducers and using half of them in each subsequent iteration, your program will terminate after 4 iterations (8 reducers, then 4, then 2, then 1). In general, it will terminate in log n + 1 iterations (base-2 log), where n is the initial number of reducers, so n should be a power of 2 and of course less than the number of mappers. If this feels restrictive, you can use more drastic decreases in the number of reducers (e.g., keep only a quarter of them each time, or fewer).
Regarding the choice of the key, this can create balancing issues, if, for example, most of the sets start with the same element. So, perhaps you could also make use of other keys, or define a partitioner to better balance the load. This policy makes sure, though, that sets that are equal will be eliminated as early as possible.
If you have MapReduce v.2, you could implement the aforementioned logic like this (pseudocode):
Mapper:
Set<Set<Integer>> superSets;

setup(Context context) {
    superSets = new HashSet<>();
}

map(Object key, Set<Integer> inputSet, Context context) {
    Set<Set<Integer>> subsumed = new HashSet<>();
    for (Set<Integer> superSet : superSets) {
        if (superSet.containsAll(inputSet)) {
            return;                      // inputSet is a subset of something we kept; drop it
        }
        if (inputSet.containsAll(superSet)) {
            subsumed.add(superSet);      // superSet is now redundant
        }
    }
    superSets.removeAll(subsumed);
    superSets.add(inputSet);
}

cleanup(Context context) {
    for (Set<Integer> superSet : superSets) {
        context.write(superSet.iterator().next(), superSet);   // key = first element
    }
}
You can use the same code in the reducer and in the combiner.
As a final note, I doubt that MapReduce is the right environment for this kind of computation. Perhaps Apache Spark or Apache Flink offer better alternatives.
If I understand correctly:
your goal is to detect and remove subsets within a large collection of sets
there are too many sets to manage all at once (memory limit)
the strategy is map and reduce (or something of that sort)
What I take into account:
the main problem is that you cannot manage everything at the same time
the usual map/reduce method splits the data and treats each part independently; that does not quite work here
(because any set can intersect with any other set).
Some rough calculations:
suppose you have a large collection: 1,000,000 sets of 3 to 20 numbers drawn from 1 to 100.
you would have to compare on the order of 10^12 pairs of sets.
Even with 100,000 sets (about 10^10 pairs), it takes far too long (I stopped the test).
What I propose (tested with 100,000 sets):
1 Define a criterion to split the collection into smaller compatible packages. Compatible packages are groups of sets chosen so that any subset/superset pair lands together in at least one package: that way you are sure to find every subset to remove. Said differently: if set A is a subset of set B, then A and B will both appear in at least one common package.
I simply take: every set that contains one given element (1, 2, 3, ...) => with the assumptions above this gives roughly 11,500 sets per package.
That becomes reasonable to compare (120,000 comparisons).
It takes 180 seconds on my machine and found 900 subsets to remove.
You have to do it 100 times (so about 18,000 seconds).
And of course you will find duplicates across packages (but not too many: a few percent, and the goal is to eliminate them).
2 At the end it is easy and fast to merge the results. The duplicated work is light.
3 Bigger filters:
with a filter on 2 elements, each package shrinks to about 1,475 sets => you find roughly 30 sets to delete, and it takes 2-3 seconds
but you have to do that 10,000 times.
Advantages of this method:
selecting the sets by the criterion is linear and very simple. It is also hierarchical:
split on one element, then on a second, etc.
it is stateless: you can filter millions of sets and you only have to keep the good ones. The more data you have,
the more filters you have to run => the solution is scalable.
if you want to work with smaller packages, you can require 3, 4, ... elements in common.
This way, you can spread the work across multiple machines (as many as you have).
At the end, you have to reconcile all your data/deletions.
This solution doesn't save a lot of time overall (you can do the arithmetic), but it fits the need of splitting the work.
Hope it helps.
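To make the package idea concrete, here is a rough single-machine sketch in C++ (my own illustration, not the poster's code) that builds one package per element value and removes subsets inside each package; it assumes the elements range over 1..maxElement and that, of two equal sets, only the one with the smaller index is kept. Any subset/superset pair shares at least one element, so it lands in at least one common package and is guaranteed to be found.

#include <algorithm>
#include <set>
#include <vector>

std::vector<std::set<int>> removeSubsets(const std::vector<std::set<int>>& sets,
                                         int maxElement)
{
    std::vector<bool> dead(sets.size(), false);

    for (int e = 1; e <= maxElement; ++e) {
        // Build the package: indices of all sets containing element e.
        std::vector<size_t> pkg;
        for (size_t i = 0; i < sets.size(); ++i)
            if (sets[i].count(e)) pkg.push_back(i);

        // Quadratic subset check, but only inside the (much smaller) package.
        for (size_t a : pkg) {
            if (dead[a]) continue;
            for (size_t b : pkg) {
                if (a == b || dead[b]) continue;
                if (sets[a].size() <= sets[b].size() &&
                    std::includes(sets[b].begin(), sets[b].end(),
                                  sets[a].begin(), sets[a].end())) {
                    // Equal sets: keep only the copy with the smaller index.
                    if (sets[a].size() == sets[b].size() && a < b) continue;
                    dead[a] = true;
                    break;
                }
            }
        }
    }

    std::vector<std::set<int>> kept;
    for (size_t i = 0; i < sets.size(); ++i)
        if (!dead[i]) kept.push_back(sets[i]);
    return kept;
}

In a distributed setting, each package (or group of packages) could be filtered on a different machine, and the surviving sets merged and deduplicated at the end, as described above.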

Store data relationships on a file for very fast read?

I am tasked with finding a solution to the following real-world problem, and I am really puzzled about how to solve it.
We have 100 million numbers and 1 billion arrays of numbers (each array can hold up to 1,000 unique numbers).
We pick 1,000 random numbers. We are trying to find the IDs of the arrays containing more than one of our 1,000 numbers. If there are more than 10,000 such arrays, we need only the first 10,000.
In a file, for each number we store the IDs of the arrays that number appears in. We can solve the problem by reading all the array IDs for every number and processing them. But those IDs are 8 bytes each, so we need to read 8 * 1 billion = 8 GB of data per number if our number appears in every array. In the worst-case scenario we need to read 8 GB * 1,000 = 8 TB from the HDD. This takes days, not 1 second.
Question: How can I do this in 1 second (or a few seconds) instead of days?
Hint: It seems my problem is similar to the problems search engines face. I have no experience in that field, but someone who has could be really helpful here.

Greedy Algorithm Optimization

Consider a DVR recorder that has the duty to record television programs.
Each program has a starting time and ending time.
The DVR has the following restrictions:
It may only record up to two items at once.
If it chooses to record an item, it must record it from start to end.
Given the number of television programs and their starting/ending times, what is the maximum number of programs the DVR can record?
For example: Consider 6 programs:
They are written in the form a b c, where a is the program number, b is the starting time, and c is the ending time:
1 0 3
2 6 7
3 3 10
4 1 5
5 2 8
6 1 9
The optimal way to record is to have programs 1 and 3 recorded back to back on one tuner, and programs 4 and 2 recorded one after the other on the second tuner; 4 and 2 record alongside 1 and 3.
This means the maximum number of programs is 4.
What is an efficient algorithm to find the max number of programs that can be recorded?
This is a classic example for a greedy algorithm.
You create an array with a (start, end) tuple for each program in the input.
Now you sort this array by the end times and go from left to right. If a tuner is free by the program's start time (at most one of the two is busy at that moment), you record the program, increment the result counter, and remember the new end time; if both tuners are busy, you can't record it and discard it.
This way you get the maximum number of programs that can be recorded in O(n log n) time.
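A minimal C++ sketch of that greedy with two tuners (my own illustration, not code from the answer): sort by end time and give each program to the tuner that frees up latest while still being free by the program's start (best fit). It assumes non-negative start times and that back-to-back recording (end == next start) is allowed, as in the example.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

int maxRecordable(std::vector<std::pair<int,int>> programs)   // {start, end}
{
    // Greedy: consider programs in order of increasing end time.
    std::sort(programs.begin(), programs.end(),
              [](const auto& a, const auto& b) { return a.second < b.second; });

    int freeAt[2] = {0, 0};   // time at which each tuner becomes free (starts assumed >= 0)
    int count = 0;
    for (const auto& [start, end] : programs) {
        // Pick the tuner that is free by 'start' and frees up latest (best fit).
        int best = -1;
        for (int t = 0; t < 2; ++t)
            if (freeAt[t] <= start && (best == -1 || freeAt[t] > freeAt[best]))
                best = t;
        if (best != -1) { freeAt[best] = end; ++count; }
    }
    return count;
}

int main() {
    // The 6 programs from the question, as {start, end} pairs.
    std::vector<std::pair<int,int>> programs =
        {{0,3}, {6,7}, {3,10}, {1,5}, {2,8}, {1,9}};
    std::printf("%d\n", maxRecordable(programs));   // prints 4
}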

How to implement a prioritised collection

I need to implement a prioritised collection.
Assuming we have the following three values in the collection (and their priorities):
thisIsUrgent = Priority.High
thisIsImportant = Priority.Medium
thisIsBoring = Priority.Low
I want to use MoveNext() to go through the collection to get another value.
Assuming I'm looping ten times, each time printing the value from MoveNext(), the desired output is:
thisIsUrgent
thisIsImportant
thisIsUrgent
thisIsUrgent
thisIsBoring
thisIsImportant
thisIsUrgent
thisIsUrgent
thisIsImportant
thisIsBoring
So basically, I get five high-priority values, three medium and two low.
Any ideas?
The simplest approach is to have three collections behind one interface. When MoveNext() is called, check the one with the highest priority; if there are messages, return them until that queue is empty, then move on to the lower priorities. You can then improve the algorithm for picking the next queue, for example by making it probabilistic.
In your particular case you should use probabilistic scheduling.
Urgent has 5/10 = 0.5
Medium has 3/10 = 0.3
Low has 2/10 = 0.2
At each turn generate a random number in the range [0, 1]. If the value falls into [0, 0.5], pick from the Urgent queue; if into (0.5, 0.8], pick Medium; if into (0.8, 1], pick Low.
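A rough C++ sketch of that probabilistic scheme (the class name PrioritisedCollection, the fixed 0.5/0.3/0.2 weights and the fall-back to the next non-empty queue are my own assumptions):

#include <deque>
#include <iostream>
#include <random>
#include <string>

class PrioritisedCollection {
public:
    // priority: 0 = High, 1 = Medium, 2 = Low
    void add(int priority, std::string value) { queues_[priority].push_back(std::move(value)); }

    // Draw a random number to decide which queue to serve, then fall back to
    // the next non-empty queue if the chosen one is empty.
    bool moveNext(std::string& out) {
        double r = dist_(rng_);
        int pick = (r < 0.5) ? 0 : (r < 0.8) ? 1 : 2;
        for (int i = 0; i < 3; ++i) {
            auto& q = queues_[(pick + i) % 3];
            if (!q.empty()) { out = q.front(); q.pop_front(); return true; }
        }
        return false;   // all queues are empty
    }

private:
    std::deque<std::string> queues_[3];
    std::mt19937 rng_{std::random_device{}()};
    std::uniform_real_distribution<double> dist_{0.0, 1.0};
};

int main() {
    PrioritisedCollection c;
    c.add(0, "thisIsUrgent");
    c.add(1, "thisIsImportant");
    c.add(2, "thisIsBoring");
    std::string v;
    while (c.moveNext(v)) std::cout << v << "\n";
}

Note that this variant consumes items (it returns them until the queues empty, as in the first suggestion); if values should keep reappearing, as in the desired output above, push each item back after returning it instead of popping it.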
What you need is:
http://en.wikipedia.org/wiki/Priority_queue
Use 3 collections, one for each priority as Andrey says.
Then, when you want to get the next task, pick a random number between 1 and 9.
Retrieve the next task from the relevant collection as follows:
1 to 5: High priority
6 to 8: Normal priority
9 : Low priority
Perhaps you can clarify your question a little better; it's not immediately obvious why you need your output to be in that particular order. Is the data in one collection, or are there multiple collections?
But if you're looking for a data structure that implements priorities, I suggest going with the tried-and-true Priority Queue.
