Threads A and B have concurrent access to a single variable. Each thread performs a sequence of accesses on the variable (reads and writes). Each thread stores the results from its reads into an array. The outcome of the session is defined by the two arrays.
The accesses performed by a given thread may not be re-ordered. However, accesses from the two threads may be interleaved, so the outcome will depend on this interleaving. How can we efficiently calculate the number of possible outcomes, given the two access sequences? Assume all writes produce distinct values.
Example access sequences:
Thread A: [write(71), read()]
Thread B: [read(), write(72), write(73), read()]
Example interleaving:
[a_write(71), b_read(), b_write(72), a_read(), b_write(73), b_read()]
Example outcome:
a_results = [72]
b_results = [71, 73]
P.S. This is not homework; it's just a problem I came up with myself.
This looks like something that could be solved with dynamic programming.
I would suggest looking for a way of solving the subproblem:
How many distinct outcomes are there given that we have done x accesses from thread 1, y accesses from thread 2, and the last access was a write that was done by thread z (either 1 or 2).
The DP array will be 3 dimensional: DP[x][y][z].
There will be a total of 2 * (number of accesses in thread 1) * (number of accesses in thread 2) slots to be calculated in the DP.
To populate each entry in the array you will need to sum several previous entries of the array so I suspect the overall complexity will be around O(n^3) where n is the number of accesses.
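Before working out the DP recurrence, it helps to have a brute-force reference for small inputs: enumerate every interleaving, simulate it, and collect the distinct outcomes in a set. This is a sketch with my own encoding (accesses as ('w', value) or ('r', None); a read before any write observes None) — exponential, but useful for validating a DP implementation.

```python
from itertools import combinations

def outcomes(seq_a, seq_b):
    """Enumerate every interleaving of the two access sequences and collect
    the distinct outcomes. Exponential, so only a reference for small inputs."""
    n, m = len(seq_a), len(seq_b)
    results = set()
    for a_slots in combinations(range(n + m), n):  # slots taken by thread A
        a_slots = set(a_slots)
        ia = ib = 0
        var = None              # current value of the shared variable
        res_a, res_b = [], []   # read results per thread
        for pos in range(n + m):
            if pos in a_slots:
                op, ia = seq_a[ia], ia + 1
                out = res_a
            else:
                op, ib = seq_b[ib], ib + 1
                out = res_b
            if op[0] == 'w':
                var = op[1]
            else:
                out.append(var)
        results.add((tuple(res_a), tuple(res_b)))
    return results

# The example from the question:
a = [('w', 71), ('r', None)]
b = [('r', None), ('w', 72), ('w', 73), ('r', None)]
print(len(outcomes(a, b)))  # number of distinct outcomes
```

The example outcome from the question (a_results = [72], b_results = [71, 73]) shows up as one member of the returned set.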
I'm very new to CUDA; I've read a few chapters from books and a lot of tutorials online. I have made my own implementations of vector addition and multiplication.
I would like to move a little further, so let's say we want to implement a function that takes as an input a sorted array of integers.
Our goal is to find the frequencies of each integer that is in the array.
Sequentially we could scan the array one time in order to produce the output. The time complexity would be O(n).
Since the groups of equal values are independent of each other, I guess it must be possible to take advantage of CUDA.
Suppose this is the array
1
1
1
1
2
2
3
3
5
5
6
7
In order to achieve full parallelism, each thread would have to know exactly which part of the array it has to scan in order to find its sum. This could only be achieved with another array, say int dataPosPerThread[], where dataPosPerThread[threadId] holds the starting position in the initial array for that thread. Each thread would then know where to start and where to finish.
However, this way we won't gain anything, because it would take O(n) time to find those positions. The total cost would then be O(n) + cost_to_transfer_the_data_to_the_gpu + O(c) + cost_to_transfer_the_results_from_the_gpu,
where O(c) is the constant time it would take for the threads to find the final output, assuming of course that we have many different integers inside the initial array.
I would like to avoid the extra O(n) cost.
What I've thought so far is: given an array of size arraySize, we specify the total number of threads to be used, say totalAmountOfThreads, which means that each thread will have to scan arraySize/totalAmountOfThreads values.
The first thread (id 0) would scan from position 0 up to position arraySize/totalAmountOfThreads - 1.
The second thread would start from arraySize/totalAmountOfThreads, and so on.
The problem, though, is that a thread might be working with several integer groups, or with one group whose values are also being processed by other threads. For instance, in the above example, if we suppose we have 6 threads, each thread will take 2 integers of the array, so we will have something like this:
1 <-------- thread 0
1
1 <-------- thread 1
1
2 <-------- thread 2
2
3 <-------- thread 3
3
5 <-------- thread 4
5
6 <-------- thread 5
7
As you can see, thread 0 only has 1 values, yet there are other 1 values being processed by thread 1. In order to achieve parallelism, though, these threads should be working on unrelated data. Assuming we use this logic, each thread will compute the following results:
thread 0 => {value=1, total=2}
thread 1 => {value=1, total=2}
thread 2 => {value=2, total=2}
thread 3 => {value=3, total=2}
thread 4 => {value=5, total=2}
thread 5 => {{value=6, total=1}, {value=7, total=1}}
By having this result, what can be further achieved? Someone could suggest using an extra hash map, like unordered_map, which could efficiently update the total for each value computed by a single thread. However:
unordered_map is not supported in CUDA device code;
the threads would not be able to take advantage of shared memory, because two threads from different blocks could be working with the same values, so the hash map would have to live in global memory;
even if the above two weren't a problem, we would still have race conditions between threads when updating the hash map.
What would be a good way in order to approach this problem?
Thank you in advance
As @tera has already pointed out, what you're describing is a histogram.
You may be interested in the thrust histogram sample code. If we refer to the dense_histogram() routine as an example, you'll note the first step is to sort the data.
So, yes, the fact that your data is sorted will save you a step.
In a nutshell we are:
sorting the data
marking the boundaries of different elements within the data
computing the distance between the boundaries.
As shown in the sample code, thrust can do each of the above steps in a single function. Since your data is sorted you can effectively skip the first step.
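Expressed sequentially in plain Python (rather than Thrust), the two remaining steps look like this; each comprehension below corresponds to a data-parallel primitive (a copy_if-style stream compaction for the boundaries, an adjacent-difference for the counts):

```python
def sorted_histogram(data):
    """Histogram of an already-sorted array: mark where each run of equal
    values starts, then the distance between consecutive marks is the count."""
    # Step 2: boundaries of the different elements (stream compaction).
    starts = [i for i in range(len(data)) if i == 0 or data[i] != data[i - 1]]
    starts.append(len(data))  # sentinel so the last run has an end
    # Step 3: distance between consecutive boundaries (adjacent difference).
    return {data[s]: e - s for s, e in zip(starts, starts[1:])}

data = [1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 6, 7]
print(sorted_histogram(data))
```

On the example array from the question this yields counts 4, 2, 2, 2, 1, 1 for the values 1, 2, 3, 5, 6, 7. Both steps are embarrassingly parallel, which is why the Thrust version maps so cleanly onto the GPU.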
My question briefly stated: is there an algorithm one can use to divide key/value pairs into roughly equal-length lists if one doesn't know a priori the number of values any key contains, and one can't hold all keys (or counts of their values) in RAM concurrently?
My question with context: I have multiple files that contain key/value pairs, where keys are hashes and values are lists of object ids in which the given hash occurs. The same key appears zero or one times in each of these files, and frequently a given key appears in many of the files.
I am reading those files into several workers running on a compute cluster. Each worker is assigned a subset of the keys. Each worker then reads all of the previously mentioned files, accumulates every value that occurs for each of its assigned keys, and writes a single output file to disk.
The trouble I'm facing is that the workers are accumulating wildly different numbers of values among their assigned keys, so their RAM requirements are quite different (from 33GB on the low end to 139GB on the high). Right now, to assign keys to workers, I take a sha1 hash of each key, and if sha1(key) % total_number_of_workers == worker_id (where worker id is a given worker's index position among all workers) then the worker is assigned the given key.
Is there a way to assign keys to workers that will help ensure a more equal distribution of RAM requirements among the nodes? Any advice others can offer on this question would be greatly appreciated!
In case it might be of interest to others, I put together a simple Python implementation [gist] of the k-way merge that Jim Mischel describes below. This implementation doesn't require holding all of the text files in memory concurrently, which may be impossible for large datasets.
It's a simple k-way merge. Let's say you have three files:
File 1    File 2    File 3
A=3       B=7       C=22
X=9       B=4       D=19
Q=33      Z=26      A=2
X=47      X=12      D=13
Now, you sort those files:
Sorted1   Sorted2   Sorted3
A=3       B=7       A=2
Q=33      B=4       C=22
X=9       X=12      D=19
X=47      Z=26      D=13
You could do a merge step and end up with a single file:
A=3
A=2
B=7
B=4
C=22
D=19
D=13
Q=33
X=9
X=47
X=12
Z=26
And then scan through that file, accumulating and writing values.
But you can do the merge and accumulation in a single step. After all, when you do the merge you're outputting things in sorted key order, so all you have to do is insert the accumulation code before the output step.
A single process starts up and creates a priority queue that contains the first item from each file. So the priority queue would contain [A=3, B=7, A=2]. The program takes the smallest key, A=3, from the priority queue, and the queue is refreshed with the next item from the first sorted file. The queue now contains [Q=33,B=7,A=2].
The program creates a new array with key A, containing the value [3]. Then it goes to the queue again and reads the smallest value: A=2. It sees that the key is equal to the one it's working on, so it updates the array to [3,2]. The queue is refreshed from the sorted file, so now it contains [Q=33,B=7,C=22].
Once again, the program gets the smallest key value from the queue. This time it's B. B is not equal to A, so the program outputs A,[3,2], replaces the current key with B, and replaces the accumulation array with [7].
This continues until there are no more items to be merged.
The code to handle refilling the priority queue is a bit fiddly, but not really difficult.
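In Python the whole merge-and-accumulate loop fits in a few lines, since heapq.merge handles the priority-queue refilling internally. A sketch using the toy data above (streams here are in-memory lists, but any iterator over a sorted file works the same way):

```python
import heapq

def merge_accumulate(sorted_streams):
    """Merge key-sorted streams of (key, value) pairs, grouping the values
    for equal keys during the merge -- no single merged file is materialised."""
    current_key, values = None, []
    for key, value in heapq.merge(*sorted_streams, key=lambda kv: kv[0]):
        if key != current_key:
            if values:
                yield current_key, values  # key changed: emit the finished group
            current_key, values = key, []
        values.append(value)
    if values:
        yield current_key, values          # emit the last group

sorted1 = [("A", 3), ("Q", 33), ("X", 9), ("X", 47)]
sorted2 = [("B", 7), ("B", 4), ("X", 12), ("Z", 26)]
sorted3 = [("A", 2), ("C", 22), ("D", 19), ("D", 13)]
for key, vals in merge_accumulate([sorted1, sorted2, sorted3]):
    print(key, vals)
```

This outputs A,[3,2] then B,[7,4] and so on, matching the walkthrough above: each group is complete the moment the merge moves past its key, so memory usage is bounded by the largest single group rather than the whole dataset.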
An alternative is to use your operating system's sort utility to sort and merge the files, and then write a simple loop that goes through the single sorted file linearly to accumulate the values.
The problem is: Suppose we have a group of Sets: Set(1,2,3) Set(1,2,3,4) Set(4,5,6) Set(1,2,3,4,6), we need to delete all the subsets and finally get the Result: Set(4,5,6) Set(1,2,3,4,6). (Since both Set(1,2,3) and Set(1,2,3,4) are the subsets of Set(1,2,3,4,6), both are removed.)
And suppose that the elements of the set have order, which can be Int, Char, etc.
Is it possible to do it in a map-reduce way?
The reason to do it in a map-reduce way is that sometimes the group of Sets has a very large size, which makes it not possible to do it in the memory of a single machine. So we hope to do it in a map-reduce way, it may be not very efficient, but just work.
My problem is:
I don't know how to define a key for the key-value pair in the map-reduce process to group Sets properly.
I don't know when the process should be finished, that all the subsets have been removed.
EDIT:
The size of the data will keep growing larger in the future.
The input can be either a group of sets or multiple lines with each line containing a group of sets. Currently the input is val data = RDD[Set], I firstly do data.collect(), which results in an overall group of sets. But I can modify the generation of the input into a RDD[Array[Set]], which will give me multiple lines with each line containing a group of sets.
The elements in each set can be sorted by modifying other parts of the program.
I doubt this can be done with a traditional map-reduce technique, which is essentially a divide-and-conquer method. This is because:
in this problem, each set essentially has to be compared against all sets of larger cardinality whose minimum and maximum elements lie around those of the smaller set;
unlike sorting and other problems amenable to map-reduce, there is no transitivity to exploit: if A is not a subset of B and B is not a subset of C, we cannot make any statement about A with respect to C.
Based on the above observations, this problem seems similar to duplicate detection, on which there is existing research, for example here. Similar techniques should work well for the current problem.
Since subset-of is a transitive relation (proof), you could take advantage of that and design an iterative algorithm that eliminates subsets in each iteration.
The logic is the following:
Mapper:
eliminate local subsets and emit only the supersets. Let the key be the first element of each superset.
Reducer:
eliminate local subsets and emit only the supersets.
You could also use a combiner with the same logic.
Each time, the number of reducers should decrease, until, in the last iteration, a single reducer is used. This way, you can define the number of iterations from the beginning: e.g., by starting with 8 reducers and halving each time, your program will terminate after 4 iterations (8 reducers, then 4, then 2, and then 1). In general, it will terminate in log n + 1 iterations (log base 2), where n is the initial number of reducers, so n should be a power of 2 and, of course, less than the number of mappers. If this feels restrictive, you can think of more drastic decreases in the number of reducers (e.g., decrease by 1/4, or more).
Regarding the choice of the key, this can create balancing issues, if, for example, most of the sets start with the same element. So, perhaps you could also make use of other keys, or define a partitioner to better balance the load. This policy makes sure, though, that sets that are equal will be eliminated as early as possible.
If you have MapReduce v.2, you could implement the aforementioned logic like that (pseudocode):
Mapper:
Set<Set> superSets;

setup() {
    superSets = new HashSet<>();
}

map(inputSet) {
    Set<Set> subsumed = new HashSet<>();
    for (Set superSet : superSets) {
        if (superSet.containsAll(inputSet)) {
            return;  // inputSet is a subset of a set seen before: emit nothing
        }
        if (inputSet.containsAll(superSet)) {
            subsumed.add(superSet);  // a seen set is a subset of inputSet
        }
    }
    superSets.removeAll(subsumed);   // evict every set the new one subsumes
    superSets.add(inputSet);
}

close() {
    for (Set superSet : superSets) {
        context.write(superSet.iterator().next(), superSet);
    }
}
You can use the same code in the reducer and in the combiner.
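The same mapper/reducer logic translates directly into Python with built-in set comparison; this sketch keeps only the maximal sets seen so far, evicting every set the newcomer subsumes:

```python
def eliminate_subsets(sets):
    """Keep only the maximal sets: a set that is a subset of (or equal to)
    an already-kept set is dropped, and kept sets subsumed by a newcomer
    are evicted."""
    maximal = []
    for s in sets:
        if any(s <= m for m in maximal):
            continue                                   # s is subsumed: drop it
        maximal = [m for m in maximal if not m <= s]   # evict sets s subsumes
        maximal.append(s)
    return maximal

groups = [{1, 2, 3}, {1, 2, 3, 4}, {4, 5, 6}, {1, 2, 3, 4, 6}]
print(eliminate_subsets(groups))  # [{4, 5, 6}, {1, 2, 3, 4, 6}]
```

On the example from the question this leaves exactly Set(4,5,6) and Set(1,2,3,4,6), as expected.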
As a final note, I doubt that MapReduce is the right environment for this kind of computations. Perhaps Apache Spark, or Apache Flink offer some better alternatives.
If I understand correctly:
your goal is to detect and remove subsets within a large collection of sets
there are too many sets to manage them all together (memory limit)
the strategy is map and reduce (or some sort of it)
What I take into account:
the main problem is that you cannot manage everything at the same time
the usual map/reduce method supposes you can split the data and treat each part separately; that cannot be done completely here, because each subset can intersect with each other subset.
If I do some calculations: suppose you have a large collection of 1,000,000 sets of 3 to 20 numbers drawn from 1 to 100. You would have to compare about 1,000 billion pairs of sets. Even with 100,000 sets (10 billion pairs) it takes too much time (I stopped the test).
What I propose (tested with 100,000 sets):
1. Define a criterion to split the collection into smaller compatible packages. Compatible packages are groups of sets chosen so that if set A is a subset of set B, then A and B are guaranteed to land together in at least one package; that way you are sure to find every subset to remove with this method.
I simply take every set which contains one given element (1, 2, 3, ...); with the previous assumptions that gives approximately 11,500 sets per package.
Comparing within a package becomes reasonable (about 120,000 comparisons). It takes 180 seconds on my machine and finds about 900 subsets to remove.
You have to do it 100 times (once per element), so 18,000 seconds.
Of course you will find duplicates (but not too many, a few percent, and the goal is to eliminate them anyway).
2. At the end it is easy and fast to agglomerate the results; the duplicated work is light.
3. Bigger filters:
with a filter on 2 common elements, each package shrinks to about 1,475 sets, yielding approximately 30 sets to delete in 2-3 seconds;
but you then have to do that 10,000 times.
The interest of this method:
selecting the sets by the criterion is linear and very simple; it is also hierarchical: split on one element, then on a second, and so on
it is stateless: you can filter millions of sets and only keep the good ones; the more data you have, the more filters you run, so the solution is scalable
if you want smaller packages, you can require 3, 4, ... common elements
this lets you spread the treatment among multiple machines (as many as you have); at the end, you have to reconcile all the data/deletions.
This solution does not save much total computation (you can do the math), but it suits the need of splitting the work.
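A small Python sketch of the filtering idea, assuming the universe of elements is known in advance (function names are mine): one package per element, each of which could be processed on a different machine; a set is deleted if any package finds a strict superset of it, and exact duplicates are collapsed up front.

```python
def eliminate_distributed(sets, universe):
    """Partition-by-element filtering: if A is a subset of B they share all
    of A's elements, so they meet in at least one package, and every subset
    is therefore detected by some package."""
    sets = list({frozenset(s) for s in sets})   # collapse exact duplicates
    doomed = set()
    for element in universe:                    # each iteration is independent:
        package = [s for s in sets if element in s]  # could run on its own machine
        for s in package:
            if any(s < t for t in package):
                doomed.add(s)                   # strict superset found locally
    return [set(s) for s in sets if s not in doomed]

groups = [{1, 2, 3}, {1, 2, 3, 4}, {4, 5, 6}, {1, 2, 3, 4, 6}]
survivors = eliminate_distributed(groups, universe=range(1, 7))
```

Each package's inner loop is still quadratic, but on the much smaller per-element package rather than the whole collection, which is exactly the trade-off described above.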
Hope it helps.
In order to compute the product of two matrices A and B (both of dimension n x m) in parallel, I have the following restrictions: the server sends each client a number of rows from matrix A and a number of rows from matrix B. This cannot be changed. Further, the clients may exchange information between each other so that the product can be computed, but they cannot ask the server to send any other data.
This should be done the most efficient possible, meaning by minimizing the number of messages sent between processes - considered as an expensive operation - and by doing the small calculations in parallel, as much as possible.
From what I have researched, practically the highest number of messages exchanged between the clients is n^2, in the case where each process broadcasts its rows to all the others. Now, the problem is that if I minimize the number of messages sent - this would be around log(n) for distributing the input data - then the computation ends up being done by only one process (or a few), so it is no longer done in parallel, which was the main idea of the problem.
What could be a more efficient algorithm, that would compute this product?
(I am using MPI, if it makes any difference).
To compute the matrix product C = A x B element-by-element you simply calculate C(i,j) = dot_product(A(i,:),B(:,j)). That is, the (i,j) element of C is the dot product of row i of A and column j of B.
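As plain code, that element-wise definition is just:

```python
def matmul(A, B):
    """C[i][j] = dot product of row i of A and column j of B."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "A needs as many columns as B has rows"
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

The point to notice for the decomposition question is that every C[i][j] needs a full row of A and a full column of B, which is why distributing rows of both matrices is awkward.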
If you insist on sending rows of A and rows of B around then you are going to have a tough time writing a parallel program whose performance exceeds a straightforward serial program. Rather, what you ought to do is send rows of A and columns of B to processors for computation of elements of C. If you are constrained to send rows of A and rows of B, then I suggest that you do that, but compute the product on the server. That is, ignore all the worker processors and just perform the calculation serially.
One alternative would be to compute partial dot-products on worker processors and to accumulate the partial results. This will require some tricky programming; it can be done but I will be very surprised if, at your first attempt, you can write a program which outperforms (in execution speed) a simple serial program.
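To make the partial-dot-product idea concrete, here is a plain-Python simulation (no MPI) of one way it could work; the ring scheme and all names are my own illustration, not a prescription. Each worker keeps its rows of A, and the blocks of B rotate around a ring, so every worker eventually sees all of B and completes its own rows of C.

```python
def ring_matmul(a_blocks, b_blocks):
    """a_blocks[w]: the full rows of A held by worker w.
    b_blocks[w]: a block of consecutive rows of B held by worker w.
    Runs p ring steps; in each step every worker multiplies its A rows
    against the B block it currently holds, then passes the block on."""
    p = len(a_blocks)
    sizes = [len(blk) for blk in b_blocks]
    offsets = [sum(sizes[:w]) for w in range(p)]      # global row index of each block
    n_cols = len(b_blocks[0][0])
    c_blocks = [[[0] * n_cols for _ in rows] for rows in a_blocks]
    held = list(range(p))                             # block currently at worker w
    for _ in range(p):
        for w in range(p):
            blk = held[w]
            for i, a_row in enumerate(a_blocks[w]):
                for k, b_row in enumerate(b_blocks[blk]):
                    aik = a_row[offsets[blk] + k]     # A entry matching this B row
                    for j in range(n_cols):
                        c_blocks[w][i][j] += aik * b_row[j]
        held = [held[(w + 1) % p] for w in range(p)]  # pass blocks around the ring
    return [row for rows in c_blocks for row in rows]

# Two workers, each holding one row of A and one row of B:
C = ring_matmul([[[1, 2]], [[3, 4]]], [[[5, 6]], [[7, 8]]])
```

In a real MPI program the ring rotation would be a point-to-point send/receive per step, so each worker sends p-1 messages in total rather than broadcasting to everyone.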
(Yes, there are other approaches to decomposing matrix-matrix products for parallel execution, but they are more complicated than the foregoing. If you want to investigate these then Matrix Computations is the place to start reading.)
You need also to think hard about your proposed measures of efficiency -- the most efficient message-passing program will be the one which passes no messages. If the cost of message-passing far outweighs the cost of computation then the no-message-passing implementation will be the most efficient by both measures. Generally though, measures of the efficiency of parallel programs are ratios of speedup to number of processors: so 8 times speedup on 8 processors is perfectly efficient (and usually impossible to achieve).
As stated yours is not a sensible problem. Either the problem-setter has mis-specified it, or you have mis-stated (or mis-understood) a correct specification.
Something's not right: if both matrices have dimensions n x m, then they cannot be multiplied together (unless n = m). For A*B, A has to have as many columns as B has rows. Are you sure the server isn't sending rows of B transposed? That would be equivalent to sending columns of B, in which case the solution is trivial.
Assuming all that checks out, and your clients do indeed get rows from A and B: probably the easiest solution would be for each client to send its rows of matrix B to client #0, who reassembles the original matrix B and then sends its columns back out to the other clients. Basically, client #0 would act as a server that actually knows how to decompose the data efficiently. This would be 2*(n-1) messages (not counting the ones used to reunite the product matrix), but considering that you already need n messages to distribute the A and B matrices between the clients, there's no significant performance loss (it's still O(n) messages).
The biggest bottleneck here is obviously the initial gathering and redistribution of the matrix B, which scales terribly, so if you have fairly small matrices and a lot of processes, you might just be better off calculating the product serially on the server.
I don't know if this is homework. But if it is not homework, then you should probably use a library. One idea is ScaLAPACK:
http://www.netlib.org/scalapack/scalapack_home.html
ScaLAPACK is written in Fortran, but you can call it from C++.
A software application that I'm working on needs to be able to assign tasks to a group of users based on how many tasks they presently have, where the users with the fewest tasks are the most likely to get the next task. However, the current task load should be treated as a weighting, rather than an absolute order definition. IOW, I need to implement a weighted, load-balancing algorithm.
Let's say there are five users, with the following number of tasks:
A: 4
B: 5
C: 0
D: 7
E: 9
I want to prioritize the users for the next task in the order CABDE, where C is most likely to get the assignment and E, the least likely. There are two important things to note here:
The number of users can vary from 2 to dozens.
The number of tasks assigned to each user can vary from 1 to hundreds.
For now, we can treat all tasks as equal, though I wouldn't mind including task difficulty as a variable I could use in the future - but this is purely icing on the cake.
The ideas I've come up with so far aren't very good in some situations. They might weight users too closely together if there are a large number of users, or they might fall flat if a user has no current tasks, or...
I've tried poking around the web, but haven't had much luck. Can anyone give me a quick summary of an algorithm that would work well? I don't need an actual implementation - I'll do that part - just a good description. Alternatively, is there a good, freely accessible web site?
Also, while I certainly appreciate quality, this need not be statistically perfect. So if you can think of a good but not great technique, I'm interested!
As you point out, this is a load-balancing problem. It's not really a scheduling problem, since you're not trying to minimise anything (total time, number of concurrent workers, etc.). There are no special constraints (job duration, time clashes, skill sets to match etc.) So really your problem boils down to selecting an appropriate weighting function.
You say there are some situations you want to avoid, like user weightings that are too close together. Can you provide more details? For example, what's wrong with making the chance of assignment just proportional to the current workload, normalised by the workload of the other workers? You can visualise this as a sequence of blocks of different lengths (the tasks), being packed into a set of bins (the workers), where you're trying to keep the total height of the bins as even as possible.
With more information, we could make specific recommendations of functions that could work for you.
Edit: example load-balancing functions
Based on your comments, here are some example of simple functions that can give you different balancing behaviour. A basic question is whether you want deterministic or probabilistic behaviour. I'll give a couple of examples of each.
To use the example in the question - there are 4 + 5 + 0 + 7 + 9 = 25 jobs currently assigned. You want to pick who gets job 26.
1) Simple task farm. For each job, always pick the worker with the least jobs currently pending. Fast workers get more to do, but everyone finishes at about the same time.
2) Guarantee fair workload. If workers work at different speeds, and you don't want some doing more than others, then track the number of completed + pending jobs for each worker. Assign the next job to keep this number evenly spread (fast workers get free breaks).
3) Basic linear normalisation. Pick a maximum number of jobs each worker can have. Each worker's workload is normalised to that number. For example, if the maximum number of jobs/worker is 15, then 50 more jobs can be added before you reach capacity. So for each worker the probability of being assigned the next job is
P(A) = (15 - 4)/50 = 0.22
P(B) = (15 - 5)/50 = 0.2
P(C) = (15 - 0)/50 = 0.3
P(D) = (15 - 7)/50 = 0.16
P(E) = (15 - 9)/50 = 0.12
If you don't want to use a specific maximum threshold, you could use the worker with the highest current number of pending jobs as the limit. In this case, that's worker E, so the probabilities would be
P(A) = (9 - 4)/20 = 0.25
P(B) = (9 - 5)/20 = 0.2
P(C) = (9 - 0)/20 = 0.45
P(D) = (9 - 7)/20 = 0.1
P(E) = (9 - 9)/20 = 0
Note that in this case, the normalisation ensures worker E can't be assigned any jobs - he's already at the limit. Also, just because C doesn't have anything to do doesn't mean he is guaranteed to be given a new job (it's just more likely).
You can easily implement the choice function by generating a random number r between 0 and 1 and comparing it to these cumulative boundaries: if r < 0.25, A gets the job; if 0.25 < r < 0.45, B gets the job; and so on.
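A minimal sketch of option 3 with the busiest-worker limit (function and variable names are mine):

```python
import random

def pick_worker(pending, limit=None):
    """Pick the worker for the next task with probability proportional to
    (limit - pending jobs). With the default limit -- the busiest worker's
    current load -- that worker gets weight 0 and is never chosen."""
    if limit is None:
        limit = max(pending.values())
    workers = list(pending)
    weights = [limit - pending[w] for w in workers]
    return random.choices(workers, weights=weights)[0]

pending = {"A": 4, "B": 5, "C": 0, "D": 7, "E": 9}
counts = {w: 0 for w in pending}
for _ in range(10_000):
    counts[pick_worker(pending)] += 1
# Expect roughly 25% A, 20% B, 45% C, 10% D, 0% E over many draws.
```

Passing an explicit limit (e.g. limit=15) reproduces the fixed-threshold variant instead; swapping the linear weight for something like log(limit - pending + 1) gives the non-linear normalisation discussed next.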
4) Non-linear normalisation. Using a log function (instead of the linear subtraction) to weight your numbers is an easy way to get a non-linear normalisation. You can use this to skew the probabilities, e.g. to make it much more likely that workers without many jobs are given more.
The point is, the number of ways of doing this are practically unlimited. What weighting function you use depends on the specific behaviour you're trying to enable. Hopefully that's given you some ideas which you can use as a starting point.