Efficient all-pairs set intersection on GPU - algorithm

I have n sets, subsets of a finite universe. I want to calculate the n*n matrix in which the (I, J) entry contains the cardinality of the intersection of set I and set J. n is on the order of 50000.
My idea is to split the matrix into blocks small enough to have one thread per entry. Every thread should calculate the intersection using a bitwise AND.
Are there more efficient approaches to solve this problem?

I'm going to assume you want to compute it as you described: actually computing the intersection of every pair of sets, using bitwise and of bitsets.
With the right mathematical setup, you are literally computing the outer product of two vectors, so I will think in terms of high performance linear algebra.
The key to performance is going to be reducing memory traffic, and that means holding things in registers when you can. The overwhelmingly most significant factor is that your elements are huge; it takes 6250 32-bit words to store a single set! An entire multiprocessor of CUDA compute capability 3.0, for example, can only hold a mere 10 sets in registers.
What you probably want to do is to spread each element out across an entire thread block. With 896 threads in a block and 7 registers per thread, you can store one set of 200704 elements. With CUDA compute capability 3.0, you will have 36 registers available per thread.
The simplest implementation would be to have each block own one row of the output matrix. It loads the corresponding element of the second vector and stores it in registers, then iterates over all of the elements of the first vector, computing the intersection, computing and reducing the popcount, and storing the result in the corresponding entry of the output matrix.
This optimization should reduce the overall number of memory reads by a factor of 2, and thus is likely to double performance.
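For concreteness, here is a minimal CUDA sketch of that layout. It is my own illustration, not code from the question: it assumes each set is stored as wordsPerSet = 7 * blockDim.x consecutive 32-bit words and that the kernel is launched with 896-thread blocks, one block per output row.

#define WORDS_PER_THREAD 7   // 896 threads * 7 words * 32 bits = one 200704-element set

__global__ void pairwiseIntersect(const unsigned* sets,  // n sets, wordsPerSet words each
                                  int n, int wordsPerSet,
                                  int* out)               // n x n intersection counts
{
    int i = blockIdx.x;                                   // this block owns row i
    unsigned mine[WORDS_PER_THREAD];                      // my slice of set i, kept in registers
    for (int k = 0; k < WORDS_PER_THREAD; ++k)
        mine[k] = sets[(size_t)i * wordsPerSet + k * blockDim.x + threadIdx.x];

    __shared__ int partial[896 / 32];                     // one slot per warp
    for (int j = 0; j < n; ++j) {
        int count = 0;
        for (int k = 0; k < WORDS_PER_THREAD; ++k) {
            unsigned other = sets[(size_t)j * wordsPerSet + k * blockDim.x + threadIdx.x];
            count += __popc(mine[k] & other);             // bitwise AND + popcount
        }
        for (int off = 16; off > 0; off >>= 1)            // reduce within each warp
            count += __shfl_down_sync(0xffffffff, count, off);
        if ((threadIdx.x & 31) == 0) partial[threadIdx.x >> 5] = count;
        __syncthreads();
        if (threadIdx.x == 0) {                           // combine the per-warp sums
            int total = 0;
            for (int w = 0; w < blockDim.x / 32; ++w) total += partial[w];
            out[(size_t)i * n + j] = total;
        }
        __syncthreads();
    }
}

Set i is read once per block instead of once per output entry, which is where the factor-of-2 saving comes from.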
Better would be to have each block own 3-4 rows of the output matrix at once and load the corresponding 3-4 elements of the second vector into registers. The block then iterates over all of the elements of the first vector, and for each it computes the 3-4 intersections it can, storing the results in the output matrix.
This optimization reduces the memory traffic by an additional factor of 3-4.

A completely different approach would be to work with each element of the universe individually: for each element of the universe, you compute which sets actually contain that element, and then (atomically) increment the corresponding entries of the output matrix.
Asymptotically, this should be much more efficient than computing the intersections of sets. Unfortunately, it sounds hard to implement efficiently.
An improvement is to work with, say, 4 elements of the universe at a time. You split all of your sets into 16 buckets, depending on which of those 4 elements each set contains. Then, for each of the 16*16 possible pairs of buckets, you iterate through all pairs of sets from the two buckets and (atomically) update the corresponding entry of the matrix appropriately.
This should be even faster than the version described above, but it still may potentially be difficult to implement.
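To make the bucketing idea concrete, here is a serial CPU sketch (my own illustration, not from the answer): contains(s, e) is an assumed membership test, A is the output matrix, and __builtin_popcount is the usual GCC/Clang intrinsic.

#include <vector>

// adds the contribution of four chosen universe elements e0..e3 to A
void addFourElements(int n, int e0, int e1, int e2, int e3, long long* const* A)
{
    std::vector<int> bucket[16];                      // keyed by the 4-bit membership pattern
    for (int s = 0; s < n; ++s) {
        int key = (contains(s, e0) ? 1 : 0) | (contains(s, e1) ? 2 : 0)
                | (contains(s, e2) ? 4 : 0) | (contains(s, e3) ? 8 : 0);
        bucket[key].push_back(s);
    }
    for (int a = 0; a < 16; ++a)
        for (int b = 0; b < 16; ++b) {
            int add = __builtin_popcount(a & b);      // how many of the 4 elements both sets share
            if (add == 0) continue;
            for (int i : bucket[a])
                for (int j : bucket[b])
                    A[i][j] += add;                   // atomically, in a GPU version
        }
}

On the GPU, the double loop over bucket pairs is what you would parallelize, with atomic adds into the output matrix.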
To reduce the difficulty of getting all of the synchronization worked out, you could partition all of the input sets into k groups of n/k sets each. Then, the (i,j)-th thread (or warp or block) only does the updates for the corresponding block of the output matrix.

A different approach to breaking up the problem is to split the universe into smaller partitions of 1024 elements each, and compute just the size of the intersections in each part of the universe.
I'm not sure if I've described that well; basically you're computing
A[i,j] = sum((k in v[i]) * (k in w[j]) for k in the_universe)
where v and w are the two lists of sets, and k in S is 1 if true and 0 otherwise. The point is to permute the indices so that k is in the outer loop rather than the inner loop, although for efficiency you will have to work with many consecutive k at once, rather than one at a time.
That is, you initialize the output matrix to all zeroes, and for each block of 1024 universe elements, you compute the sizes of the intersections and accumulate the results into the output matrix.
I choose 1024 because I imagine you'll have a data layout where that's probably the smallest size at which you can still get full memory bandwidth when reading from device memory, with all of the threads in a warp working together. (Adjust this appropriately if you know better than me, or if you aren't using NVIDIA and whatever other GPUs you're using would work better with something else.)
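A serial reference of the permuted loop order might look like this (illustrative only; on the GPU each chunk would be one kernel launch, or one stage of a tiled kernel, with the i/j loops spread over blocks and threads):

// v and w are the two lists of sets, each set stored as wordsPerSet 32-bit words
void accumulateByChunks(const unsigned* const* v, const unsigned* const* w,
                        long long* const* A, int n, int wordsPerSet)
{
    const int CHUNK_WORDS = 1024 / 32;                 // 1024 universe elements per chunk
    for (int c = 0; c < wordsPerSet; c += CHUNK_WORDS) // outer loop over the universe
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                for (int k = c; k < c + CHUNK_WORDS && k < wordsPerSet; ++k)
                    A[i][j] += __builtin_popcount(v[i][k] & w[j][k]);
}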
Now that your elements are a reasonable size, you can now appeal to traditional linear algebra optimizations to compute this product. I would probably do the following:
Each warp is assigned a large number of rows of the output matrix. It reads the corresponding elements out of the second vector, and then iterates through the first vector, computing products.
You could have all of the warps operate independently, but it may be better to do the following:
All of the warps in a block work together to load some number of elements from the first vector
Each warp computes the intersections it can and writes the results to the output matrix
You could store the loaded elements in shared memory, but you might get better results holding them in registers. Each warp can only compute the intersections with the set elements it's holding onto, but after doing so the warps can all rotate which warp is holding which elements.
If you do enough optimizations along these lines, you will probably reach the point where you are no longer memory bound, which means you might not have to go so far as to do the most complicated optimizations (e.g. the shared memory approach described above might already be enough).

Related

Is it better to reduce the space complexity or the time complexity for a given program?

Grid Illumination: Given an NxN grid with an array of lamp coordinates. Each lamp provides illumination to every square on their x axis, every square on their y axis, and every square that lies in their diagonal (think of a Queen in chess). Given an array of query coordinates, determine whether that point is illuminated or not. The catch is when checking a query all lamps adjacent to, or on, that query get turned off. The ranges for the variables/arrays were about: 10^3 < N < 10^9, 10^3 < lamps < 10^9, 10^3 < queries < 10^9
It seems like I can get one but not both. I tried to get this down to logarithmic time but I can't seem to find a solution. I can reduce the space complexity but it's not that fast, exponential in fact. Where should I focus on instead, speed or space? Also, if you have any input as to how you would solve this problem please do comment.
Is it better for a car to go fast or go a long way on a little fuel? It depends on circumstances.
Here's a proposal.
First, note you can number all the diagonals that the inputs lie on by using the first point as the "origin" for both nw-se and ne-sw. The diagonals through this point are both numbered zero. The nw-se diagonal numbers increase per-pixel in, e.g., the northeast direction and decrease (going negative) to the southwest. Similarly, the ne-sw diagonal numbers increase in, e.g., the northwest direction and decrease (going negative) to the southeast.
Given the origin, it's easy to write constant time functions that go from (x,y) coordinates to the respective diagonal numbers.
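For example (illustrative helpers, not from the original answer; which family of diagonals you call nw-se depends on your axis orientation):

// (x0, y0) is the chosen origin; cells on one family of diagonals share (x + y),
// cells on the other family share (x - y)
long long diagSum (long long x, long long y, long long x0, long long y0) { return (x - x0) + (y - y0); }
long long diagDiff(long long x, long long y, long long x0, long long y0) { return (x - x0) - (y - y0); }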
Now each set of lamp coordinates is naturally associated with 4 numbers: (x, y, nw-se diag #, sw-ne diag #). You don't need to store these explicitly. Rather you want 4 maps xMap, yMap, nwSeMap, and swNeMap such that, for example, xMap[x] produces the list of all lamp coordinates with x-coordinate x, nwSeMap[nwSeDiagonalNumber(x, y)] produces the list of all lamps on that diagonal, and similarly for the other maps.
Given a query point, look up its corresponding 4 lists. From these it's easy to deal with adjacent squares. If any list is longer than 3, removing adjacent squares can't make it empty, so the query point is lit. If a list has only 3 or fewer entries, it's a constant-time operation to see if they're all adjacent.
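A sketch of the per-query test, under the same assumptions (Point, the four std::unordered_map's, and the diagonal helpers above are all illustrative, not the answer's actual code; needs <unordered_map>, <vector>, <cstdlib>):

struct Point { long long x, y; };
using LampMap = std::unordered_map<long long, std::vector<Point>>;

bool isIlluminated(long long qx, long long qy,
                   const LampMap& xMap, const LampMap& yMap,
                   const LampMap& nwSeMap, const LampMap& swNeMap,
                   long long x0, long long y0)
{
    const LampMap* maps[4] = { &xMap, &yMap, &nwSeMap, &swNeMap };
    long long keys[4] = { qx, qy, diagDiff(qx, qy, x0, y0), diagSum(qx, qy, x0, y0) };
    for (int m = 0; m < 4; ++m) {
        auto it = maps[m]->find(keys[m]);
        if (it == maps[m]->end()) continue;
        const std::vector<Point>& lamps = it->second;
        if (lamps.size() > 3) return true;              // they can't all be adjacent to the query
        for (const Point& p : lamps)                    // short list: test adjacency directly
            if (std::llabs(p.x - qx) > 1 || std::llabs(p.y - qy) > 1)
                return true;
    }
    return false;
}

Which key function goes with which map is arbitrary here; it just has to match how the maps were built.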
This solution requires the input points to be represented in 4 lists. Since they already need to be represented in one list, you can argue that this algorithm requires only a constant factor of extra space with respect to the input. (I.e. the same sort of cost as mergesort.)
Run time is expected constant per query point for 4 hash table lookups.
Without much trouble, this algorithm can be split so it can be map-reduced if the number of lampposts is huge.
But it may be sufficient and easiest to run it on one big machine. With a billion lampposts and careful data structure choices, it wouldn't be hard to implement with 24 bytes per lamppost in an unboxed-structures language like C. So a ~32 GB RAM machine ought to work just fine. Building the maps with multiple threads requires some synchronization, but that's done only once. The queries can be read-only: no synchronization required. A nice 10-core machine ought to do a billion queries in well less than a minute.
There is a very easy answer which works:
Create a grid of N x N.
Now for each lamp, increment the count of all the cells which are supposed to be illuminated by that lamp.
For each query, check if the cell at that query has a value > 0.
For each lamp adjacent to (or on) the query, find all the cells it illuminates and reduce their counts by 1.
This worked fine but failed the size limit when trying a 10000 x 10000 grid.

How to get N greatest elements out of M elements using CUDA, where N << M?

I am just wondering whether there are any efficient ways of getting the N greatest elements out of M elements, where N is much smaller than M (e.g. N = 10, and M = 1000), using the GPU.
The problem is that, due to the large size of the input data, I really do not want to transfer the data from the GPU to the CPU and then get it back. However, exact sorting does not seem to work well because of thread divergence and the time wasted on sorting elements that we do not really care about (in the case above, the don't-care elements are 11 ~ 1000).
If N is small enough that the N largest values can be kept in shared memory, that would allow a fast implementation that only reads through your array of M elements in global memory once and then immediately writes out these N largest values. Implementation becomes simpler if N also doesn't exceed the maximum number of threads per block.
Contrary to serial programming, I would not use a heap (or other more complicated data structure), but just a sorted array. There is plenty of parallel hardware on an SM that would go unused when traversing a heap. The entire thread block can be used to shift the elements of the shared memory array that are smaller than the newly incoming value.
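A minimal single-block sketch of that idea (my own, with simplifications: a single block launched with at least N threads and N*sizeof(float) bytes of dynamic shared memory, and candidates processed one at a time):

#include <cfloat>

__global__ void topN(const float* data, int M, float* out, int N)
{
    extern __shared__ float best[];                 // best[0..N-1], sorted descending
    int t = threadIdx.x;
    for (int i = t; i < N; i += blockDim.x) best[i] = -FLT_MAX;
    __syncthreads();

    for (int j = 0; j < M; ++j) {
        float v = data[j];                          // broadcast read; same value for all threads
        if (v <= best[N - 1]) continue;             // uniform branch: v is not in the top N
        float mine = (t < N) ? best[t] : -FLT_MAX;
        float prev = (t > 0 && t < N) ? best[t - 1] : FLT_MAX;
        __syncthreads();
        if (t < N && mine < v) {
            if (t + 1 < N) best[t + 1] = mine;      // shift smaller entries down one slot
            if (prev >= v) best[t] = v;             // v lands in the first slot it beats
        }
        __syncthreads();
    }
    if (t < N) out[t] = best[t];
}

A real implementation would read the data in coalesced tiles and use several blocks whose partial results are merged, but the shared-memory shift is the same.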
If N<=32, a neat solution is possible that keeps a sorted list of the N largest numbers in registers, using warp shuffle functions.

CUDA: Launching many parallel calls to cuBLAS on different subsections of a matrix, without serializing

In my application, I have a double-complex N*3 matrix (where N is several thousand) and a 3*1 vector, and I am forming an N*1 result using zgemv.
The N*3 is a subsection of a larger M*3 matrix (where M is slightly larger than N, but of the same order of magnitude).
Each thread must perform a zgemv call to a different subsection of the larger matrix. That is, the N*3 is different for every thread. But all of the N*3 are formed from some portion of the larger M*3.
There isn't enough memory for each thread to store an independent N*3. Furthermore, the M*3 is too large to fit in shared memory. Thus each thread must pull its data from a single copy of the M*3. How can I do this without millions of threads serializing memory reads to the same memory locations in the M*3? Is there a more efficient way to approach this?
Probably, based on what I can gather so far, there are 2 types of optimizations I would want to consider:
convert operations that use the same N subset to a matrix-matrix multiply (zgemm), instead of multiple zgemv operations.
cache-block for the GPU L2 cache.
I'll discuss these in reverse order using these numbers for discussion:
M: ~10,000
N: ~3,000
cublas zgemv calls: ~1e6
"typical" Kepler L2: 1.5MB
An Nx3 matrix requires approximately 10,000 elements, each of which is 16 bytes, so let's call it 160K bytes. So we could store ~5-10 of these subsets in a memory size comparable to L2 cache size (without taking into account overlap of subsets - which would increase the residency of subsets in L2).
There are (M-N) possible unique contiguous N-row subsets in the M matrix. There are 1e6 zgemv calls, so on average each subset gets re-used 1e6/(M-N) times, approximately 100-150 times each. We could store about 10 of these subsets in the proposed L2, so we could "chunk" our 1e6 calls into "chunks" of ~1,000 calls that all operate out of the same data set.
Therefore the process I would follow would be:
transfer the M*3 matrix to the device
predetermine the N*3 subset needed by each thread.
sort or otherwise group like subsets together
divide the sorted sets into cache-sized blocks
for each block, launch a CDP kernel that will spawn the necessary zgemv calls
repeat the above step until all blocks are processed.
One might also wonder if such a strategy could be extended (with considerably more complexity) to L1/Texture. Unfortunately, I think CDP would confound your efforts to achieve this. It's pretty rare that people want to invest the effort to cache-block for L1 anyway.
To extend the above strategy to the gemm case, once you sort your zgemv operations by the particular N subset they require, you will have grouped like operations together. If the above arithmetic is correct, you will have on average around 100-150 gemv operations needed for each particular N-subset. You should group the corresponding vectors for those gemv operations into a matrix, and convert the 100-150 gemv operations into a single gemm operation.
This reduces your ~1e6 zgemv operations to ~1e4 zgemm operations. You can then still cache-block however many of these zgemm operations will be "adjacent" in M and fit in a single cache-block, into a single CDP kernel call, to benefit from L2 cache reuse.
Given the operational intensity of GEMM vs. GEMV, it might make sense to dispense with the complexity of CDP altogether, and simply run a host loop that dispatches the ZGEMM call for a particular N subset. That host loop would iterate for M-N loops.
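As a sketch of what that host loop might look like with cuBLAS (hypothetical names: dM is the device copy of the M x 3 matrix in column-major order, and groupRowOffset, K, dVecs, dOut describe the pre-sorted groups of vectors):

#include <cublas_v2.h>
#include <cuComplex.h>

void runGroups(cublasHandle_t handle, const cuDoubleComplex* dM, int M, int N,
               int numGroups, const int* groupRowOffset, const int* K,
               cuDoubleComplex* const* dVecs, cuDoubleComplex* const* dOut)
{
    cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    for (int g = 0; g < numGroups; ++g) {
        const cuDoubleComplex* A = dM + groupRowOffset[g];  // N x 3 window into the M x 3 matrix
        const cuDoubleComplex* B = dVecs[g];                // 3 x K[g]: this group's vectors, packed as columns
        cuDoubleComplex*       C = dOut[g];                 // N x K[g] results
        cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, K[g], 3,
                    &one, A, M,                             // lda = M: A is a row window of the M x 3 matrix
                    B, 3,
                    &zero, C, N);
    }
}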

Find medians in multiple sub-ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub-ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub-range and running the standard median-finding algorithms.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5. (Not 4, since we are asking for the median of the first three elements of the original list.)
INTEGER ELEMENTS:
If the type of your elements is integer, then the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer appears in your input elements (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
create buckets for each number that lies in any of your sub-ranges.
iterate through all elements, for each number n, if we have bucket[n], then bucket[n]++.
compute the medians based on the aggregated values stored in your buckets.
Put another way, suppose you have a sub-range [0, 10], and you would like to compute the median. The bucket approach basically computes how many 0s there are in your inputs, how many 1s there are in your inputs, and so on. Suppose there are n numbers lying in the range [0, 10]; then the median is the (n/2)-th smallest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 but bucket[0] + ... + bucket[i - 1] is less than n/2.
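A minimal serial sketch of the counting idea for one integer sub-range [lo, hi] (hypothetical helper; assumes the sub-range is a range of values and contains at least one input element):

#include <vector>

int bucketMedian(const std::vector<int>& data, int lo, int hi)
{
    std::vector<long long> bucket(hi - lo + 1, 0);
    long long n = 0;
    for (int v : data)
        if (v >= lo && v <= hi) { ++bucket[v - lo]; ++n; }   // count occurrences of each value

    long long seen = 0;
    for (int i = 0; i <= hi - lo; ++i) {
        seen += bucket[i];
        if (2 * seen >= n)            // first i where the running count reaches n/2
            return lo + i;
    }
    return lo;                        // not reached when n > 0
}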
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated values need to be passed over the network.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements lying in a specific range (for example, [i * 2^K, (i+1) * 2^K]); you then narrow down the problem space by identifying which bucket the median lies in after each pass, decrease K by 1 in the next pass, and repeat until you can correctly identify the median.
FLOATING-POINT ELEMENTS
All of the elements fit into memory:
If all of your elements fit into memory, first sorting the N elements and then finding the medians for each sub-range is the best option. The linear-time heap solution also works well in this case if the number of your sub-ranges is less than log N.
The elements cannot fit into memory but are stored on a single machine:
Generally, an external sort typically requires three disk scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, then first sorting the N elements and then finding the medians for each sub-range by only loading the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking up the elements in that sub-range is better.
The elements are stored on multiple machines:
Since finding the median is a holistic operator, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, it is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on this problem.
I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because since you are dealing with integers you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs, then a little bit-twiddling will in fact allow you to use integer sort on them. From http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers : "The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign and magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply."
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
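A sketch of that bit trick (my own; it excludes NaNs and uses the equivalent flip-the-bits form rather than a subtraction):

#include <cstdint>
#include <cstring>

// maps a float to an unsigned key whose integer ordering matches the float ordering
uint32_t floatSortKey(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);                          // reinterpret the bits
    return (u & 0x80000000u) ? ~u                           // negative: flip everything
                             : (u | 0x80000000u);           // non-negative: set the sign bit
}

The keys can then be integer- or radix-sorted, and the transformation inverted afterwards.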
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
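A sketch of that idea for a value range [lo, hi] (illustrative; assumes the range is non-empty):

#include <algorithm>
#include <vector>

double rangeMedian(const std::vector<double>& sorted, double lo, double hi)
{
    auto first = std::lower_bound(sorted.begin(), sorted.end(), lo);   // first element >= lo
    auto last  = std::upper_bound(sorted.begin(), sorted.end(), hi);   // one past the last element <= hi
    size_t x = first - sorted.begin();
    size_t y = (last - sorted.begin()) - 1;
    return sorted[(x + y) / 2];          // "median of range [x,y] is arr[(x+y)/2]"
}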
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

How to quickly count the number of neighboring voxels?

I have got a 3D grid (voxels), where some of the voxels are filled, and some are not. The 3D grid is sparsely filled, so I have got a set filledVoxels with coordinates (x, y, z) of the filled voxels. What I am trying to do is find out, for each filled voxel, how many neighboring voxels are filled too.
Here is an example:
filledVoxels contains the voxels (1, 1, 1), (1, 2, 1), and (1, 3, 1).
Therefore, the neighbor counts are:
(1,1,1) has 1 neighbor
(1,2,1) has 2 neighbors
(1,3,1) has 1 neighbor.
Right now I have this algorithm:
voxelCount = new Map<Voxel, Integer>();
for (voxel v in filledVoxels)
    count = checkAllNeighbors(v, filledVoxels);
    voxelCount[v] = count;
end
checkAllNeighbors() looks up all 26 surrounding voxels. So in total I am doing 26*filledVoxels.size() lookups, which is quite slow.
Is there any way to cut down the number of required lookups? When you look at the above example you can see that I am checking the same voxels several times, so it might be possible to get rid of lookups with some clever caching.
If this helps in any way, the voxels represent a voxelized 3D surface (but there might be holes in it). I usually want to get a list of all voxels that have 5 or 6 neighbors.
You can transform your voxel space into an octree in which every node contains a flag that specifies whether it contains filled voxels at all.
When a node does not contain filled voxels, you don't need to check any of its descendants.
I'd say if each of your lookups is slow (O(size)), you should optimize it by binary search in an ordered list (O(log(size))).
I wouldn't worry much about the constant 26. If you iterate smarter, you could cache something and get 26 down to 10 or so, I think, but unless you have profiled the whole application and found out decisively that it is the bottleneck I would concentrate on something else.
As ilya states, there's not much you can do to get around the 26 neighbor look-ups. You have to make your biggest gains in efficiently identifying whether a given neighbor is filled or not. Given that the brute force solution is essentially O(N^2), you have a lot of possible ground to gain in that area. Since you have to iterate over all filled voxels at least once, I would take an approach similar to the following:
voxelCount = new Map<Voxel, Integer>();
visitedVoxels = new EfficientSpatialDataType();
for (voxel v in filledVoxels)
    for (voxel n in neighbors(v))
        if (visitedVoxels.contains(n))
            voxelCount[v]++;
            voxelCount[n]++;
        end
    next
    visitedVoxels.add(v);
next
For your efficient spatial data type, a kd-tree, as Zifre suggested, might be a good idea. In any case, you're going to want to reduce your search space by binning visited voxels.
If you're marching along the voxels one at a time, you can keep a lookup table corresponding to the grid, so that after you've checked a voxel once using IsFullVoxel() you put the value in this table. For each voxel you're marching in you can check if its lookup table value is valid, and only call IsFullVoxel() if it isn't.
OTOH it seems like you can't avoid iterating over all neighboring voxels, either using IsFullVoxel() or the LUT. If you had some more a priori information it could help. For instance, if you knew that there were at most x neighboring filled voxels, or you knew that there were at most y neighboring filled voxels in each direction. For instance, if you know you're looking for voxels with 5 to 6 neighbors, you can stop after you've found 7 full neighbors or 22 empty neighbors.
I'm assuming that a function IsFullVoxel() exists that returns true if a voxel is full.
If most of the moves in your iteration were to neighbors, you could reduce your checking by around 25% by not looking back at the ones you just checked before you made the step.
You may find a Z-order curve a useful concept here. It lets you (with certain provisos) keep a sliding window of data around the point you're currently querying, so that when you move to the next point, you don't have to throw away many of the queries you've already performed.
Um, your question is not very clear. I'm assuming you just have a list of the filled points. In that case, this is going to be very slow, because you have to iterate through it (or use some kind of tree structure such as a kd-tree, but this will still be O(log n)).
If you can (i.e. the grid is not too big), just make a 3d array of bools. 26 lookups in a 3d array shouldn't really take that long (and there really is no way to cut down on the number of lookups).
Actually, now that I think of it, you could make it a 3d array of longs (64 bits). Each 64 bit block would hold 64 (4 x 4 x 4) voxels. When you are checking the neighbors of a voxel in the middle of the block, you could do a single 64 bit read (which would be much faster).
Is there any way to cut down the number of required lookups?
You will, at minimum, have to perform at least 1 lookup per voxel. Since that's the minimum, any algorithm which performs only one lookup per voxel will meet your requirement.
One simplistic idea is to initialize an array to hold the count for each voxel, then look at each voxel and increment the neighbors of that voxel in the array.
Pseudo C might look something like this:
#define MAXX 100
#define MAXY 100
#define MAXZ 100
int x, y, z;
char countArray[MAXX][MAXY][MAXZ];
initializeCountArray(MAXX, MAXY, MAXZ); // Set all array elements to 0
for (x = 0; x < MAXX; x++)
    for (y = 0; y < MAXY; y++)
        for (z = 0; z < MAXZ; z++)
            if (VoxelExists(x, y, z))
                incrementNeighbors(x, y, z);
You'll need to write initializeCountArray so it sets all array elements to 0.
More importantly you'll also need to write incrementNeighbors so that it won't increment outside the array. A slight speed increase here is to only perform the above algorithm on all voxels in the interior, then do a separate run on all the outside edge voxels with a modified incrementNeighbors routine that understands there won't be neighbors on one side.
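One way to write incrementNeighbors is with simple bounds checks (a sketch, rather than the separate interior/edge passes suggested above; countArray and the MAX constants are those from the pseudo C):

void incrementNeighbors(int x, int y, int z)
{
    for (int dx = -1; dx <= 1; dx++)
        for (int dy = -1; dy <= 1; dy++)
            for (int dz = -1; dz <= 1; dz++) {
                if (dx == 0 && dy == 0 && dz == 0) continue;   // skip the voxel itself
                int nx = x + dx, ny = y + dy, nz = z + dz;
                if (nx < 0 || nx >= MAXX || ny < 0 || ny >= MAXY || nz < 0 || nz >= MAXZ)
                    continue;                                  // stay inside the array
                countArray[nx][ny][nz]++;                      // this cell gains one filled neighbor
            }
}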
This algorithm results in 1 lookup per voxel, and at most 26 byte additions per voxel. If your voxel space is sparse then this will result in very few (relative) additions. If your voxel space is very dense, you might consider reversing the algorithm - initialize the array to the value of 26 for each entry, then decrement the neighbors when a voxel doesn't exist.
The results for a given voxel (ie, how many neighbors do I have?) reside in the array. If you need to know how many neighbors voxel 2,3,5 has, just look at the byte in countArray[2][3][5].
The array will consume 1 byte per voxel. You could use less space, and possibly increase the speed a little bit by packing the bytes.
There are better algorithms if you know details about your data. For instance, a very sparse voxel space will benefit greatly from an octree, where you can skip large blocks of lookups when you already know there are no filled voxels inside. Most of these algorithms, however, would still require at least one lookup per voxel to fill their matrix, but if you are performing several operations then they may benefit more than this one operation.
