CSR x CSR Matrix Multiplication for Finding Out Cycles - matrix

I am trying to count, for each vertex u of an undirected graph, the cycles of a specified length k that contain u. To do so I compute the k-th power of the adjacency matrix. I build a CSR representation of the graph from the edge list, and that part is really fast. But the CSR x CSR multiplication is really slow (it takes about 50 minutes for a 500k x 500k matrix). Is there a more efficient approach, given that this is an adjacency matrix? Or is there a better CSR x CSR matrix multiplication that I could look at? I could not find any CSR x CSR multiplication example, either as an algorithm or as a C++ implementation.
void multiply_matrix(std::vector<int> &adj, std::vector<int> &xadj, std::vector<int> &values, std::vector<int> &adj2, std::vector<int> &xadj2, std::vector<int> &values2, int size)
{
    std::vector<int> result_adj;
    std::vector<int> result_xadj(size+1,0);
    std::vector<int> result_value(values.size(),0);
    for(int i = 0; i<size; i++)
    {
        for(int j = 0; j<size; j++)
        {
            int result = 0;
            int startIndex = xadj[i];
            int endIndex = xadj[i+1];
            for(int index = startIndex; index<endIndex; index++)
            {
                int currentValRow = values[adj[index]];
                bool shouldContinue = false;
                for(int colIndex = xadj2[j]; colIndex<xadj2[j+1]; colIndex++)
                {
                    if(adj[index] == adj2[colIndex])
                    {
                        shouldContinue = true;
                        int currentValCol = values2[adj2[index]];
                        result += currentValCol*currentValRow;
                    }
                }
            }
            if(result != 0)
            {
                if(i+2 < result_xadj.size())
                    result_xadj[i+2] = result_xadj[i+1];
                result_value[j] = result;
            }
        }
    }
}

I solved my problem and want to share the solution with anyone else who lacks the "terminology" needed to find resources on this topic. When you google "sparse matrix multiplication" it is hard to find material on sparse matrix x sparse matrix multiplication, which is called SpGEMM. There are lots of informative papers about it.
The pseudocode of the algorithm I used:
[Figure: General SpGEMM algorithm]
I modified the algorithm a little to produce CSR output. The challenge there is allocating the result arrays that hold the CSR data (values, index array, etc.), because the number of non-zeros in the product is not known in advance. There are different methods to solve that issue, such as:
(1) Allocating the arrays as big as the upper bound. This may be a problem if your matrices are too big. If you decide to go this way you can look into: https://math.stackexchange.com/questions/1042096/bounds-of-sparse-matrix-multiplication.
(2) Running the multiplication once before allocating any memory for the result, only to determine the number of non-zeros in the result. Since this pass performs no memory writes, it finishes really fast, and the memory for the result arrays can be allocated after this "dummy run".
(3) Allocating a pre-determined amount and, when it turns out to be insufficient, allocating a new, bigger array and copying the contents over.
I implemented the function for both CPU (using OpenMP) and GPU (using CUDA). In the OpenMP approach I used a method similar to option 3: I used a separate vector for the results of each row and then combined the resulting vectors. The vector approach may be slower than doing the re-allocation manually, but it was easier, so I chose that way, and it is fast enough (for a test matrix with 500k rows and 500k columns the multiplication takes around 1.3 seconds using 60 threads on my test machine). For the GPU approach I used option 2: first I calculate the required amount of memory, then the actual multiplication happens.
Edit: Note that this method counts "walks" rather than paths, so the counted walks may contain repeated vertices.
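The row-by-row SpGEMM idea described above can be sketched as follows. This is only an illustration in Python, not the OpenMP or CUDA code from the answer; the function name `spgemm_csr` and its argument names are made up for this example. Each output row is accumulated in a dictionary and then appended to growing result arrays, which is closest to the third allocation strategy listed above.

```python
def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Multiply two CSR matrices row by row and return the product in CSR form."""
    c_ptr, c_idx, c_val = [0], [], []
    n_rows = len(a_ptr) - 1
    for i in range(n_rows):
        acc = {}  # column -> accumulated value for row i of the result
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Row k of B is scaled by A[i, k] and added into the accumulator.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0) + a_ik * b_val[q]
        for j in sorted(acc):  # keep column indices sorted within the row
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


# Triangle graph 0-1-2-0: the diagonal of A^3 counts closed walks of length 3.
ptr, idx, val = [0, 2, 4, 6], [1, 2, 0, 2, 0, 1], [1, 1, 1, 1, 1, 1]
a2 = spgemm_csr(ptr, idx, val, ptr, idx, val)
a3 = spgemm_csr(*a2, ptr, idx, val)
```

The work per output row is proportional to the number of multiplications that row actually needs, instead of scanning all `size * size` candidate entries as the nested i/j loops in the question do.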


PyOpenCL - Multi-dimensional reduction kernel

I'm a total newbie to OpenCL.
I'm trying to code a reduction kernel that sums along one axis of a multi-dimensional array. I stumbled upon this code, which comes from here: https://tmramalho.github.io/blog/2014/06/16/parallel-programming-with-opencl-and-python-parallel-reduce/
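__kernel void reduce(__global float *a, __global float *r, __local float *b) {
    uint gid = get_global_id(0);
    uint wid = get_group_id(0);
    uint lid = get_local_id(0);
    uint gs = get_local_size(0);

    // each work-item copies one element into local memory
    b[lid] = a[gid];
    // all copies must be visible before any work-item starts reducing
    barrier(CLK_LOCAL_MEM_FENCE);

    for(uint s = gs/2; s > 0; s >>= 1) {
        if(lid < s) {
            b[lid] += b[lid+s];
        }
        // wait until this step is finished before halving s again
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // work-item 0 writes the partial sum of its work-group
    if(lid == 0) r[wid] = b[lid];
}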
I don't understand the for loop part. I get that uint s = gs/2 means that we split the array in half, but then it is a complete mystery. Without understanding it, I can't really implement another version for taking the maximum of an array for instance, even less for multi-dimensional arrays.
Furthermore, as far as I understand, the reduce kernel needs to be rerun another time if "N is bigger than the number of cores in a single unit".
Could you give me further explanations on that whole piece of code? Or even guidance on how to implement it for taking the max of an array?
Complete code can be found here: https://github.com/tmramalho/easy-pyopencl/blob/master/008_localreduce.py
Your first question about the meaning of the for loop:
for(uint s = gs/2; s > 0; s >>= 1)
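It means that s starts at half the local size gs and is halved on every iteration (the shift s >>= 1 is equivalent to s = s/2) for as long as s > 0, so the last iteration runs with s = 1. In each iteration, every work-item with lid < s adds the element s positions above it, b[lid+s], into b[lid], so the upper half of the active range is folded onto the lower half and b[0] ends up holding the result for the whole work-group. This algorithm depends on the size being reduced (here the local size) being a power of 2; otherwise you'd have to deal with the elements in excess of a power of 2 separately, or pad your array with neutral values for the reduction (0 for a sum) up to a power-of-2 size.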
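To make the halving loop concrete, here is a small sequential simulation of what one work-group does, written in plain Python rather than OpenCL. It is a sketch under the assumption that each "work-item" is simply an index into the list `b`; the name `group_reduce` and its parameters are hypothetical. Swapping the operator and the neutral value turns the sum into a max.

```python
def group_reduce(values, op=lambda x, y: x + y, neutral=0.0):
    """Simulate the in-work-group tree reduction on a Python list."""
    size = 1
    while size < len(values):
        size *= 2
    # Pad with the neutral element so the length is a power of two.
    b = list(values) + [neutral] * (size - len(values))
    s = size // 2
    while s > 0:
        # Every "work-item" lid < s combines its slot with the slot s places away.
        for lid in range(s):
            b[lid] = op(b[lid], b[lid + s])
        s >>= 1
    return b[0]


total = group_reduce([1.0, 2.0, 3.0, 4.0, 5.0])
largest = group_reduce([1.0, 7.0, 3.0], op=max, neutral=float("-inf"))
```

In a real kernel the inner `for lid` loop runs in parallel, which is why a barrier is needed between steps; for a max reduction the `+=` in the kernel would become something like `b[lid] = fmax(b[lid], b[lid+s])`.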
As for your second concern, about N being bigger than the capacity of your GPU, you are right: you have to run your reduction in portions that fit and then merge the results.
Finally, when you ask for guidance on how to implement a reduction to get the max of an array, I would suggest the following:
For a simple reduction like max or sum, try using numpy, especially if you are dealing with programming the reduction by axis.
If you think that the GPU would give you an advantage, try first using pyopencl's Multidimensional Array functionality, e.g. max.
If the reduction is more math intensive, try using pyopencl's Parallel Algorithms, e.g. reduction
I think that the whole point of using pyopencl is to avoid dealing with the underlying GPU's architecture. Otherwise, it is easier to deal with CUDA or HIP directly instead of OpenCL.

What are some fast entropy calculation algorithms

private double log(double num, int base){
    return Math.log(num)/Math.log(base);
}

public double entropy(List<String> data){
    double entropy = 0.0;
    double prob = 0.0;
    String[] keys = iFrequency.getKeys();
    for(int i=0;i<keys.length;i++){
        prob = iFrequency.getPct(keys[i]);
        entropy = entropy - prob * log(prob,2);
    }
    return entropy;
}
I wrote a function that calculates the entropy of a data set. The function works fine and the math is correct. Everything would be fine if I was working with small data sets, but the problem is that I'm using this function to calculate the entropy of sets that have thousands or tens of thousands of members and my algorithm runs slowly.
Are there any algorithms other than the one that I'm using that can be used to calculate the entropy of a set? If not, are there any optimizations that I can add to my code to make it run faster?
I found this question, but they didn't really go into details.
First of all, it appears that you've built an O(N^2) algorithm, in that you recompute the sum of counts on every call to getPct. I recommend two operations:
(1) Sum the counts once and store the value. Compute prob manually as value[i] / sum.
(2) You'll save a small amount of time if you accumulate the sum of prob * Math.log(prob) inside the loop. When you're all done, negate the sum and divide once by Math.log(2).
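Both suggestions can be sketched in a few lines. The example below is in Python rather than Java and does not use the `iFrequency` object from the question; the function name `entropy_bits` is made up for illustration. The counts are built in one pass, the total is summed once, and the conversion to base 2 happens once at the end.

```python
import math
from collections import Counter


def entropy_bits(data):
    """Shannon entropy (in bits) of the empirical distribution of `data`."""
    counts = Counter(data)           # one pass to count occurrences
    total = sum(counts.values())     # summed once, not once per key
    if total == 0:
        return 0.0
    acc = 0.0
    for c in counts.values():
        p = c / total
        acc += p * math.log(p)       # natural log inside the loop
    return -acc / math.log(2)        # negate and convert to base 2 once
```

For example, `entropy_bits(["a", "b", "a", "b"])` is 1 bit, since both symbols are equally likely.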

Combinations of integers in OpenCL

I have a bunch of vectors (~500). I need to find the triple products of all combinations of these vectors in OpenCL. There are plenty of combination algorithms (r out of n things) in C++, but I have yet to find any implemented for the GPU. I have seen quite a few parallel permutation algorithms in CUDA, but I just want to know whether any viable combination algorithms exist.
I'll need to guess a bit here and there to answer your question.
I suppose you have an array V of n (~500) vectors. These vectors are all of same dimensionality m (probably m=3).
What you want is the component wise product of each 3 vectors vi, vj, vk where i,j,k in {0,..,n-1}.
Simple 3-dimensional example:
result[idx].x = V[i].x * V[j].x * V[k].x;
result[idx].y = V[i].y * V[j].y * V[k].y;
result[idx].z = V[i].z * V[j].z * V[k].z;
Now maybe your vectors are not 3-dimensional and maybe you don't want the component wise product but the sum of it (like in dot product), but I'm sure you're able to adjust the code accordingly.
The real question here is how to compute all possible i,j,k and idx. Correct?
Now with CUDA you are in a very fortunate position. You can just launch n*n*n threads in a grid and therefore get i,j,k for free without having to think about ways to compute combinations or permutations at all. Just do the following:
dim3 grid, block;
block.x = n;
block.y = 1;
block.z = 1;
grid.x = n;
grid.y = n;
grid.z = 1;
compute_product_kernel<<<grid, block>>>( V, result );
This way you'll launch n*n blocks of n threads. Computing i,j,k becomes trivial, computing idx is easy:
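__global__ void compute_product_kernel( myVector* V, myVector* result)
{
    // a kernel launched with <<<...>>> must be declared __global__
    int i = blockIdx.x;
    int j = blockIdx.y;
    int k = threadIdx.x;
    int idx = i * gridDim.y * blockDim.x + j * blockDim.x + k;
    // result[idx] is then computed from V[i], V[j], V[k] as in the example above
}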
Of course all of this only works because your n is within the limits of CUDA's block and grid range.
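The index arithmetic in the kernel can be checked on the host. The following Python sketch mirrors `idx = i * gridDim.y * blockDim.x + j * blockDim.x + k` for the case where all three dimensions equal n, and adds the inverse mapping, which would be useful if only a flat index is available (for instance with a one-dimensional launch). The function names are hypothetical.

```python
def flat_index(i, j, k, n):
    """Position of the triple (i, j, k) in a result array of n*n*n entries."""
    return (i * n + j) * n + k


def triple_from_index(idx, n):
    """Inverse mapping: recover (i, j, k) from a flat index."""
    i, rem = divmod(idx, n * n)
    j, k = divmod(rem, n)
    return i, j, k
```

In OpenCL, which the question asks about, the same triple could presumably be read directly from a three-dimensional NDRange via `get_global_id(0)`, `get_global_id(1)` and `get_global_id(2)`.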
Two more things though:
Maybe you want permutations instead of combinations. You could do that by skipping every combination where any two of i,j,k are the same. But I'd recommend keeping them anyway because computing when to skip is probably more expensive than doing the actual work. Also I'd advise against using the permutation to save memory for result because it would save you less than 1% and make the calculation much more complex.
Are you sure you've got enough memory to actually do this? Storing the result requires n*n*n*m*sizeof(float) bytes. With n=500 and m=3 that would already be 1.5 GB. Is that really what you are looking for? Maybe the next step of your processing can be combined into the calculation so that storing the intermediate result is not necessary.
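The numbers quoted above are easy to verify. The sketch below (Python, with made-up names) computes the storage for all n*n*n triples, the fraction saved by skipping triples with a repeated index, and the number of unordered triples i < j < k, assuming 4-byte floats.

```python
from math import comb


def result_bytes(n, m, bytes_per_value=4):
    """Bytes needed to store one m-component float vector per (i, j, k) triple."""
    return n ** 3 * m * bytes_per_value


n, m = 500, 3
all_triples = n ** 3
distinct_ordered = n * (n - 1) * (n - 2)   # i, j, k pairwise different
unordered = comb(n, 3)                     # i < j < k only
saved_fraction = 1 - distinct_ordered / all_triples
```

With n = 500 and m = 3 this gives 1,500,000,000 bytes, and skipping repeated indices saves about 0.6%. If the product is symmetric in its three arguments, as the component-wise product is, only the `unordered` triples are actually distinct, which is roughly a sixth of the full cube; whether the more complicated indexing is worth it depends on the rest of the pipeline.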

Selecting evenly distributed points algorithm

Suppose there are 25 points in a line segment, and these points may be unevenly distributed (spatially) as the following figure shows:
My question is how to select 10 of these 25 points so that the selected points are as evenly distributed in space as possible. In the ideal situation, the selected points would look something like this:
The question would be more elegant if I could state the criterion that defines an "even distribution". What I do know is what I expect of the selected points: if I divide the line segment into 10 equal sub-segments, there should be one point in each sub-segment. Of course, it may happen that some sub-segments contain no representative point. In that case I will resort to a neighboring sub-segment that does contain one, and divide that neighbor into two parts: if each part has a representative point, the empty sub-segment problem is solved. If one of the parts has no representative point, we can divide it further into smaller parts, or resort to the next neighboring sub-segment.
Using dynamic programming, a possible solution is implemented as follows:
#include <cmath>
#include <iostream>
#include <vector>
using namespace std;

struct Note
{
    int previous_node;
    double cost;
};
typedef struct Note Note;

int main()
{
    double dis[25] =
    {0.0344460805029088, 0.118997681558377, 0.162611735194631,
    0.186872604554379, 0.223811939491137, 0.276025076998578,
    0.317099480060861, 0.340385726666133, 0.381558457093008,
    0.438744359656398, 0.445586200710900, 0.489764395788231,
    0.498364051982143, 0.585267750979777, 0.646313010111265,
    0.655098003973841, 0.679702676853675, 0.694828622975817,
    0.709364830858073, 0.754686681982361, 0.765516788149002,
    0.795199901137063, 0.823457828327293, 0.950222048838355, 0.959743958516081};

    Note solutions[25];
    for(int i=0; i<25; i++)
    {
        solutions[i].cost = 1000000;
    }
    solutions[0].cost = 0;
    solutions[0].previous_node = 0;

    // cheapest chain ending at point i, using gaps as close to 0.1 as possible
    for(int i=0; i<25; i++)
    {
        for(int j= i-1; j>=0; j--)
        {
            double tempcost = solutions[j].cost + std::abs(dis[i]-dis[j]-0.1);
            if (tempcost<solutions[i].cost)
            {
                solutions[i].previous_node = j;
                solutions[i].cost = tempcost;
            }
        }
    }

    vector<int> selected_points_index;
    int i= 24;
    // walk back along the stored predecessors
    while (solutions[i].previous_node != 0)
    {
        i = solutions[i].previous_node;
        selected_points_index.push_back(i);   // reconstructed line (assumption)
    }

    for(int i=0; i<selected_points_index.size(); i++)
        cout << selected_points_index[i] << endl;   // reconstructed line (assumption)

    return 0;
}
The results are shown in the following figure, where the selected points are marked in green:
Until a good, and probably O(n^2) solution comes along, use this approximation:
Divide the range into 10 equal-sized bins. Choose the point in each bin closest to the centre of each bin. Job done.
If you find that any of the bins is empty choose a smaller number of bins and try again.
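The binning approximation above could look like this in Python. It is a minimal sketch that assumes the segment is [0, 1] (the question does not state the end points explicitly) and uses hypothetical function names; when a bin is empty it retries with one bin fewer, as suggested.

```python
def pick_by_bins(points, n_bins, lo=0.0, hi=1.0):
    """Pick, per equal-width bin of [lo, hi], the point nearest the bin centre."""
    width = (hi - lo) / n_bins
    best = {}  # bin index -> (distance to centre, point)
    for p in points:
        b = min(int((p - lo) / width), n_bins - 1)  # p == hi goes to the last bin
        d = abs(p - (lo + (b + 0.5) * width))
        if b not in best or d < best[b][0]:
            best[b] = (d, p)
    return [best[b][1] for b in sorted(best)]


def pick_evenly(points, n_bins, lo=0.0, hi=1.0):
    """Reduce the number of bins until every bin contains a point."""
    while n_bins > 0:
        chosen = pick_by_bins(points, n_bins, lo, hi)
        if len(chosen) == n_bins:
            return chosen
        n_bins -= 1
    return []
```

The cost is one pass over the points per attempt, so this stays cheap even for much larger inputs than 25 points.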
Without information about the scientific model that you are trying to implement it is difficult (a) to suggest a more appropriate algorithm and/or (b) to justify the computational effort of a more complicated algorithm.
Let {x[i]} be your set of ordered points. I guess what you need to do is to find the subset of 10 points {y[i]} that minimizes \sum{|y[i]-y[i-1]-0.1|} with y[-1] = 0.
Now, if you see the configuration as a strongly connected directed graph, where each node is one of the 25 doubles and the cost for every edge is |y[i]-y[i-1]-0.1|, you should be able to solve the problem in O(n^2 +nlogn) time with the Dijkstra's algorithm.
Another idea, which will probably lead to a better result, is to use dynamic programming: if the element x[i] is part of our solution, the total minimum is the sum of the minimum cost to get to x[i] plus the minimum cost to get from there to the final point. So you can compute a minimum solution for each point, starting from the smallest one, and for each next point take the minimum over its predecessors.
Note that you'll probably have to do some additional work to pick, from the solutions set, the subset of those with 10 points.
I've written this in c#:
for (int i = 0; i < 25; i++)
{
    for (int j = i-1; j > 0; j--)
    {
        double tmpcost = solution[j].cost + Math.Abs(arr[i] - arr[j] - 0.1);
        if (tmpcost < solution[i].cost)
        {
            solution[i].previousNode = j;
            solution[i].cost = tmpcost;
        }
    }
}
I've not done a lot of testing, and there may be some problem if the "holes" in the 25 elements are quite wide, leading to solutions that are shorter than 10 elements ... but it's just to give you some ideas to work on :)
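One way to address the concern about solutions with fewer than 10 elements is to add the number of chosen points as a second dimension to the dynamic program. The following Python sketch is a variation on the idea above, not a translation of the C# loop; the name `select_k_points` is hypothetical, and it assumes the input is sorted and that 1 <= k <= len(xs).

```python
def select_k_points(xs, k, step):
    """Choose k of the sorted values xs so consecutive gaps stay close to `step`.

    cost[c][i] is the smallest total deviation of a chain of c+1 chosen points
    that ends at xs[i]; the first chosen point is charged |xs[i] - step|,
    following the y[-1] = 0 convention.
    """
    n = len(xs)
    inf = float("inf")
    cost = [[inf] * n for _ in range(k)]
    prev = [[-1] * n for _ in range(k)]
    for i in range(n):
        cost[0][i] = abs(xs[i] - step)
    for c in range(1, k):
        for i in range(c, n):
            for j in range(c - 1, i):
                t = cost[c - 1][j] + abs(xs[i] - xs[j] - step)
                if t < cost[c][i]:
                    cost[c][i], prev[c][i] = t, j
    # Backtrack from the cheapest end point of a chain with exactly k points.
    i = min(range(k - 1, n), key=lambda idx: cost[k - 1][idx])
    chosen = []
    for c in range(k - 1, -1, -1):
        chosen.append(i)
        i = prev[c][i]
    return chosen[::-1]
```

The function returns the indices of the chosen points. The running time is O(k * n^2), which is negligible for n = 25 and k = 10.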
You can find approximate solution with Adaptive Non-maximal Suppression (ANMS) algorithm provided the points are weighted. The algorithm selects n best points while keeping them spatially well distributed (most spread across the space).
I guess you can assign point weights based on your distribution criterion - e.g. a distance from uniform lattice of your choice. I think the lattice should have n-1 bins for optimal result.
You can look up following papers discussing the 2D case (the algorithm can be easily realized in 1D):
Brown, Matthew, Richard Szeliski, and Simon Winder. "Multi-image matching using multi-scale oriented patches." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.
The second paper is less related to your problem, but it describes the basic ANMS algorithm. The first paper provides a faster solution. I guess both will do in 1D for a moderate number of points (~10K).

Interview Question: Find Median From Mega Number Of Integers

There is a file that contains 10G(1000000000) integers; please find the median of these integers. You are given 2G of memory to do this. Can anyone come up with a reasonable way? Thanks!
Create an array of 8-byte longs that has 2^16 entries. Take your input numbers, shift off the bottom sixteen bits, and create a histogram.
Now you count up in that histogram until you reach the bin that covers the midpoint of the values.
Pass through again, ignoring all numbers that don't have that same set of top bits, and make a histogram of the bottom bits.
Count up through that histogram until you reach the bin that covers the midpoint of the (entire list of) values.
Now you know the median, in O(n) time and O(1) space (in practice, under 1 MB).
Here's some sample Scala code that does this:
def medianFinder(numbers: Iterable[Int]) = {
  def midArgMid(a: Array[Long], mid: Long) = {
    val cuml = a.scanLeft(0L)(_ + _).drop(1)
    cuml.zipWithIndex.dropWhile(_._1 < mid).head
  }
  val topHistogram = new Array[Long](65536)
  var count = 0L
  numbers.foreach(number => {
    count += 1
    topHistogram(number>>>16) += 1
  })
  val (topCount,topIndex) = midArgMid(topHistogram, (count+1)/2)
  val botHistogram = new Array[Long](65536)
  numbers.foreach(number => {
    if ((number>>>16) == topIndex) botHistogram(number & 0xFFFF) += 1
  })
  val (botCount,botIndex) =
    midArgMid(botHistogram, (count+1)/2 - (topCount-topHistogram(topIndex)))
  (topIndex<<16) + botIndex
}
and here it is working on a small set of input data:
scala> medianFinder(List(1,123,12345,1234567,123456789))
res18: Int = 12345
If you have 64 bit integers stored, you can use the same strategy in 4 passes instead.
You can use the Medians of Medians algorithm.
If the file is in text format, you may be able to fit it in memory just by converting things to integers as you read them in, since an integer stored as characters may take more space than an integer stored as an integer, depending on the size of the integers and the type of text file. EDIT: You edited your original question; I can see now that you can't read them into memory, see below.
If you can't read them into memory, this is what I came up with:
Figure out how many integers you have. You may know this from the start. If not, then it only takes one pass through the file. Let's say this is S.
Use your 2G of memory to find the x largest integers (however many you can fit). You can do one pass through the file, keeping the x largest in a sorted list of some sort, discarding the rest as you go. Now you know the x-th largest integer. You can discard all of these except for the x-th largest, which I'll call x1.
Do another pass through, finding the next x largest integers less than x1, the least of which is x2.
I think you can see where I'm going with this. After a few passes, you will have read in the (S/2)-th largest integer (you'll have to keep track of how many integers you've found), which is your median. If S is even then you'll average the two in the middle.
Make a pass through the file and find count of integers and minimum and maximum integer value.
Take midpoint of min and max, and get count, min and max for values either side of the midpoint - by again reading through the file.
If one partition's count is larger than the other's, the median lies within that partition.
Repeat for the partition, taking into account size of 'partitions to the left' (easy to maintain), and also watching for min = max.
Am sure this'd work for an arbitrary number of partitions as well.
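The partition-and-count idea above can be sketched as a bisection over the value range. This Python version is a simplified variant: instead of tracking the sizes of the partitions to the left, it recounts the values at or below the midpoint on every pass. The name `median_by_bisection` and the `read_values` callable are hypothetical; `read_values` stands for "open the file and iterate over it again".

```python
def median_by_bisection(read_values):
    """Lower median of a re-readable stream of integers using O(1) extra memory.

    `read_values` is a zero-argument callable returning a fresh iterator over
    the data; each call corresponds to one pass through the file.
    """
    count, lo, hi = 0, None, None
    for v in read_values():
        count += 1
        lo = v if lo is None else min(lo, v)
        hi = v if hi is None else max(hi, v)
    if count == 0:
        raise ValueError("no values")
    target = (count + 1) // 2            # rank of the (lower) median
    while lo < hi:                       # stops when min == max
        mid = (lo + hi) // 2
        at_or_below = sum(1 for v in read_values() if v <= mid)
        if at_or_below >= target:
            hi = mid                     # median is in [lo, mid]
        else:
            lo = mid + 1                 # median is in [mid + 1, hi]
    return lo
```

Each pass halves the value range, so 32-bit integers need at most about 32 passes after the initial one, regardless of how many values the file holds.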
Do an on-disk external mergesort on the file to sort the integers (counting them if that's not already known).
Once the file is sorted, seek to the middle number (odd case), or average the two middle numbers (even case) in the file to get the median.
The amount of memory used is adjustable and unaffected by the number of integers in the original file. One caveat of the external sort is that the intermediate sorting data needs to be written to disk.
Given n = number of integers in the original file:
Running time: O(nlogn)
Memory: O(1), adjustable
Disk: O(n)
Check out Torben's method here: http://ndevilla.free.fr/median/median/index.html. There is also a C implementation at the bottom of that document.
My best guess is that a probabilistic median of medians would be the fastest approach. Recipe:
Take the next set of N integers (N should be big enough, say 1000 or 10000 elements).
Then calculate the median of these integers and assign it to the variable X_new.
If this is not the first iteration, calculate the median of the two medians:
X_global = (X_global + X_new) / 2
When you see that X_global no longer fluctuates much, you have found an approximate median of the data.
But there are some caveats:
The question arises whether the error in the median is acceptable or not.
The integers must be randomly ordered and uniformly distributed for the solution to work.
I've played a bit with this algorithm and changed the idea slightly: in each iteration X_new is added with a decreasing weight, such as:
X_global = k*X_global + (1.-k)*X_new
where k is in [0.5 .. 1.] and increases in each iteration.
The point is to make the calculation converge quickly to some number in a very small number of iterations, so that a very approximate median (with a big error) is found among 100000000 array elements in only 252 iterations! Check this C experiment:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define RANGE_SIZE 1000

// probabilistic median of medians method
// should print 5000 as data average
// from ARRAY_SIZE of elements
int main (int argc, const char * argv[]) {
    int iter = 0;
    int X_global = 0;
    int X_new = 0;
    int i = 0;
    float dk = 0.002;
    float k = 0.5;

    while (i<ARRAY_SIZE && k!=1.) {
        X_new = 0;               // reconstructed (assumption): reset the batch sum
        for (int j=i; j<i+RANGE_SIZE; j++) {
            X_new+=rand()%10000 + 1;
        }
        X_new /= RANGE_SIZE;     // reconstructed (assumption): batch average

        if (iter>0) {
            k += dk;
            k = (k>1.)? 1.:k;
            X_global = k*X_global+(1.-k)*X_new;
        }
        else {
            X_global = X_new;
        }

        i += RANGE_SIZE;         // reconstructed (assumption): next batch
        iter++;                  // reconstructed (assumption)
        printf("iter %d, median = %d \n",iter,X_global);
    }

    return 0;
}
Oops, it seems I'm talking about the mean, not the median. If so, and you need exactly the median rather than the mean, ignore my post. In any case, the mean and the median are closely related concepts.
Good luck.
Here is the algorithm described by @Rex Kerr implemented in Java.
/**
 * Computes the median.
 * @param arr Array of strings, each element represents a distinct binary number and has the same number of bits (padded with leading zeroes if necessary)
 * @return the median (number of rank ceil((m+1)/2) ) of the array as a string
 */
static String computeMedian(String[] arr) {
    // rank of the median element
    int m = (int) Math.ceil((arr.length+1)/2.0);
    String bitMask = "";
    int zeroBin = 0;
    while (bitMask.length() < arr[0].length()) {
        // puts elements which conform to the bitMask into one of two buckets
        for (String curr : arr) {
            if (curr.startsWith(bitMask))
                if (curr.charAt(bitMask.length()) == '0')
                    zeroBin++;   // restored: count the elements of the zero bucket
        }
        // decides in which bucket the median is located
        if (zeroBin >= m)
            bitMask = bitMask.concat("0");
        else {
            m -= zeroBin;
            bitMask = bitMask.concat("1");
        }
        zeroBin = 0;
    }
    return bitMask;
}
Some test cases and updates to the algorithm can be found here.
I was asked the same question and couldn't give an exact answer, so after the interview I went through some books on interviews, and here is what I found in the book Cracking the Coding Interview.
Example: Numbers are randomly generated and stored into an (expanding) array. How would you keep track of the median?
Our data structure brainstorm might look like the following:
• Linked list? Probably not. Linked lists tend not to do very well with accessing and sorting numbers.
• Array? Maybe, but you already have an array. Could you somehow keep the elements sorted? That's probably expensive. Let's hold off on this and return to it if it's needed.
• Binary tree? This is possible, since binary trees do fairly well with ordering. In fact, if the binary search tree is perfectly balanced, the top might be the median. But, be careful: if there's an even number of elements, the median is actually the average of the middle two elements. The middle two elements can't both be at the top. This is probably a workable algorithm, but let's come back to it.
• Heap? A heap is really good at basic ordering and keeping track of max and mins. This is actually interesting: if you had two heaps, you could keep track of the bigger half and the smaller half of the elements. The bigger half is kept in a min heap, such that the smallest element in the bigger half is at the root. The smaller half is kept in a max heap, such that the biggest element of the smaller half is at the root. Now, with these data structures, you have the potential median elements at the roots. If the heaps are no longer the same size, you can quickly "rebalance" the heaps by popping an element off the one heap and pushing it onto the other.
Note that the more problems you do, the more developed your instinct on which data structure to apply will be. You will also develop a more finely tuned instinct as to which of these approaches is the most useful.
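The two-heap idea from the quoted passage can be sketched with Python's `heapq` module. `heapq` only provides a min-heap, so the smaller half is stored negated to act as a max-heap; the class name `RunningMedian` is made up for this example. Note that this keeps every element in memory, so it illustrates the book's streaming example rather than the memory-limited file problem from the original question.

```python
import heapq


class RunningMedian:
    """Track the median of a growing collection with two heaps."""

    def __init__(self):
        self.small = []  # max-heap of the smaller half (values stored negated)
        self.large = []  # min-heap of the larger half

    def add(self, x):
        if self.small and x > -self.small[0]:
            heapq.heappush(self.large, x)
        else:
            heapq.heappush(self.small, -x)
        # Rebalance so that len(small) is len(large) or len(large) + 1.
        if len(self.small) > len(self.large) + 1:
            heapq.heappush(self.large, -heapq.heappop(self.small))
        elif len(self.large) > len(self.small):
            heapq.heappush(self.small, -heapq.heappop(self.large))

    def median(self):
        if not self.small:
            raise ValueError("no values added yet")
        if len(self.small) > len(self.large):
            return -self.small[0]                       # odd count: middle element
        return (-self.small[0] + self.large[0]) / 2     # even count: average of two
```

Each insertion costs O(log n) and reading the median is O(1), since the candidates always sit at the two roots.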
