Efficient Product of 3 Sparse Matrices that creates a dense intermediate - algorithm

I have 3 matrices that are all sparse, A, B, and C.
I need to take the matrix product of AB, which results in a dense matrix.
After that, I need the element-wise product of AB with C, i.e. (AB) * C.
C is sparse, and therefore the element-wise multiplication will zero out most of the dense product AB, resulting in a sparse matrix again.
Knowing that, I am trying to figure out a strategy for not materializing all of the dense components of AB.
If C_{i,j} is 0, then I should not materialize (AB)_{i,j}. This means I can skip the dot product of row i of A with column j of B. But it seems very inefficient to write a for loop over the rows of A to pick out the entries I want to materialize.
Could there be another way to intelligently do this multiplication?
Here is an example data generator in R, although the real product AB that I have is more dense than this generator. FWIW help from any programming language would be useful, not necessarily R. (Eigen would be great though!)
require(Matrix)
n = 10000
p = 100
A = rsparsematrix(n, p, .1)  # n x p, 10% nonzero
B = rsparsematrix(p, p, .1)  # p x p, 10% nonzero
C = rsparsematrix(n, p, .1)  # n x p mask, 10% nonzero

This is pretty closely related to triangle counting. If A, B, and C were all binary matrices, then you could interpret them as the adjacency matrices for a tripartite graph and count for each edge in C how many triangles it belongs to.
Perhaps there's a triangle-counting or community-detection library in R that could be adapted to your use case.
Underneath such a library is likely the following trick (which I should have a citation for, but don't offhand). It involves sorting the nodes of the graph by degree and directing every edge from its lower-degree endpoint to its higher-degree endpoint. Then for each node, you test each pair of outgoing edges (a wedge) for the edge that would complete the triangle.
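To make the masked-product idea concrete, here is a minimal sketch in Python/SciPy (the question uses R, but any language was said to be welcome). It iterates over the nonzeros of C, computes only those entries of AB, and multiplies them by C. The function name and the per-entry Python loop are mine; a compiled implementation (e.g., in Eigen or Rcpp) would apply the same idea far faster.
import numpy as np
import scipy.sparse as sp

def sparse_sandwich(A, B, C):
    # Compute (A @ B) * C elementwise, materializing only those entries of
    # A @ B at which C is nonzero.  All arguments are SciPy sparse matrices.
    A = sp.csr_matrix(A)              # fast row slicing
    B = sp.csc_matrix(B)              # fast column slicing
    C = sp.coo_matrix(C)              # exposes (row, col, data) triplets
    vals = np.empty(C.nnz)
    for k in range(C.nnz):
        i, j = C.row[k], C.col[k]
        # sparse dot product of row i of A with column j of B
        vals[k] = (A.getrow(i) @ B.getcol(j)).toarray()[0, 0]
    vals *= C.data                    # elementwise product with C
    return sp.coo_matrix((vals, (C.row, C.col)), shape=C.shape).tocsr()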

How to build a graph for the cops and robbers problem?

This is a 2 part problem that I have given some thought on.
Problem Statement:
In an m by n rectangular field there is a robber R and two cops C1 and C2. Each of the three starts off at some initial square, and at the beginning of the chase, R, C1, and C2 all know each other's positions.
R makes the first move, then C1 and C2. They can only move up, down, left, or right. Some squares are inaccessible because there is an obstacle present. If C1 or C2 reaches the square that R is on, then they catch R.
In order to escape, R must reach a square X on the perimeter of the grid. If R reaches the square X before it's caught by C1 or C2, then R successfully escapes. Else, R is unable to escape.
As input we are provided: Values of m (number of rows) and n (number of columns), initial coordinates for R, C1, C2, and a list of inaccessible squares.
I) Using the input provided, how can you use an adjacency list to construct a graph to solve the problem? Analyze the runtime of graph creation.
I was actually thinking of using an adjacency matrix because of the grid representation, but we are asked to use an adjacency list. As a result, I'm confused about what should be considered a vertex and an edge in this problem. I was thinking that every square in the grid would be a vertex, and its edges would connect it to its neighboring squares, at least the ones it can reach, 4 being the maximum. So should my adjacency list store ALL m by n squares and then, for every square, maintain a linked list of its neighbors, i.e. the squares reachable from it? If I went with this route there would be m * n vertices, and for each of those I would have to check which squares are reachable (up, down, left, right) and whether that square is inaccessible, so I would have to scan through the inaccessible list provided as input, which would take O(n) time. So I guess that would put me up to O(m * n) running time for graph creation. Can I do better than this?
II) Given the graph you create in part (I) describe an algorithm to check if R can escape.
*Assumption: The strategy that R, C1, and C2 follow is negligible. It doesn't matter whether R, C1, and C2 move in a "smart" way or completely at random.
Since R declares its destination before the chase begins, I think it's just a matter of whether there exists a path from the square where R starts to its destination square. So can I get away with running DFS and checking whether R can reach its destination? But I don't know whether R will be able to avoid C1 and C2.
Guidance is appreciated.
Sounds like you pretty much know how to build the graph, but it's better to give each vertex a number instead of maintaining (x, y) tuples.
Allocate an array of m * n lists. Each position (x, y) on the grid will correspond to slot x + n*y in that array. That slot will contain a list of adjacent accessible vertex numbers, or null if it's an obstacle.
For now, initialize the array with an empty list at every position.
For each obstacle, set its corresponding array slot to null.
For each grid position (x, y), if it's a vertex (array[x + n*y] != null), check its neighbors to fill out its adjacency list. If array[x+1 + n*y] != null, for example, then the list at [x + n*y] would include [x+1 + n*y].
The resulting representation is pretty compact and good for many purposes. Since the vertices have degree <= 4, an adjacency list is much more efficient than an adjacency matrix.
The remaining part of your program will be greatly simplified as well, since it doesn't have to deal with coordinates or know anything about the original grid.
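As a minimal sketch of that numbering scheme (Python rather than pseudocode; it assumes 0-based coordinates with x as the column in 0..n-1 and y as the row in 0..m-1, and the function name is mine):
def build_grid_graph(m, n, obstacles):
    # Cell (x, y) maps to index x + n*y; obstacle slots hold None.
    adj = [[] for _ in range(m * n)]
    for (x, y) in obstacles:
        adj[x + n * y] = None
    for y in range(m):
        for x in range(n):
            v = x + n * y
            if adj[v] is None:                      # obstacle, not a vertex
                continue
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                px, py = x + dx, y + dy
                if 0 <= px < n and 0 <= py < m and adj[px + n * py] is not None:
                    adj[v].append(px + n * py)
    return adj
This touches each of the m * n cells a constant number of times, so graph creation is O(m * n) plus the time to read the obstacle list, matching the analysis in the question.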
Unfortunately, the "*Assumption" takes all the fun out of the second part.

Grouping large set of similar vectors

I have a 3d mesh of ~200,000 triangles.
To find all the flat (or nearly flat) surfaces on the model, I thought I could try to group triangles by their normal vectors (giving me the ones which face the same way) and then search these smaller sets for triangles which are similar in position or connected.
I cannot think of a good way to do this practically while also keeping things relatively speedy. I have come up with solutions which would take O(n²) time, but none which are elegant and quicker than that.
I have vertex information and triangle information (vertices, centre and normal).
Any suggestions would be appreciated.
It is possible that I have misunderstood the problem, so I am stating what I think you need to do: "Given a set of vectors, group parallel vectors together".
You could use a hash map to solve this problem. I am assuming that you stored the normal vectors as component triples (a, b, c).
You just need to write a function that converts a vector to an integer. For example, if I know that the components are integers with 0 <= a, b, c < 1000, then I can use F(a, b, c) = a + 1000b + 1000000c, which guarantees a unique integer for every unique vector. (To group parallel rather than merely identical vectors, normalize each vector first so that parallel vectors produce the same triple.) After this, it's just a matter of creating a hash map which maps each integer to a list and stores all the parallel vectors in the same list.
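As a rough illustration (Python, and not the answer's exact scheme), one can normalize each normal and quantize it to an integer triple that serves as the hash key. The tolerance parameter is my addition; normals that straddle a bin boundary can still land in different groups, so treat this as a sketch rather than a robust clustering.
import numpy as np
from collections import defaultdict

def group_by_normal(normals, tol=1e-2):
    # normals: (N, 3) array of (not necessarily unit) triangle normals.
    groups = defaultdict(list)           # integer triple -> list of triangle indices
    for idx, v in enumerate(normals):
        u = v / np.linalg.norm(v)        # normalize so parallel normals agree
        key = tuple(np.round(u / tol).astype(int))
        groups[key].append(idx)
    return groups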
You want to find connected components in the graph formed by your triangles. The only thing you need is to store adjacency information in a convenient form.
Create a list of all edges, each stored as a (min, max) pair of vertex indices; if every edge has two adjacent triangles, there are about 300,000 edges. This can be done in linear time:
For every vertex, count the number of adjacent vertices with a greater index, then take the partial (prefix) sums of these counts.
Allocate and fill an array of edges (second vertex plus utility data). Use the array from step 1 to access the edges adjacent to a vertex. Such an access takes constant time if the number of edges adjacent to a vertex is bounded above by a constant, so the whole step takes linear time.
The utility data mentioned above is the pair of indices of the triangles adjacent to each edge.
OK, now you have adjacency info; it is time to find connected components. You can use DFS for this. It runs in linear time because every triangle has three (a constant number of) neighbors.
Here you need to allocate about 200,000 * 4 * sizeof(int) bytes, and it can be done in linear time.
You might also want to read about the doubly connected edge list (DCEL), but it is pretty expensive.
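Here is a small Python sketch of the same idea, using a dictionary from (min, max) edges to triangles instead of the prefix-sum arrays described above; restrict the input to one normal group at a time if you only want near-flat patches. The function name and details are mine.
from collections import defaultdict

def triangle_components(triangles):
    # triangles: list of (i, j, k) vertex-index triples.
    edge_to_tris = defaultdict(list)
    for t, (i, j, k) in enumerate(triangles):
        for a, b in ((i, j), (j, k), (k, i)):
            edge_to_tris[(min(a, b), max(a, b))].append(t)    # edge as (min, max)
    adj = defaultdict(set)                 # triangle -> triangles sharing an edge
    for tris in edge_to_tris.values():
        for a in tris:
            for b in tris:
                if a != b:
                    adj[a].add(b)
    seen, components = set(), []
    for start in range(len(triangles)):
        if start in seen:
            continue
        comp, stack = [], [start]          # iterative DFS over triangles
        seen.add(start)
        while stack:
            t = stack.pop()
            comp.append(t)
            for u in adj[t]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        components.append(comp)
    return components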

Analytic geometry: ordering the vertices of a triangle to capture the shortest and second shortest sides

If I have x and y coordinates for triangle corners A, B, and C, I want to know which of the six orderings of {A, B, C} puts the shortest side of the triangle between the first two vertices in the ordering, and the second shortest side between the last two. I know how to solve this, but not in a way that isn't clumsy, inelegant, and all around ugly. My favorite language is Ruby, but I respect all of them.
As the third side of a triangle cannot be deduced from the other two, you must compute the three distances.
As the three points may need to be permuted in one of six ways, you cannot work this out with a decision tree that has fewer than three levels (two levels can distinguish at most four cases).
Hence, compute the three distances and sort them increasingly using the same optimal decision tree as here: https://stackoverflow.com/a/22112521/1196549 (obviously their A, B, C correspond to your distances). For every leaf of the tree, determine what permutation of your points you must apply.
For instance, if you determine |AB|<|CA|<|BC|, you must swap A and B. Solve all six cases similarly.
Doing this you will obtain maximally efficient code.
If you are completely paranoid like I am, you can organize the decision tree in such a way that the cases that require a heavier permutation effort are detected in two tests rather than three.
Here's how I would do it: let's take a triangle with sides x, y, and z, such that l(x) <= l(y) <= l(z). Then, let x', y', and z' be the vertices opposite to x, y, and z, respectively.
Your output will be y', z', x' (if you draw out your triangle, you'll see that this is the order which achieves your requirement). So, the pseudocode looks like:
For points a, b, c each with some coordinates (x, y), calculate the length of the segment opposite to each point (e.g. for a this is segment bc)
Order a, b, c by the length of their opposing segment in the order of [2nd longest, longest, shortest]
Return
Does this make sense? The real work is mapping each vertex to the Euclidean length of its opposing segment. If you get stuck, update your question with your code and I'm happy to help you work it out.
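A compact sketch of this opposite-side ordering (in Python rather than Ruby; it sorts the three distances instead of using the hand-rolled decision tree from the other answer, which is fine unless you need every last cycle):
import math

def order_vertices(A, B, C):
    # A, B, C are (x, y) pairs.  Returns the vertices ordered so the shortest
    # side joins the first two and the second-shortest side joins the last two.
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    # pair each vertex with the length of the side opposite it
    opposite = [(dist(B, C), A), (dist(C, A), B), (dist(A, B), C)]
    opposite.sort(key=lambda t: t[0])      # opposite-shortest, -middle, -longest
    x_, y_, z_ = (v for _, v in opposite)
    return (y_, z_, x_)                    # y', z', x' as described above
For example, order_vertices((0, 0), (1, 0), (0, 3)) returns ((1, 0), (0, 0), (0, 3)): the shortest side AB joins the first two vertices and the second-shortest side CA joins the last two.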

How to convert the half-spaces that constitute a convex hull to a set of extreme points?

I have a convex set in a Euclidean space (3D, but would like answers for nD) that is characterized by a finite set of half-spaces (normal vector + point).
Is there a better algorithm to find the extreme points of the convex set than to compute, by brute force, all points that are intersections of the boundaries of 3 (or, in nD, n) half-spaces and eliminate those that are not extreme points?
The key term is vertex enumeration of a polytope P. The idea of the algorithm described below is to consider the dual polytope P*. Then the vertices of P correspond to the facets of P*. The facets of P* are efficiently computed with Qhull, and then it remains to find the vertices by solving the corresponding sub-systems of linear equations.
The algorithm is implemented in the BSD-licensed toolset Analyze N-dimensional Polyhedra in terms of Vertices or (In)Equalities for Matlab, authored by Matt J, specifically its component lcon2vert. However, for the purpose of reading the algorithm and re-implementing it in another language, it is easier to work with the older and simpler con2vert file by Michael Kleder, which Matt J's project builds on.
I'll explain what it does step by step. The individual Matlab commands (e.g., convhulln) are documented on MathWorks site, with references to underlying algorithms.
The input consists of a set of linear inequalities of the form Ax<=b, where A is a matrix and b is a column vector.
Step 1. Attempt to locate an interior point of the polytope
The first try is c = A\b, which is the least-squares solution of the overdetermined linear system Ax=b. If A*c<b holds componentwise, this is an interior point. Otherwise, multivariable minimization is attempted with the objective function being the maximum of 0 and all the numbers A*c-b. If this fails to find a point where A*c-b<0 holds, the program exits with "unable to find an interior point".
Step 2. Translate the polytope so that the origin is its interior point
This is done by b = b - A*c in Matlab. Since 0 is now an interior point, all entries of b are positive.
Step 3. Normalize so that the right hand side is 1
This is just the division of the i-th row of A by b(i), done by D = A ./ repmat(b,[1 size(A,2)]); in Matlab. From now on, only the matrix D is used. Note that the rows of D are the vertices of the dual polytope P* mentioned at the beginning.
Step 4. Check that the polytope P is bounded
The polytope P is unbounded if the vertices of its dual P* lie on the same side of some hyperplane through the origin. This is detected using the built-in function convhulln, which computes the volume of the convex hull of given points. The author checks whether appending a zero row to the matrix D increases the volume of the convex hull; if it does, the program exits with "Non-bounding constraints detected".
Step 5. Computation of vertices
This is the loop
for ix = 1:size(k,1)
    F = D(k(ix,:),:);
    G(ix,:) = F \ ones(size(F,1),1);
end
Here, the matrix k encodes the facets of the dual polytope P*, with each row listing the vertices of the facet. The matrix F is the submatrix of D consisting of the vertices of a facet of P*. Backslash invokes the linear solver, and finds a vertex of P.
Step 6: Clean-up
Since the polytope was translated at Step 2, this translation is undone with V = G + repmat(c',[size(G,1),1]);. The remaining two lines attempt to eliminate repeated vertices (not always successfully).
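For reference, here is a rough Python/NumPy transcription of steps 2-6, with SciPy's Qhull wrapper standing in for convhulln. The interior point c is assumed to already be known, the boundedness check of step 4 is omitted, and the de-duplication at the end is as crude as the original's; the function name is mine.
import numpy as np
from scipy.spatial import ConvexHull

def vertex_enumeration(A, b, c):
    # Vertices of {x : A x <= b}, given a strict interior point c.
    b = b - A @ c                        # step 2: translate so 0 is interior
    D = A / b[:, None]                   # step 3: normalize, rhs becomes 1
    hull = ConvexHull(D)                 # facets of the dual polytope P*
    verts = []
    for facet in hull.simplices:         # step 5: each dual facet -> a vertex of P
        F = D[facet]
        v = np.linalg.lstsq(F, np.ones(len(facet)), rcond=None)[0]
        verts.append(v + c)              # step 6: undo the translation
    return np.unique(np.round(verts, 9), axis=0)   # remove repeated vertices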
I am the author of polco, a tool which implements the "double description method". The double description method is known to work well for many degenerate problems. It has been used to compute tens of millions of generators mostly for computational systems biology problems.
The tool is written in Java, runs in parallel on multicore CPUs, and supports various input and output formats including text and Matlab files. You will find more information and publications about the software and the double description method via the given link to a university department of ETH Zurich.

How to select points at a regular density

How do I select a subset of points at a regular density? More formally,
Given
a set A of irregularly spaced points,
a metric of distance dist (e.g., Euclidean distance),
and a target density d,
how can I select a smallest subset B that satisfies the condition below?
for every point x in A,
there exists a point y in B
which satisfies dist(x,y) <= d
My current best shot is to
start with A itself
pick out the closest (or just particularly close) couple of points
randomly exclude one of them
repeat as long as the condition holds
and repeat the whole procedure several times, keeping the best result. But are there better ways?
I'm trying to do this with 280,000 18-D points, but my question is about general strategy, so I also wish to know how to do it with 2-D points. And I don't really need a guarantee of a smallest subset. Any useful method is welcome. Thank you.
bottom-up method
select a random point
select the unselected point y for which min(d(x, y) for x in selected) is largest
keep going!
I'll call it bottom-up and the one I originally posted top-down. This is much faster in the beginning, so for sparse sampling this should be better?
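A small NumPy sketch of this bottom-up (farthest-point) selection, stopping once every point of A is within distance d of the selection; the brute-force distance updates are fine in 2-D but would need a nearest-neighbour structure for the 280,000-point 18-D case. The function name and the random seed handling are mine.
import numpy as np

def bottom_up_select(A, d, seed=0):
    # A: (N, dim) array.  Repeatedly add the point farthest from the current
    # selection until every point is within distance d of some selected point.
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(A)))]          # a random starting point
    mindist = np.linalg.norm(A - A[selected[0]], axis=1)
    while mindist.max() > d:
        nxt = int(mindist.argmax())                 # farthest remaining point
        selected.append(nxt)
        mindist = np.minimum(mindist, np.linalg.norm(A - A[nxt], axis=1))
    return selected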
performance measure
If guarantee of optimality is not required, I think these two indicators could be useful:
radius of coverage: max {y in unselected} min(d(x,y) for x in selected)
radius of economy: min {y in selected} min(d(x, y) for x in selected, x != y)
RC is the smallest d that the current selection still satisfies; there is no absolute inequality between these two, but RC <= RE is more desirable.
my little methods
For a little demonstration of that "performance measure," I generated 256 2-D points distributed uniformly or by standard normal distribution. Then I tried my top-down and bottom-up methods with them. And this is what I got:
RC is red, RE is blue; the X axis is the number of selected points. Did you think bottom-up could be as good? I thought so watching the animation, but it seems top-down is significantly better (look at the sparse region). Nevertheless, bottom-up is not too horrible given that it's much faster.
Here I packed everything.
http://www.filehosting.org/file/details/352267/density_sampling.tar.gz
You can model your problem with a graph: treat the points as nodes, and connect two nodes with an edge if their distance is smaller than d. Now you need to find a minimum number of vertices such that they, together with their adjacent vertices, cover all nodes of the graph. This is the minimum vertex cover problem (which is NP-hard in general), but you can use a fast 2-approximation: repeatedly take both endpoints of some remaining edge into the cover, then remove them from the graph.
P.S.: Of course you should also select the nodes which are fully disconnected in the graph; after removing these nodes (i.e., selecting them), your problem is vertex cover.
A genetic algorithm may well produce good results here.
update:
I have been playing a little with this problem and these are my findings:
A simple method (call it random-selection) to obtain a set of points fulfilling the stated condition is as follows:
start with B empty
select a random point x from A and place it in B
remove from A every point y such that dist(x, y) < d
while A is not empty go to 2
A kd-tree can be used to perform the look-ups in step 3 relatively quickly.
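As a minimal sketch of this random-selection loop (Python/SciPy rather than Perl): the tree is built once over all of A and removed points are simply marked as dead, which is a small simplification of the "remove from A" step. The function name is mine.
import numpy as np
from scipy.spatial import cKDTree

def random_selection(A, d, seed=0):
    # A: (N, dim) array.  Pick random surviving points into B and kill
    # everything within distance d of them, until nothing survives.
    rng = np.random.default_rng(seed)
    tree = cKDTree(A)
    alive = np.ones(len(A), dtype=bool)
    B = []
    for i in rng.permutation(len(A)):
        if alive[i]:
            B.append(int(i))
            alive[tree.query_ball_point(A[i], d)] = False
    return B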
The experiments I have run in 2D show that the subsets generated are approximately half the size of the ones generated by your top-down approach.
Then I used this random-selection algorithm to seed a genetic algorithm that resulted in a further 25% reduction in the size of the subsets.
For mutation, given a chromosome representing a subset B, I randomly choose a hyperball inside the minimal axis-aligned hyperbox that covers all the points in A. Then I remove from B all the points that are also in the hyperball and use random-selection to complete it again.
For crossover I employ a similar approach, using a random hyperball to divide the mother and father chromosomes.
I have implemented everything in Perl using my wrapper for the GAUL library (GAUL can be obtained from here).
The script is here: https://github.com/salva/p5-AI-GAUL/blob/master/examples/point_density.pl
It accepts a list of n-dimensional points from stdin and generates a collection of pictures showing the best solution for every iteration of the genetic algorithm. The companion script https://github.com/salva/p5-AI-GAUL/blob/master/examples/point_gen.pl can be used to generate the random points with a uniform distribution.
Here is a proposal which assumes the Manhattan distance metric:
Divide up the entire space into a grid of granularity d. Formally: partition A so that points (x1,...,xn) and (y1,...,yn) are in the same partition exactly when (floor(x1/d),...,floor(xn/d))=(floor(y1/d),...,floor(yn/d)).
Pick one point (arbitrarily) from each grid space -- that is, choose a representative from each set in the partition created in step 1. Don't worry if some grid spaces are empty! Simply don't choose a representative for this space.
Actually, the implementation won't have to do any real work to do step one, and step two can be done in one pass through the points, using a hash of the partition identifier (the (floor(x1/d),...,floor(xn/d))) to check whether we have already chosen a representative for a particular grid space, so this can be very, very fast.
Some other distance metrics may be able to use an adapted approach. For example, the Euclidean metric could use d/sqrt(n)-size grids. In this case, you might want to add a post-processing step that tries to reduce the cover a bit (since the grids described above are no longer exactly radius-d balls -- the balls overlap neighboring grids a bit), but I'm not sure how that part would look.
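A one-pass sketch of the grid/hash idea (Python); the cell side is left to the caller, e.g. d for the setting described above or d/sqrt(n) for the Euclidean variant, and the first point seen in each cell is kept as its representative. The function name is mine.
def grid_representatives(points, cell):
    # points: iterable of n-dimensional coordinate tuples.
    reps = {}
    for p in points:
        key = tuple(int(c // cell) for c in p)   # the partition identifier
        reps.setdefault(key, p)                  # first point in the cell wins
    return list(reps.values())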
To be lazy, this can be cast as a set cover problem, which can be handled by a mixed-integer programming solver/optimizer. Here is a GNU MathProg model for the GLPK LP/MIP solver, where C[i] denotes the set of points that can "satisfy" point i.
param N, integer, > 0;                            # number of points
set C{1..N};                                      # C[i]: points that can satisfy point i
var x{i in 1..N}, binary;                         # x[j] = 1 if point j is selected
s.t. cover{i in 1..N}: sum{j in C[i]} x[j] >= 1;  # every point must be covered
minimize goal: sum{i in 1..N} x[i];               # minimize number of selected points
With 1000 normally distributed points, it didn't find a provably optimal subset in 4 minutes, but it reported the true minimum, and the subset it selected had only one point more than that.
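For comparison, roughly the same model can be set up from Python with scipy.optimize.milp (available in SciPy 1.9+); the coverage matrix is built directly from pairwise distances, which is only practical for modest N, and the function name is mine.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds
from scipy.spatial.distance import cdist

def set_cover_milp(A, d):
    # A: (N, dim) array.  Minimize the number of selected points such that
    # every point of A lies within distance d of a selected point.
    n = len(A)
    cover = (cdist(A, A) <= d).astype(float)       # cover[i, j] = 1 if j satisfies i
    res = milp(c=np.ones(n),                       # objective: number of chosen points
               constraints=LinearConstraint(cover, lb=np.ones(n), ub=np.inf),
               integrality=np.ones(n),
               bounds=Bounds(0, 1))                # binary variables
    return np.flatnonzero(res.x > 0.5)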
