How to convert a large bipartite users-items graph into an items-items graph?

I have a very large (10M+ edges, ~5M vertexes) bipartite undirected users-items graph in format
item1: user1, user2, user3, ...
or
userX1: itemY1
userX2: itemY2
...
I need to convert my graph into an items-items graph in which the weight of the edge between vertices i and j equals the number of users who use both items (i.e. the size of the intersection of the sets of vertices adjacent to item_i and item_j). The problem is that this seems to require $O(n^2)$ operations, where $n$ is the number of edges in the graph, which would be impossible on the simple home PC I currently own. Is there any solution to this? A probabilistic data structure would suit my needs just fine, because I'm allowed to lose a small percentage of the data.

Notation: m = number of edges in the old graph
1. Sort the edge list E (the second of your proposed formats) by user (takes O(m log m)).
2. Go through E and determine all the runs of consecutive edges with the same user (O(m)).
3. Each pair of edges within a run gives you an edge in the new graph. Add it to the edge list F of the new graph (O(|F|), where |F| is at most n_user x max_n_items_per_user^2).
4. Contract all duplicates in F into single edges, with each edge weight given by the number of duplicates (O(|F|)).
If your graph is sparse, i.e. max_n_items_per_user is small, the above should be fairly efficient. Otherwise, this algorithm has the problem that it enumerates all edges in the new graph before contracting them, so one should think about how to get the edge weight directly.
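A minimal Python sketch of the steps above, under a few assumptions: the input is an iterable of (user, item) pairs, users are grouped with a dictionary instead of a sort, and pair counts go straight into a Counter, so the contraction happens on the fly and the duplicate-laden list F is never stored. The name `item_cooccurrence` and the optional `max_items_per_user` cut-off are illustrative and not from the question; the cut-off is one hypothetical way to spend the permitted data loss, by skipping users with very many items.

```python
from collections import Counter, defaultdict
from itertools import combinations


def item_cooccurrence(edges, max_items_per_user=None):
    """Count, for each item pair, how many users are linked to both items."""
    # Group items by user (hash-based stand-in for "sort, then find runs").
    items_by_user = defaultdict(set)
    for user, item in edges:
        items_by_user[user].add(item)

    weights = Counter()
    for items in items_by_user.values():
        if max_items_per_user is not None and len(items) > max_items_per_user:
            continue  # lossy shortcut (assumption): ignore very heavy users
        # Every unordered pair of this user's items adds 1 to that edge weight.
        for a, b in combinations(sorted(items), 2):
            weights[(a, b)] += 1
    return weights
```

The work is still proportional to the sum over users of (items per user)^2, so this only helps when most users touch few items.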

Related

Finding size of largest connected component of a graph

Consider we have a random undirected graph G = (V,E) with n vertices, now suppose for any two vertices u and v ∈ V, the probability that the edge between u and v ∈ E is 1/n. We need to figure out the size of the largest connected component in the undirected graph C(n).
C(n) should be equal to Θ(n**a), we need to run some experiments to give an estimate of a.
I am a bit confused on how to link the probability 1/n to the largest connected component, is there any way I can do so?
The process you're simulating here is called the Erdős–Rényi model. You have a collection of n nodes, and each pair of nodes has probability p of being linked. The (expected) shape of the resulting graph depends heavily on the choice of p, and there are a lot of famous results about this.
As for how to do this: one option would be to create a collection of n nodes, iterate over all pairs of nodes, and link them with probability 1/n. You can then run an algorithm like BFS or DFS over the graph to find and size the connected components.
Another would be to use the above approach, except that instead of doing a BFS or DFS you use a disjoint-set forest to perform the links and find the largest connected component.
Alternatively, because each edge is present with the same probability, independently of every other edge, the number of edges you have is binomially distributed and pretty tightly packed around its mean of (n - 1)/2, or roughly n/2, total edges. You could therefore generate about n/2 random edges, add them into the graph, then use the above techniques. (This will be much faster, as this does O(n) work rather than O(n^2) work to process the edges.)
Once you've gotten this worked out, you can vary n over a large range and run some sort of polynomial regression on it to find the best-fit curve. That's something you could either code up yourself, or which you could do by importing your data into Excel and using its regression tools.
As a spoiler, when you're done you'll find that the number of nodes in the largest connected component is Θ(n^(2/3)). If you search for "Erdős–Rényi critical case," you can find online proofs of this result. It's not a trivial result to prove (and definitely isn't obvious!), but it'll drop out of your empirical analysis.
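Here is a rough Python sketch of the experiment, assuming the pair-by-pair sampling with a disjoint-set forest described above and a least-squares line fitted to log(size) against log(n) as the regression. The function names, the number of trials, and the seed handling are illustrative choices, not part of the original answer.

```python
import math
import random


def largest_component_size(n, rng):
    """Sample G(n, 1/n) pair by pair and return the largest component size."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    p = 1.0 / n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # the edge (u, v) is present
                ru, rv = find(u), find(v)
                if ru != rv:
                    if size[ru] < size[rv]:
                        ru, rv = rv, ru
                    parent[rv] = ru  # union by size
                    size[ru] += size[rv]
    return max(size[find(x)] for x in range(n))


def estimate_exponent(ns, trials, seed=0):
    """Least-squares slope of log(mean largest component) against log(n)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for n in ns:
        mean = sum(largest_component_size(n, rng) for _ in range(trials)) / trials
        xs.append(math.log(n))
        ys.append(math.log(mean))
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den
```

Something like `estimate_exponent([250, 500, 1000, 2000], trials=20)` returns the fitted value of a; expect noisy estimates for small n and few trials.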

Linear algorithm to make edge weights unique

I have a weighted graph, and I want to compute a new weighting function for the graph, such that the edge weights are distinct, and every MST in the new graph corresponds to an MST in the old graph.
I can't find a feasible algorithm. I doubled all the weights, but that won't make them distinct. I also tried doubling the weights and adding different constants to edges with the same weights, but that doesn't feel right, either.
The new graph will have only 1 MST, since all edge weights are distinct.
Very simple: we multiply all weights by a factor K large enough to ensure that our small changes cannot affect the validity of an MST. I'll go overboard on this one:
K = max(<sum of all graph weights>, <quantity of edges>) + 1
Number the N edges in any order, 0 through N-1. To each scaled edge weight, add its edge number. Assuming integer weights, it's trivial to show that
the edge weights are now unique (the new difference between different weights is larger than the changes);
any MST in the new graph maps directly to a corresponding MST in the old one (each new edge weight is K times the old one, plus a quantity smaller than K -- the comparison (less or greater than) between any two edges with different old weights cannot be affected, so the added numbers only break ties).
Yes, this is overkill: you can restrict the value of K quite a bit. However, making it that large reduces the correctness proofs to lemmas a junior-high algebra student can follow.
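As a small Python sketch of that recipe, assuming the weights are non-negative integers given as a list in some fixed edge order (the function name `make_weights_unique` is made up for illustration):

```python
def make_weights_unique(weights):
    """Return new weights K * w + edge_number for integer input weights."""
    n_edges = len(weights)
    # K exceeds both the total weight and the largest edge number.
    k = max(sum(weights), n_edges) + 1
    return [k * w + i for i, w in enumerate(weights)]
```

For example, `make_weights_unique([3, 1, 3, 2, 1])` uses K = 11 and returns `[33, 12, 35, 25, 15]`: the two edges of weight 3 and the two of weight 1 are separated, while every edge that was strictly lighter than another stays lighter.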
We definitely cannot guarantee that all of the MSTs in the old graph are MSTs in the new graph; a counterexample is the complete graph on three vertices where all edges have equal weights. So I assume that you do not require the construction to give all MSTs, as that is not possible in the general case.
Can we always make it so that the new graph's MSTs are a subset of the old graph's? This would be easy if we could construct a graph without an MST. Of course, that doesn't make any sense and is impossible, since all graphs have at least one MST. Is it possible to change edge weights so that only one of the old graph's MSTs is an MST for the new graph? I propose that this is possible in general.
Determine some MST of the old graph.
Construct a new graph with the same edges and vertices, but with weights assigned as follows:
if the edge in the new graph belongs to the MST determined in step 1, give it a unique weight between 1 and n, the number of edges in the graph.
otherwise, give the edge in the new graph a unique weight greater than or equal to n^2, the square of the number of edges in the graph.
Without proof, it seems like this should guarantee that only the nominated MST from the old graph is an MST of the new graph, and therefore every MST in the new graph (there is just the one) is an MST in the old graph.
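A possible Python sketch of this construction, assuming a connected graph with vertices numbered 0 to num_vertices - 1 and edges given as (u, v, weight) tuples. Kruskal's algorithm is used here to pick the nominated MST, but any MST algorithm would do; the function name and input layout are illustrative.

```python
def reweight_around_one_mst(num_vertices, edges):
    """Return new distinct weights, one per edge, in the order of `edges`."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    n = len(edges)
    in_tree = [False] * n
    # Step 1: Kruskal's algorithm determines some MST of the old graph.
    for i in sorted(range(n), key=lambda i: edges[i][2]):
        ru, rv = find(edges[i][0]), find(edges[i][1])
        if ru != rv:
            parent[ru] = rv
            in_tree[i] = True

    # Step 2: tree edges get 1, 2, ...; all other edges get n^2, n^2 + 1, ...
    new_weights = [0] * n
    next_tree, next_other = 1, n * n
    for i in range(n):
        if in_tree[i]:
            new_weights[i] = next_tree
            next_tree += 1
        else:
            new_weights[i] = next_other
            next_other += 1
    return new_weights
```

On the equal-weight triangle from the counterexample above, `reweight_around_one_mst(3, [(0, 1, 5), (1, 2, 5), (0, 2, 5)])` returns `[1, 2, 9]`, which leaves the first two edges as the only MST.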
Now, one could ask whether we can do the deed with additional restrictions:
Can you do it if you only want to change the values of edges that are not unique in the old graph?
Can you do it if you want to keep relative weights the same for edges which were unique in the old graph?
One could even pose optimization problems:
What is the minimum number of edge weights that must be changed to guarantee it?
What is the weighting with minimum distance (according to some metric) from the old weighting that guarantees it?
What is the weighting that minimizes the average change while also guaranteeing it?
I am hesitant to attempt these problems, which I believe to be much more difficult.

Maximum Independent Subset of 2D Grid Subgraph

In the general case finding a Maximum Independent Subset of a Graph is NP Hard.
However consider the following subset of graphs:
Create an NxN grid of unit square cells.
Build a graph G by creating a vertex corresponding to every cell. Notice that there are N^2 vertices.
Create an edge between two vertices if their cells share a side. Notice there are 2N(N-1) edges.
A Maximum Independent Subset of G is obviously a checker pattern. A cell at the Rth row and Cth column is part of it if R+C is even (when N is even, the cells with R+C odd work equally well).
Now we create a graph G' by copying G and removing some vertices and edges. (If you remove a vertex, also remove all edges incident to it, of course. Also note that you can remove an edge without removing either of its endpoints.)
By what algorithm can we find a Maximum Independent Subset of G' ?
Read up here. I think you're still hosed, it remains NP-hard.
Since your degree is at most 4, the simple greedy algorithm obtains an approximation ratio of 2. Your resulting graph is also planar, so there is a good approximation algorithm (any fixed approximation ratio in poly time).
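For illustration, here is one common reading of "the simple greedy algorithm" as a Python sketch: repeatedly take a vertex of minimum remaining degree, add it to the set, and delete it together with its neighbours. The input format (a dict mapping each vertex to the set of its neighbours) and the function name are assumptions, and the approximation ratio quoted above is not checked by this code.

```python
def greedy_independent_set(adjacency):
    """adjacency: dict mapping each vertex to the set of its neighbours."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    chosen = set()
    while remaining:
        # Pick a vertex with the fewest neighbours still present.
        v = min(remaining, key=lambda x: len(remaining[x]))
        chosen.add(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        # Forget edges that pointed at deleted vertices.
        for nbrs in remaining.values():
            nbrs -= removed
    return chosen
```

The result is always an independent set, but not necessarily a maximum one.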

Deriving a weighted graph from groups of existing nodes - is there a smarter way?

I have an undirected weighted graph and derive a new graph based on groups of existing nodes: each group becomes a node in the new graph, connected to others based on the total weight of the edges between the original nodes.
In the current data structure, each node has a list of its neighbors and weights, and the algorithm takes each group / each node in group / each edge in node, and sums up the weights in order to determine the edges of the new graph.
This algorithm works fine, but it's slow - is there a way to avoid the 3-level iterations?
Keeping a single list of edges is an option, but when the new graph is built, the new list of edges would have to be scanned at each step to see if that edge already exists (and to increment its weight).
If O(E + Ngroups^2) is an option where E is the number of edges in the given graph and Ngroups is the number of groups, you can do it as follows.
You can create an adjacency matrix for the resulting graph. Then loop through all the edges in the given graph like
for each node u in Graph
    for each v such that edge u->v exists
        let w be the weight of u->v
        A[group(u)][group(v)] := A[group(u)][group(v)] + w
You can rebuild the new graph from adjacency matrix if you wish.
One possible optimization is to use a balanced tree to hold the edges starting in a particular node group instead of the whole adjacency matrix; this would lead to O(E*log(Ngroups)) time complexity.
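A short Python sketch of the same single pass, with a hash map keyed by group pairs swapped in for both the adjacency matrix and the balanced tree (so expected O(E) time, and memory only for group pairs that actually occur). It assumes the neighbour lists are stored as `{node: {neighbour: weight}}` with every undirected edge present in both directions, and that group ids are comparable; the names are illustrative.

```python
from collections import defaultdict


def contract_groups(neighbors, group_of):
    """Sum edge weights between groups; returns {(group_a, group_b): total}."""
    totals = defaultdict(float)
    for u, adjacent in neighbors.items():
        gu = group_of[u]
        for v, w in adjacent.items():
            gv = group_of[v]
            # Each undirected edge is seen from both ends; keep only the
            # direction with gu < gv, which also drops intra-group edges.
            if gu < gv:
                totals[(gu, gv)] += w
    return dict(totals)
```

Each key `(a, b)` with `a < b` is one edge of the new graph, weighted by the total weight of the original edges between the two groups.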

Bounding the number of edges between star graphs such that graph is planar

I have a graph G which consists only of star graphs. A star graph consists of one central node having edges to every other node in it. Let H1, H2,…,Hn be different star graphs of different sizes which are present in G. We call the set of all nodes which are centres in any star graph R.
Now suppose edges are added between these star graphs such that no edge joins two nodes in R. Then, how many edges can exist at most between the nodes in R and the nodes which are not in R, if the graph should remain planar?
I want the upper bound on the number of such edges. One upper bound that I have in mind is: consider them as bipartite planar graph where R is one set of vertices and rest of the vertices form another set A. We are interested in edges between these sets (R and A). Since it is planar bipartite, the number of such edges is bounded by twice the number of nodes in G.
My feeling is that there may be a better bound, maybe twice the number of nodes in A plus the number of nodes in R.
In case you can disprove my intuition, then that would also be good. Hopefully some of you can come up with a good bound along with some relevant arguments.
That's the best you can do. Take any planar graph G and construct its face-vertex incidence graph H, whose faces all have 4 edges. Let R be the set of faces of G and construct stars any which way using edges in H. This achieves the bound for bipartite planar graphs.
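For reference, the "twice the number of nodes" bound from the question follows from Euler's formula. A short derivation, assuming a simple, connected, bipartite plane graph with at least three vertices, where V, E and F denote its sets of vertices, edges and faces:

```latex
\begin{align*}
|V| - |E| + |F| &= 2
  && \text{(Euler's formula)} \\
4\,|F| &\le 2\,|E|
  && \text{(no odd cycles, so every face has at least 4 edges; each edge borders at most 2 faces)} \\
4\,\bigl(2 - |V| + |E|\bigr) &\le 2\,|E|
  && \text{(substitute } |F| = 2 - |V| + |E| \text{)} \\
|E| &\le 2\,|V| - 4
\end{align*}
```

Equality holds exactly when every face has 4 edges, which is the situation in the face-vertex incidence construction above.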
