Travelling around a graph and making predictions along the way - algorithm

I'm a student doing research, and I'm having trouble proceeding because I don't have much experience in this field. Even if you can't answer the questions directly, pointing me to important terms I should be searching for would be a big help.
If we have an agent travelling around a graph, going from node to node, are there algorithms (and if so, the names of algorithms) that do the following:
Predict the topology of the graph using Bayesian statistics, assuming the graph is finite and that what has been seen so far increasingly represents what will be seen in the future.
Predict labellings from previous labellings. So if we have a chain of 26 nodes, with the first node labelled A, the second B, and so on, then at some point you should be able to predict that the labellings are in alphabetical order, long before you reach the end.
Use the labellings to predict the topology of the graph; so if the graph seen so far is a chain and the labellings are clearly in alphabetical order, then a good guess for the rest of the graph is that it continues as a chain.
Pointers to a relevant API would be fantastic.

Related

Graph Topology Profiling

Can anyone suggest some algorithms that can be used to classify graph topology?
Input: Adjacency list with raw graph information.
Output: What kind of graph is it? Currently I want to focus only on pure types: daisy chain, mesh, ring, star, tree.
Which area of algorithm study is responsible for such algorithms? Is it computational geometry?
Edit - The size of graph will not exceed 32 nodes. However, there will be redundant links between nodes.
Edit - I understand that my question might be too broad, but at least give me the clue of what is wrong with the question before down-voting it. Or is it because of my reputation :-(
Start by checking that your graph is fully connected.
Then, check the distribution of the nodes' degree:
Ring: All nodes would have degree 2
Daisy chain: all nodes would have degree 2 except for 2 nodes with degree 1 (there are alternative definitions for what a daisy chain is).
Star: Each node would have degree 1, except for one node with degree n-1
Tree: The sum of the degrees is 2*(number of nodes-1). Also, if the highest degree is k, then there are at least k nodes with degree 1.
Mesh: Anything goes...
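A minimal sketch of these degree-distribution checks in Python, assuming the graph has already been verified to be connected and simple (the function name and the adjacency-dict format are my own choices):

```python
from collections import Counter

def classify(adj):
    """Classify a connected simple graph given as {node: set(neighbors)}.
    Returns one of 'ring', 'daisy chain', 'star', 'tree', 'mesh',
    following the degree checks described above."""
    n = len(adj)
    degrees = [len(nbrs) for nbrs in adj.values()]
    counts = Counter(degrees)
    edges = sum(degrees) // 2

    if all(d == 2 for d in degrees):
        return "ring"
    if counts[1] == 2 and counts[2] == n - 2:
        return "daisy chain"
    if counts[1] == n - 1 and counts[n - 1] == 1:
        return "star"
    if edges == n - 1:          # connected with n-1 edges => tree
        return "tree"
    return "mesh"               # anything goes
```

Note the order of the checks matters: a daisy chain also passes the tree test, so the more specific shapes are tested first.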
I don't think there is an 'area' of algorithms that deals with such problems, but the term 'graph classes' is quite common (see for example here), though it is not a formal term.
To classify a new instance, you need a classification system in the first place!
Putting it another way, your graph (the item to classify) fits somewhere in some kind of data structure of graph topologies (the classification system). The system could be as simple as a list; in which case, you carry out the simple algorithm outlined in this other post where the list of topologies is keyed by degree distribution.
A more complex system could be a hierarchical one, similar to biological classification systems. This would only really be necessary for very large numbers of graph topologies, where it would make it faster to classify based on a series of decisions. Essentially a decision tree.
It may be difficult to find much research in this area (for pure graphs) as it's a little hard to think of applications. There are applications for protein fold topologies, but that may not be of interest.

Travelling by bus

If you have the full bus schedule for a country, how can you find the furthest anyone can travel in one day without visiting the same stop twice?
I assume a bus schedule gives you the full list of leaving and arriving times for every bus stop.
A slow and naive method would be as follows.
You can of course build a graph from the bus schedule, with multiple directed edges between bus stops. You could then do a depth-first search, remembering the arrival time of the edge you took to reach each node, and only taking edges from that stop that leave after the one you arrived on. If you reach a node you have visited before, you would only continue from there if the current time in your traversal is earlier than the earliest time at which you had previously visited that node. You could record the furthest you can get from each node, and then check each node to find the furthest you can travel overall.
This seems very inefficient however and it really isn't a normal graph problem. The problem is that in a normal directed graph if you can get from A to B and from B to C then you can get from A to C. This isn't true here.
What is the fastest you can solve this problem?
I think your original algorithm is pretty good.
You can think of your approach as being a version of Dijkstra's algorithm, in attempting to find the shortest path to each node.
Note that it is best at this stage to weight edges in the graph in terms of time. The idea is to use your Dijkstra-like algorithm to compute all nodes reachable within one day's worth of time, and then pick whichever of these nodes is furthest in space from the start point.
Implementations of Dijkstra's algorithm can use a heap to retrieve the next node to explore in O(log n), and I think this would be a good enhancement to your approach as well. If you always choose the node that you can reach earliest, you never need to repeat the calculation for that node.
Overall the approach is:
For each starting point
Use a modified Dijkstra to compute all nodes reachable in 1 day
Find the furthest in space of all these nodes.
So for n starting points and e bus routes, the complexity is about O(n(n+e)log(n)) to get the optimal answer.
You should be able to get improved performance by using an appropriate heuristic in an A* search. The heuristic needs to underestimate the max distance possible from a point, so you could use the maximum speed of a bus multiplied by the remaining time.
Instead of making multiple edges for each departure from a location, you can make multiple nodes per location / time.
Create one node per location per departure time.
Create one node per location per arrival time.
Create edges to connect departures to arrivals.
Create edges to connect a given node to the node belonging to the same location at the nearest future time.
By doing this, any path you can traverse through the graph is "valid" (meaning a traveler would be able to achieve this by a combination of bus trips or choosing to sit at a location and wait for a future bus).
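A minimal sketch of this construction in Python, assuming a hypothetical schedule format of (from_stop, depart_time, to_stop, arrive_time) tuples (the function name and data layout are mine):

```python
from collections import defaultdict

def build_time_expanded_graph(trips):
    """Build the time-expanded graph described above.

    `trips` is a list of (from_stop, depart_time, to_stop, arrive_time)
    tuples.  Nodes are (stop, time) pairs; edges are (node, node) pairs.
    """
    edges = []
    times_at = defaultdict(set)

    # One node per departure and per arrival, plus a "ride" edge per trip.
    for frm, dep, to, arr in trips:
        times_at[frm].add(dep)
        times_at[to].add(arr)
        edges.append(((frm, dep), (to, arr)))

    # "Wait" edges: connect each node to the next event at the same stop.
    for stop, times in times_at.items():
        ordered = sorted(times)
        for t1, t2 in zip(ordered, ordered[1:]):
            edges.append(((stop, t1), (stop, t2)))

    nodes = {(s, t) for s in times_at for t in times_at[s]}
    return nodes, edges
```

Any path through this graph respects the timetable by construction, since every edge either rides a scheduled bus or waits in place for a later one.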
Sorry to say, but as described this problem has pretty high complexity. I originally misread the problem and thought it was NP-hard, but it is not. It does, however, have a high enough complexity that I personally would not want to tackle it exactly. The algorithm below is a pretty good approximation that gives considerable complexity savings, which I think is worth it.
If all you want is an answer that is "pretty good", there are a lot of fairly efficient algorithms out there that will get close very quickly.
Personally I would suggest using a simple greedy algorithm here.
I've tried this on a few (granted, small and contrived) examples and it worked pretty well, with O(n log n) efficiency.
Associate a velocity with each node, velocity being the fastest you can move away from a given node. In my examples this velocity was distance_travelled/(wait_time + travel_time). I used the maximum velocity of all trips leaving a node as the velocity score for that node.
From your node/time calculate the velocities of all neighboring nodes and travel to the "fastest" node.
This algorithm is pretty good for the complexity as it basically transforms the problem into a static search, but there are a couple potential pitfalls that could be adjusted for depending on your data set.
The biggest issue with this algorithm is the possibility of a really fast bus going into the middle of nowhere. You could get around that by adding a "popularity" term to the velocity calculation (make more popular stops effectively faster) but depending on your data set that could easily make things either better or worse.
The simplistic graph representation will not work, i.e. each city is a node and the edges represent travel time. That's because an edge is not always active; it is only usable at certain times of the day.
The second thing that comes to mind is Edward Tufte's Paris Train Schedule, which is a different kind of graph. But that does not quite fit the problem either. With the train schedule, the stations have a sequential relationship, but that's not the case in general with cities and bus schedules.
But Tufte motivates the following way to model it as a graph. You could write code only to construct the graph and use a standard graph library that includes the shortest path algorithm.
Each bus trip is an edge with weight = distance covered
Each (city, departure) and (city, arrival) is a node
All nodes for a given city are connected by zero-weight edges in a time-ordered sequence, ignoring whether it is an arrival or a departure. This subgraph will look like a chain.
(it is a directed graph)
Linear Time Solution: Note that the graph will be a directed, acyclic graph. Finding the longest path in such a graph is linear. "A longest path between two given vertices s and t in a weighted graph G is the same thing as a shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G."
Hope this helps! If somebody can post a visualization of the graph, it would be nice. If I can do so myself, I will do 1 more edit.
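Since every edge in the time-expanded model moves forward in time, the graph is indeed a DAG, and the longest weighted path can be found by relaxing edges in topological order. A sketch (the function name and the (u, v, w) edge format are my own assumptions):

```python
from collections import defaultdict, deque

def longest_path_dag(nodes, edges):
    """Longest path in a weighted DAG.
    `edges` is a list of (u, v, w) triples.  Returns the maximum total
    weight over all paths.  Assumes the graph really is acyclic."""
    adj = defaultdict(list)
    indeg = defaultdict(int)
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1

    # Kahn's algorithm for a topological order.
    queue = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    # Relax edges in topological order; paths may start anywhere.
    dist = {n: 0 for n in nodes}
    best = 0
    for u in order:
        for v, w in adj[u]:
            dist[v] = max(dist[v], dist[u] + w)
            best = max(best, dist[v])
    return best
```

Each edge is relaxed exactly once, so the whole computation is O(V + E).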
Naive is the best you'll get -- http://en.wikipedia.org/wiki/Longest_path_problem
EDIT:
So the problem is twofold.
Build the set of feasible paths, where a leg from pointA to pointB counts as possible only if a bus actually runs from pointA to pointB at a compatible time.
Find the longest path among all the feasible paths generated above.
Another approach would be to reevaluate the graph upon each node traversal and find the longest path.
It still reduces to finding longest possible path, which is NP-Hard.

Finding a minimum/maximum weight Steiner tree

I asked this question on reddit, but haven't converged on a solution yet. Since many of my searches bring me to Stack Overflow, I decided I would give this a try. Here is a simple formulation of my problem:
Given a weighted undirected graph G(V,E,w) and a subset of vertices S in G, find the min/max weight tree that spans S. Adding vertices is not allowed. An extension of the basic model is adding edges with 0 weight, and vertices that must be excluded. This seems similar to the question asked here:
Algorithm to find minimum spanning tree of chosen vertices
There is also more insight into what values the edges can take. Each edge is actually a correlation probability, which I can encode in several ways, so the main questions I want to ask the graph are:
Given k vertices that must be connected, what are the top X min/max spanning trees that connect them, and what vertices do they pass through? As I understand it, this is the same question as asking the graph what is the highest probability of connecting all of the k vertices.
Getting more vague, is there a logical way to cluster the nodes?
As for implementation, I have the boost libraries installed, and once I get the framework rolling on this problem, I can deal with how to multi-thread it (if appropriate), what kind of graph to use, and how to store/cache the data, since the number of vertices and edges is going to be quite large.
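For the basic minimum-weight version, the classic metric-closure 2-approximation is a reasonable starting point: compute shortest-path distances between the required vertices, build an MST over that "closure", then expand each MST edge back into its underlying path. Below is a hedged Python sketch rather than Boost, for brevity; all function names are mine, and a full implementation would also prune any cycles created when expanded paths overlap:

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path distances and predecessors from src; adj: {u: {v: w}}."""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def approx_steiner_tree(adj, terminals):
    """Metric-closure 2-approximation: MST over the terminals'
    shortest-path distances, expanded back into graph edges."""
    info = {t: dijkstra(adj, t) for t in terminals}
    terminals = list(terminals)

    # Prim-style MST over the metric closure of the terminals.
    in_tree = {terminals[0]}
    closure_edges = []
    while len(in_tree) < len(terminals):
        s, t = min(
            ((a, b) for a in in_tree for b in terminals if b not in in_tree),
            key=lambda e: info[e[0]][0][e[1]],
        )
        closure_edges.append((s, t))
        in_tree.add(t)

    # Expand each closure edge into its underlying shortest path.
    tree_edges = set()
    for s, t in closure_edges:
        prev = info[s][1]
        node = t
        while node != s:
            tree_edges.add(frozenset((prev[node], node)))
            node = prev[node]
    return tree_edges
```

The maximum-weight variant is a different beast (negating weights breaks Dijkstra), so this sketch only covers the minimization direction.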
Update
Looking at the problem I am trying to solve, it makes sense that it would be NP-complete. The real world problem that I am trying to solve involves medical diagnoses; specifically when the medical community is working on a problem with a specific idea in mind, and they need to take a step back and reconsider how they got there. What I want from the program I am trying to design is:
Given several conditions, tests, symptoms, age, gender, season, confirmed diagnosis, timeline, how can you relate them? What cells/tissues/organs/systems are touched? Are they even related?
Along with the defined groups that conditions/symptoms can belong to, is there a way to logically group the conditions/symptoms?
Example
Flu-like symptoms, red eyes, early pneumonia, and some of the signs of diabetes. Is there a way to relate all of the symptoms? Are there some tests that could be done to make it easier to determine? What systems are involved?
It just seemed natural to try and map this to a graph, or several graphs, and use probabilities as the correlation between different symptoms/conditions.
I have seen models for your problem that were mostly based on Bayesian inference and fuzzy logic. Bayesian inference networks express the relation between causes and effects, e.g. smoking and lung cancer. Look here for a quick tutorial. You can apply fuzzy logic to that modelling to try to take into account the variability in real life (as not everyone who smokes gets lung cancer).

What is a good measure of strength of a link and influence of a node?

In the context of social networks, what is a good measure of strength of a link between two nodes? I am currently thinking that the following should give me what I want:
For two nodes A and B:
Strength(A,B) = (neighbors(A) intersection neighbors(B))/neighbors(A)
where neighbors(X) gives the total number of nodes directly connected to X and the intersection operation above gives the number of nodes that are connected to both A and B.
Of course, Strength(A,B) != Strength(B,A).
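The proposed measure is easy to prototype. A minimal sketch in Python, representing the network as a dict of neighbor sets (the function name and format are mine):

```python
def strength(adj, a, b):
    """Directed neighborhood-overlap measure proposed above:
    |N(a) & N(b)| / |N(a)|, where adj maps each node to its
    set of directly connected nodes."""
    if not adj[a]:
        return 0.0
    return len(adj[a] & adj[b]) / len(adj[a])
```

Because the denominator is |N(a)| only, the measure is asymmetric, as the question notes.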
Now knowing this, is there a good way to determine the influence of a node? I was initially using the degree centrality of a node to determine its "influence", but I don't think that's a good idea, because a node having a lot of outgoing links does not by itself mean much; those links should be powerful as well. In that case, maybe an aggregate of the strengths of the nodes connected to this node is a good way to estimate its influence? Am I heading in the right direction? Does anyone have any suggestions?
My Philosophy (and understanding of the terms):
Strength indicates how far A is willing to do what B has already done.
Influence indicates how far A can make B do something (persuasion, perhaps?).
Constraints:
I have access to only a subgraph; I am trying to be realistic here, because social networks are huge and having a complete view is not practical.
You might want to check out some more sophisticated notions of distance.
A really cool one is "resistance distance", which interprets the distance between two nodes in terms of random walks between them.
There are several days of lecture notes plus references to further reading at http://www.cs.yale.edu/homes/spielman/462/.
A few thoughts on this:
When you talk about influence of a node in a graph, one centrality measure that comes to mind is closeness centrality. Closeness centrality looks at how short a node's shortest paths to all other nodes are. From an influence point of view, the node with the highest closeness is the one that can share information most easily, i.e. it is nearer to more nodes than any other.
You also mention using the strengths of each node connected to a node. Maybe you should look at eigenvector centrality, which ranks a node highly if it is connected to other highly ranked nodes. PageRank is essentially a directed variant of eigenvector centrality.
Some questions that might affect your choice here are:
Is your graph directed?
Do your edges have weight? You mention strength... do you mean weights of some kind?
If you do have weights maybe the next step from a simple degree centrality would be to try a weighted degree centrality approach. Thus, just having a high number of connections doesn't automatically make you the most influential.
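The measures discussed here are easy to prototype on a subgraph. A minimal Python sketch of closeness centrality (via BFS, for an unweighted connected graph) and weighted degree (the function names are mine):

```python
from collections import deque

def closeness(adj, node):
    """Closeness centrality: (n - 1) divided by the sum of shortest-path
    lengths from `node`, computed by BFS.  adj: {node: set(neighbors)};
    assumes the graph is connected and unweighted."""
    dist = {node: 0}
    q = deque([node])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    total = sum(dist.values())
    return (len(adj) - 1) / total if total else 0.0

def weighted_degree(adj_w, node):
    """Weighted degree centrality: sum of incident edge weights.
    adj_w: {node: {neighbor: weight}}."""
    return sum(adj_w[node].values())
```

On the path A-B-C, the middle node B gets closeness 1.0 while the endpoints get 2/3, matching the intuition that B is "nearer to more nodes".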

Creating a "crossover" function for a genetic algorithm to improve network paths

I'm trying to develop a genetic algorithm that will find the most efficient way to connect a given number of nodes at specified locations.
All the nodes on the network must be able to connect to the server node and there must be no cycles within the network. It's basically a tree.
I have a function that can measure the "fitness" of any given network layout.
What's stopping me is that I can't think of a crossover function that would take 2 network structures (parents) and somehow mix them to create offspring that would meet the above conditions.
Any ideas?
Clarification: The nodes each have a fixed x,y coordinate position. Only the routes between them can be altered.
Amir- I think the idea is that every generated tree will contain the same set of nodes, arranged in a different order.
Perhaps rather than using a crossover-based genetic algorithm, you'd be better off using a less biologically inspired hill-climbing algorithm? Define a set of swaps (trading children between nodes, for example) to act as possible mutations, and then iteratively mutate and check against your fitness function. As is the case with all searches of this kind, you're vulnerable to getting stuck in local maxima, so many runs from different starting positions is a good idea.
Let me begin by answering your question with a question:
How does the fitness function behave if you create a network layout that violates the 'no cycle' rule and the 'connect to server' rule?
If it simply punishes the given network layout via a poor fitness score, you don't need to do anything special except take two network layouts and cross them over, 1/2 from layout A, 1/2 from layout B. That's a very basic cross over function, and it should work.
If however, you are responsible for constructing a valid layout and cannot rely on invalid layouts simply being weeded out, you'll need to do more work.
Sounds like you need to create a minimum spanning tree network. I know that this doesn't really answer your genetic algorithm question, but this is quite a well-understood problem. Two classical methods are Prim's and Kruskal's. Maybe these algorithms and the methods they use to select edges to connect might give you some clues. Maybe the genes don't describe the network but instead the likelihood of connecting nodes via particular edges? Or a way of picking which node to connect in next?
Or just check out someone who's done this before, or this.
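For reference, Prim's algorithm on the complete Euclidean graph over the fixed node positions gives a baseline (or a seed individual) to compare the GA against. A hedged sketch; the O(n^2) version is fine for small node counts, and the function name is mine:

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph over `points`
    (a list of (x, y) tuples).  Returns MST edges as index pairs,
    in the order they were added."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection cost to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Update connection costs through the newly added node.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges
```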
The purpose of crossover in genetic algorithms is to potentially mix good partial solutions from one parent with those of another. One way to think about partial solutions in this case may be subtrees of closely connected nodes. If your fitness function is fairly smooth in regards to small changes to localized parts of the overall tree, this may be a useful way to think about crossover.
Given this, one possible form of crossover would be the following:
Start with two parent trees, P1, and P2. Select two nodes randomly (possibly with some kind of enforcement on minimum distance between the nodes), N1 and N2.
On a node-by-node basis, "grow" a tree C1 outwards from N1 according to the linkages in P1, while simultaneously growing another tree C2 outwards from N2 according to the linkages in P2. Do not add the same node to both trees - keep the sets of nodes entirely disjoint. Continue until all nodes have been added to either C1 or C2. This gives us the "traits" from each parent to recombine, in a form guaranteed to be acyclic.
Recombination is accomplished by adding an additional link, from C1 to C2, to create the new child C. As for which link to choose, it can be selected randomly (either uniformly or according to some distribution), or by a greedy algorithm (based on some heuristic, or the overall fitness of the new tree C).
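The steps above might be sketched as follows. Note the fallback for nodes unreachable from either seed (possible when the two growth fronts block each other) is my own addition, since the description doesn't cover that case; the function name and adjacency-dict encoding are also assumptions:

```python
import random
from collections import deque

def subtree_crossover(p1, p2, rng=random):
    """Sketch of the subtree crossover described above.  p1, p2 are
    parent trees over the same node set, as {node: set(neighbors)}.
    Grows C1 from one seed via p1's links and C2 from another via p2's
    links, keeping the node sets disjoint, then joins the halves with
    one random edge.  Returns the child in the same format."""
    nodes = list(p1)
    n1, n2 = rng.sample(nodes, 2)
    owner = {n1: 1, n2: 2}
    fronts = {1: deque([n1]), 2: deque([n2])}
    parents = {1: p1, 2: p2}
    child = {v: set() for v in nodes}
    turn = 1
    while fronts[1] or fronts[2]:
        if not fronts[turn]:
            turn = 3 - turn
        u = fronts[turn].popleft()
        for v in parents[turn][u]:
            if v not in owner:           # claim unowned neighbors
                owner[v] = turn
                child[u].add(v); child[v].add(u)
                fronts[turn].append(v)
        turn = 3 - turn                  # alternate growth fronts
    # Fallback: attach any node neither growth reached.
    for v in nodes:
        if v not in owner:
            u = rng.choice([w for w in nodes if w in owner])
            owner[v] = owner[u]
            child[u].add(v); child[v].add(u)
    # Recombine: one random link between the two halves.
    a = rng.choice([v for v in nodes if owner[v] == 1])
    b = rng.choice([v for v in nodes if owner[v] == 2])
    child[a].add(b); child[b].add(a)
    return child
```

Every node except the two seeds receives exactly one edge when it is claimed, plus the single joining edge, so the child always has n-1 edges and is connected, i.e. it is a valid tree.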
You could check on the crossover operators, which make sure that you have no repeating nodes in the child chromosomes. Couple of those crossover operators are the Order Crossover (OX) and the Edge Crossover operators. Such crossover operators are also helpful in solving TSP using GAs.
And after that you will have to check whether you are getting any cycles. If you are, generate a new pair of chromosomes. This is brute force, of course.
An idea to try: encode nodes' positions in a metric space (e.g. 3-dimensional euclidean space). There are no "incorrect" assignments, so crossover is never destructive. From such an assignment you can always build a nearest neighbor tree, a minimum spanning tree or similar.
This is just example of a more general idea: do not encode the tree directly, encode some information from which a tree can always be constructed. The tricky part is to do it in such a way that the child trees keep important properties of the parents.
There was a paper in one of the early conferences that proposed the following algorithm for Traveling Salesman, which I have adapted for several graph problems with success:
Across the entire POPULATION, calculate and sort the nodes by descending number of connections (in other words, if N0 is connected in some individuals to N1, N2, N3, it has 3 connections, if N1 is always connected to N4, it has only 1).
Initially, take the node with the highest count. Call this the current_gene_node. (Say, N0)
LOOP:
Add current_gene_node to your offspring.
Remove that node from the lists of connections. (No cycles, so remove N0 from further consideration.)
If current_gene_node has no connections in the population, choose a random unchosen node in the population (mutation)
Else, from the list of connections for that node, do a lottery selection based on the prevalence of connections across the population (If current_gene_node = N0, and connections N0 are, say, N1 = 50%, N2 = 30%, N3 = 20% -- N1 has a 50% chance of being next current_gene_node).
Go to LOOP until all nodes connected
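A hedged sketch of the loop above in Python. The edge-list encoding of individuals and the function name are my own assumptions; the "lottery" is implemented by keeping duplicate neighbors in the list, so prevalence across the population directly weights the draw:

```python
import random
from collections import defaultdict

def prevalence_crossover(population, rng=random):
    """Population-level recombination as described above.  `population`
    is a list of individuals, each a list of undirected (u, v) edges.
    Returns one offspring as a node sequence."""
    links = defaultdict(list)
    for individual in population:
        for u, v in individual:
            links[u].append(v)
            links[v].append(u)

    remaining = set(links)
    # Start from the node with the most connections across the population.
    current = max(links, key=lambda n: len(links[n]))
    offspring = []
    while remaining:
        offspring.append(current)
        remaining.discard(current)
        if not remaining:
            break
        candidates = [v for v in links[current] if v in remaining]
        if candidates:
            # Lottery weighted by prevalence (duplicates weight the draw).
            current = rng.choice(candidates)
        else:
            # Mutation: choose a random unchosen node.
            current = rng.choice(sorted(remaining))
    return offspring
```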
It's not really genetic in the sense of choosing directly from 2 parents, but it follows the same mathematical pressure of selecting based on population prevalence. So it's "genetic enough" for me and for me it's worked pretty well :-)
