How does the make "-j" option actually work?

From the man pages:
-j [jobs], --jobs[=jobs]
    Specifies the number of jobs (commands) to run simultaneously. If there is more than one -j option, the last one is effective. If the -j option is given without an argument, make will not limit the number of jobs that can run simultaneously.
I know make uses a dependency graph to determine which rules are independent.
I would like to know how this graph is built and what criteria are used.
Thanks.

The dependency graph is based, as one would expect, on the prerequisites listed for each Makefile target. make will build a graph where the targets and prerequisites are vertices and there is a directed edge from the prereqs to their targets. In this way the number of incoming edges tells you how many prereqs a target has. If it has no incoming edges, then it has no prerequisites.
The vertices for the .c and .h files, for example, will have no incoming edges. Those files are your source files and do not need to be built.
It then performs a topological sort on the graph to determine the order of execution. From Wikipedia:
The canonical application of topological sorting (topological order) is in scheduling a sequence of jobs or tasks; topological sorting algorithms were first studied in the early 1960s in the context of the PERT technique for scheduling in project management (Jarnagin 1960). The jobs are represented by vertices, and there is an edge from x to y if job x must be completed before job y can be started (for example, when washing clothes, the washing machine must finish before we put the clothes to dry). Then, a topological sort gives an order in which to perform the jobs.
The gist of a topological sort is to find the vertices with no incoming edges (no dependencies) and put those first. Then remove them from the graph. Now you'll have a new set of vertices with no incoming edges (no dependencies). Those are next. And so on until finished. (If you ever reach a point when there are no such vertices then the dependency graph contains a cycle, which is an error condition.)
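As a concrete illustration, here is a minimal Python sketch of that peeling process (Kahn's algorithm). The vertex/edge encoding is assumed for illustration only, not anything make itself exposes:

    from collections import deque

    def topo_sort(vertices, edges):
        # edges are (prereq, target) pairs: a directed edge prereq -> target
        indegree = {v: 0 for v in vertices}
        children = {v: [] for v in vertices}
        for prereq, target in edges:
            children[prereq].append(target)
            indegree[target] += 1
        ready = deque(v for v in vertices if indegree[v] == 0)  # no prerequisites
        order = []
        while ready:
            v = ready.popleft()
            order.append(v)
            for child in children[v]:
                indegree[child] -= 1
                if indegree[child] == 0:   # all of this child's prereqs are done
                    ready.append(child)
        if len(order) != len(vertices):
            raise ValueError("cycle in the dependency graph")
        return order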
In a typical Makefile this means you'll first build the source files (nothing needs to be done). Then the object files that depend on those source files. Then the libraries and executables built from those object files.
Under normal, non-parallel operation, make will simply pick a single target each iteration and build it. When running in parallel, it will grab as many dependency-free targets as it can and build them simultaneously, up to the permitted number of jobs.
So when make gets to, say, the object file step, it will have a large number of vertices in the graph that all have no incoming edges. It knows it can build the object files in parallel and so it forks off n copies of gcc to build the object files.

I suspect you're expecting something more magical than there actually is here. Makefiles contain lines like:
target: prereq1 prereq2 prereq3 ...
This defines a relationship between files in the system; in graph parlance, each white-space-separated word on the line implicitly declares a node in the graph, and a directed edge is created between each of the nodes to the left of the colon and each of the nodes to the right of the colon, pointing from the latter to the former.
From there it's a simple matter of traversing the graph to find nodes that have no incoming edges and executing the commands associated with those nodes, then working back 'up' the graph from there.
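A hedged sketch of that edge construction, assuming bare `target: prereqs` lines with no variables, pattern rules, or line continuations:

    def parse_rules(lines):
        edges = []   # (prereq, target): the edge points from prereq to target
        for line in lines:
            if line.startswith("\t") or ":" not in line:
                continue                          # skip recipe and other lines
            targets, _, prereqs = line.partition(":")
            for t in targets.split():
                for p in prereqs.split():
                    edges.append((p, t))
        return edges

    edges = parse_rules(["app: main.o util.o", "main.o: main.c util.h"])
    # -> [('main.o', 'app'), ('util.o', 'app'), ('main.c', 'main.o'), ('util.h', 'main.o')]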
Hope that helps.

Related

Most probable path taken to reach a particular node in graph

We have a system where customers come and interact, trigger jobs, and perform many actions. We have thousands of such users. Each job has a name, and our backend database has all the data about the customer interactions.
These jobs fail often. We know why a particular job failed based on its inputs, but now we want to find the path the user took (the journey) before reaching the failing job. We want to see if we can improve the experience earlier in the journey so that the failure is avoided.
Example (hypothetical): login -> create file -> save file -> download file. Download file is failing with some error. Say this usually happens when a save has just completed; if you perform some other operation between saving and downloading, then download does not fail. That is possibly a hidden bug.
The question is: given a history of 3000 users' graph traversals (taking paths of length 5 as a moving window), build a system that, when asked
"what are the most probable paths to reach node X",
gives the top 5 most probable paths to reach X.
I have created the nodes as [jobName][State], for example loginSuccess->createFileSuccess->SaveFileSuccess->DownloadFailed. X will typically be a [Job Name]Failed node that we will query.
We have about 50 jobs and three states: success, failed, cancelled.
Any idea how to build this model, which algorithm to use, and how to reverse generate the probabilities when a node is asked?
Adding some more clarity: given a target node, can I list the most probable paths of length 5 to reach it? I don't know the starting points from which to run Dijkstra's algorithm. Also, a low-probability direct path might exist from a given starting node straight to the target node, but I need to find paths of length 5.
The first step I would take would be to construct a list of records of length 5, where each such record contains the 5 steps taken by a particular customer leading up to node X. Then you could simply sort this list and count the number of times each possible record occurs in it, to work out the most popular records.
Another approach would be to assign each edge exiting a node a score which was the fraction of paths that exited that node to exit it via that edge. Then compute the overall score for a path by multiplying together the scores for its edges, and again take the observed paths with highest scores.
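A minimal sketch of the first approach, assuming each user journey is already available as a list of [jobName][State] node names (the helper name and `all_journeys` are illustrative):

    from collections import Counter

    def top_paths_to(journeys, target, length=5, top=5):
        windows = Counter()
        for journey in journeys:
            for i, node in enumerate(journey):
                # count each length-`length` window that ends at the target node
                if node == target and i + 1 >= length:
                    windows[tuple(journey[i - length + 1 : i + 1])] += 1
        return windows.most_common(top)

    # e.g. top_paths_to(all_journeys, "DownloadFailed") -> 5 most frequent 5-step paths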
From what I have understood, you need to find the path most likely followed by users. You can make a node for each process, and two processes are connected if a customer goes from one to the other.
STEP 1. Construct a weighted graph over all 3000 users: the weight of an edge is the number of times a user goes from one process to another, so each time you encounter an already-built edge, increment its weight by 1; otherwise create a new edge with weight 1.
Now, to find the most probable path from a source node to another:
STEP 2. Apply Dijkstra's algorithm, but with a small change. Dijkstra's algorithm finds the smallest path from one node to every other node, so you need to find the maximum path from one node to another instead.
I think it should work, since all the edges have positive weights; it will give you the most probable path taken from one node to another by all users, and you can easily retrieve the data for all nodes from source to destination.
But it will only give you the most probable path, not the top 5.
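One way to make the "maximum probability" variant fit standard Dijkstra is to first turn each edge count into a probability and then use -log(p) as the edge weight: minimising a sum of -log(p) terms maximises the product of the probabilities. A sketch under that assumption (all names illustrative; per the caveat above, it still returns only the single best path):

    import heapq, math

    def most_probable_path(counts, source, target):
        # counts[u][v] = number of observed transitions u -> v
        weight = {}
        for u, outs in counts.items():
            total = sum(outs.values())
            weight[u] = {v: -math.log(c / total) for v, c in outs.items()}
        dist, prev = {source: 0.0}, {}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == target:
                break
            if d > dist.get(u, float("inf")):
                continue                      # stale heap entry
            for v, w in weight.get(u, {}).items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        path = [target]
        while path[-1] != source:
            path.append(prev[path[-1]])       # KeyError here means target is unreachable
        return path[::-1]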

Tool that outputs the longest chain of calls

The context: I am carrying out the analysis procedure described here: approach.
The blocking point is finding the "longest chain of calls" for the project under observation. What tool can be used for finding this? I presume it will be a static analysis tool.
Otherwise, I presume a call graph generator can be used for this purpose. But then, how do I infer the longest chain of calls?
The "longest" call chain in terms of hops is straightforward to determine from a call graph. This assumes you have acquired one, and yes this generally requires static analysis. (You can get a dynamically-generated call graph, but it likely won't exhibit all possible calls). Acquiring one for firefox, with function calls that cross language boundaries, if you want to take those into account, may be pretty challenging.
Given such a call graph:
Start with the call graph, with blank call-path-length values on each node.
Mark the root(s) [your call graph may be a DAG] with the call-path-length value 0.
For each unlabelled child for which all parents have been labelled, label that child with the max of its parents' values, plus 1.
Repeat until all children are labelled. [If you have a cycle in the call graph, you have to decide either to ignore it or to treat it as some constant addition to the path length.]
The longest path is easily extracted by starting with the highest-value node and walking backwards to the most expensive parent at each step until a root is reached.
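A sketch of that forward labelling in Python, assuming the call graph is already available as an acyclic edge list (the function name is illustrative):

    from collections import deque

    def longest_call_chain(nodes, calls):
        # calls are (caller, callee) edges
        indegree = {n: 0 for n in nodes}
        callees = {n: [] for n in nodes}
        for caller, callee in calls:
            callees[caller].append(callee)
            indegree[callee] += 1
        depth = {n: 0 for n in nodes}         # roots keep depth 0
        parent = {}
        ready = deque(n for n in nodes if indegree[n] == 0)
        while ready:
            n = ready.popleft()
            for c in callees[n]:
                if depth[n] + 1 > depth[c]:
                    depth[c] = depth[n] + 1
                    parent[c] = n             # remember the most expensive parent
                indegree[c] -= 1
                if indegree[c] == 0:          # all parents are now labelled
                    ready.append(c)
        end = max(nodes, key=depth.get)       # the deepest node ends the longest chain
        chain = [end]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain[::-1]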
[EDIT] If you instead back-propagate leaf values to parents, you end up with a graph in which every node is labelled with how far it reaches.
Are you sure you want longest-chain? Or do you want Worst Case Execution time?

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes, and then set2 comes in later; we apply business logic to create edges from set1 to set2 (and discard any vertices from set1 that do not have edges to set2). Then at a later point we get set3, set4, and so on, and the same process applies between each set and its previous set.
Question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I tried to run analytics between the current set and the previous set, I had to loop through the entire graph and look for all the nodes whose names started with 'setx'. That took a long time as the graph grew, so I thought of another solution: create a node called 'set1' and connect it to all nodes of that particular set. I am testing it, but I was wondering if there is a more efficient or built-in way of handling data structures like this. Is there a way to somehow segment data like this?
I think a general solution would be applicable, but if it helps, I'm using Neo4j (so any solution specific to that database would be good as well).
You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
Your second solution is what I would do: create a setX node and connect all nodes belonging to that set to it. That way your data is partitioned and it is easier to query.

An algorithm to check if a vertex is reachable

Is there an algorithm that can check, in a directed graph, if a vertex, let's say V2, is reachable from a vertex V1, without traversing all the vertices?
You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the data in the graph to produce a matrix with a bit at the intersection of any V1 and V2 to indicate whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it can avoid traversing the graph at the time of the query. I.e., it's basically a greedy algorithm that assumes you're going to eventually use enough of the combinations that it's easiest to just traverse them all and store the result. Depending on the size of the graph, the pre-processing step may be slow, but once it's done executing a query becomes quite fast (constant time, and usually a pretty small constant at that).
Depth-first search or breadth-first search. Stop when you find the target. But no, there's no way to tell that there is no path without going through every one. You can sometimes improve the performance with heuristics if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map, and you know that most of the time there's going to be a mostly direct path, you can have the depth-first search look along lines that "aim towards the target". However, imagine the case where the start and end points are right next to each other but with no direct route in between, so that to find the path you have to go way out of the way. You have to check every case in order to be exhaustive.
I doubt it has a name, but a breadth-first search might go like this:
Add V1 to a queue of nodes to be visited
While there are nodes in the queue:
    Remove a node from the queue
    If the node is V2, return true
    Mark the node as visited
    For every node at the end of an outgoing edge which is not yet visited:
        Add this node to the queue
    End for
End while
Return false
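A direct Python rendering of that pseudocode (the adjacency encoding and function name are assumptions for illustration):

    from collections import deque

    def is_reachable(adjacency, v1, v2):
        # adjacency maps each vertex to the vertices at the end of its outgoing edges
        queue, visited = deque([v1]), set()
        while queue:
            node = queue.popleft()
            if node == v2:
                return True
            visited.add(node)
            for nxt in adjacency.get(node, []):
                if nxt not in visited:
                    queue.append(nxt)
        return False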
Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
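A sketch of that check with NumPy, kept boolean so the entries don't overflow (which sacrifices the path-count information mentioned above; the function name is illustrative):

    import numpy as np

    def reachable_by_powers(adj, u, v):
        # adj: 0/1 adjacency matrix as a list of lists
        M = np.array(adj, dtype=bool)
        P = M.copy()
        for _ in range(len(adj)):              # examine M^1 .. M^n
            if P[u, v]:                        # some walk of this length connects u to v
                return True
            P = (P.astype(int) @ M.astype(int)) > 0   # next power, re-booleanised
        return False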
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)
In principle, you can't determine that a path exists without traversing some part of the graph, because the failure case (a path does not exist) cannot be determined without traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.
You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's Algorithm, which creates an O(n log n) space data structure that takes O(n log n) time to build and answers queries in O(1) time.
Another approach to this problem allows you to ignore the vertices entirely. If you look only at the edges, you can produce a transitive closure array that shows, for each vertex, every vertex reachable from it.
Start with your list of edges:
Va -> Vc
Va -> Vd
....
Create an array with start locations as the rows and end locations as the columns. Fill the array with 0s. For each edge in the list of edges, place a 1 at the (start, end) coordinate of that edge.
Now iterate until either the (V1, V2) entry is 1 or a full pass produces no changes.
For each row N:
    NextRowN = RowN
    For each column C that is 1 in RowN:
        OR vertex C's entire row into NextRowN (boolean OR)
    Set RowN to NextRowN
If you run this algorithm until the end, you will quickly have a complete list of all reachable vertices without ever examining the vertices themselves. The runtime scales with the number of edges and the number of passes needed, so this would work well with a reasonable implementation and a reasonable number of edges.
A slightly more complex version of this algorithm calculates only the vertices reachable from V1. To do this, you would restrict your attention to the rows that are currently reachable at any given time. You can also limit adding each row to only one time, since the other rows never change.
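A hedged sketch of the full-closure version, iterating the boolean OR step until a pass makes no changes (the function name is illustrative):

    def transitive_closure(n, edges):
        # reach[a][b] is True once b is known to be reachable from a
        reach = [[False] * n for _ in range(n)]
        for a, b in edges:
            reach[a][b] = True
        changed = True
        while changed:
            changed = False
            for a in range(n):
                for b in range(n):
                    if reach[a][b]:
                        for c in range(n):
                            # OR row b into row a
                            if reach[b][c] and not reach[a][c]:
                                reach[a][c] = True
                                changed = True
        return reach

    # reach = transitive_closure(4, [(0, 2), (2, 3)]); reach[0][3] -> True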
In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth-first or breadth-first search that stops expanding at any vertex it has already seen, so each vertex is processed on its first occurrence only. Make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.

How do I find all paths through a set of given nodes in a DAG?

I have a list of items (blue nodes below) which are categorized by the users of my application. The categories can themselves be grouped and categorized.
The resulting structure can be represented as a Directed Acyclic Graph (DAG) where the items are sinks at the bottom of the graph's topology and the top categories are sources. Note that while some of the categories might be well defined, a lot is going to be user defined and might be very messy.
Example:
[Example graph image not included; source: theuprightape.net]
On that structure, I want to perform the following operations:
find all items (sinks) below a particular node (all items in Europe)
find all paths (if any) that pass through all of a set of n nodes (all items sent via SMTP from example.com)
find all nodes that lie below all of a set of nodes (intersection: goyish brown foods)
The first seems quite straightforward: start at the node, follow all possible paths to the bottom and collect the items there. However, is there a faster approach? Remembering the nodes I already passed through probably helps avoiding unnecessary repetition, but are there more optimizations?
How do I go about the second one? It seems that the first step would be to determine the height of each node in the set, so as to determine at which one(s) to start, and then find all paths below that which include the rest of the set. But is this the best (or even a good) approach?
The graph traversal algorithms listed on Wikipedia all seem to be concerned with either finding a particular node or finding the shortest or otherwise most effective route between two nodes. I think neither is what I want, or did I just fail to see how this applies to my problem? Where else should I read?
It seems to me that it's essentially the same operation for all 3 questions. You're always asking "find all X below node(s) Y, where X is of type Z". All you need is a generic mechanism for 'locate all nodes below node' (solves Q3), and then you can filter the results for 'nodetype=sink' (solves Q1). For Q2, you have the starting point (your node set) and your ending point (any sink below the starting point), so your solution set is all paths from the specified starting node to the sink. So I would suggest that what you basically have is a tree, and basic tree-traversal algorithms would be the way to go (see the sketch below).
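A sketch of that generic 'locate all nodes below a node' mechanism with memoisation, plus the sink filter (the graph encoding and function names are assumptions for illustration):

    def nodes_below(graph, start, cache=None):
        # graph maps each node to the list of its children (empty for sinks)
        if cache is None:
            cache = {}
        if start in cache:
            return cache[start]
        below = set()
        for child in graph.get(start, []):
            below.add(child)
            below |= nodes_below(graph, child, cache)
        cache[start] = below
        return below

    def items_below(graph, start):
        # the items (sinks) are the nodes with no outgoing edges
        return {n for n in nodes_below(graph, start) if not graph.get(n)}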
Despite the fact that your graph is acyclic, the operations you cite remind me of similar aspects of control flow graph analysis. There is a rich set of algorithms based on dominance that may be applicable. For example, your third operation reminds me of computing dominance frontiers; I believe that algorithm would work directly if you temporarily introduce "entry" and "exit" nodes. The entry node connects the "given set of nodes" and the exit node connects the sinks.
Also see Robert Tarjan's basic algorithms.
