We have a social graph that is later broken into clusters of high cohesion, something called a "truss" by Jonathan Cohen [1].
Now that I have those clusters, I would like to come up with names for them.
The cluster name should allow insignificant changes to the cluster size without changing the name.
For example:
Let's assume we have cluster M:
M : {A, B, C, D, E, F}
and let's assume that the "naming algorithm" generated the name "m" for it.
After some time, vertex A has left the cluster, while vertex J has joined:
M : {B, C, D, E, F, J}
The newly generated name is "m'".
Desired feature:
m' == m for insignificant cluster changes
[1] http://www.cslu.ogi.edu/~zak/cs506-pslc/trusses.pdf
Based on your example, I assume you mean "insignificant changes to the cluster composition", not to the "cluster size".
If your naming function f() cannot use the information about the existing name for the given cluster, you would have to allow that sometimes it does rename despite the change being small. Indeed, suppose that f() never renames a cluster when it changes just a little. Starting with cluster A, you can get to any other cluster B by adding or removing only one element at a time. By construction, the function will return the same name for A and B. Since A, B were arbitrary, f() will return the same name for all possible clusters - clearly useless.
So, you have two alternatives:
(1) the naming function relies on the existing name of a cluster, or
(2) the naming function sometimes (rarely) renames a cluster after a very tiny change.
If you go with alternative (1), it's trivial. You can simply assign names randomly, and then keep them unchanged whenever the cluster is updated as long as it's not too different (however you define different). Given how simple it is, I suppose that's not what you want.
If you go with alternative (2), you'll need to use some information about the underlying objects in the cluster. If all you have are links to various objects with no internal structure, it can't be done, since the function wouldn't have anything to work with apart from cluster size.
So let's say you have some information about the objects. For example, you may have their names. Call the first k letters of each object's name the object's prefix. Count all the different prefixes in your cluster, and find the n most common ones. Order these n prefixes alphabetically, and append them to each other in that order. For a reasonable choice of k, n (which should depend on the number of your clusters and typical object name lengths), you would get the result you seek - as long as you have enough objects in each cluster.
For instance, if objects have human names, try k = 2; and if you have hundreds of clusters, perhaps try n = 2.
This, of course, can be greatly improved by remapping names to achieve a more uniform distribution, handling the cases where two prefixes have similar frequencies, etc.
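To make that concrete, here is a minimal C++ sketch of the prefix scheme, assuming object names are plain strings; kPrefixLen and kNumPrefixes stand in for k and n and are illustrative values you would tune as described above.

// Minimal sketch of the prefix-based naming scheme described above.
// kPrefixLen (k) and kNumPrefixes (n) are illustrative values only.
#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::string cluster_name(const std::vector<std::string>& member_names,
                         std::size_t kPrefixLen = 2,
                         std::size_t kNumPrefixes = 2) {
    // Count how often each k-letter prefix occurs in the cluster.
    std::map<std::string, int> prefix_count;
    for (const auto& name : member_names)
        ++prefix_count[name.substr(0, std::min(kPrefixLen, name.size()))];

    // Keep the n most common prefixes.
    std::vector<std::pair<std::string, int>> ranked(prefix_count.begin(),
                                                    prefix_count.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    if (ranked.size() > kNumPrefixes)
        ranked.resize(kNumPrefixes);

    // Order those prefixes alphabetically and concatenate them.
    std::sort(ranked.begin(), ranked.end());
    std::string result;
    for (const auto& p : ranked)
        result += p.first;
    return result;
}

Because a small change in membership rarely changes which prefixes are most common, the generated name usually survives such changes, which is the behaviour alternative (2) allows.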
Related
I have N lists of items
eg:
A, B, C, D
1, 2, 3
V, W, X, Y, Z
They are flattened into a single long list, and the user chooses an ordering to their liking
eg:
1, C, X, 3, B, A, Y, Z, 2, W, D, V
I need to re-order my N original lists so their relative sort order matches that in the user's ordering
eg:
C, B, A, D
1, 3, 2
X, Y, Z, W, V
The simple brute-force approach is to create N new empty containers, loop over the user's ordering and add each item into the relevant container as it is encountered.
Is there a more elegant approach?
There cannot be a more elegant approach unless assumptions can be made about the ordering of the data.
You must, at some point, create each of the N new containers.
You must also, at some point, add the necessary elements to those N containers.
These two things cannot be avoided. Your approach does both of those and nothing more, and is therefore minimal.
A minor caveat is that block array copies are slightly faster than iterative copies, so if you know of large blocks that are the same, then you can make a slightly faster copy for those blocks. But usually, in order to get that information, you must first visit and analyze the data. So instead of visiting and analyzing, you should just visit and insert.
You must either store the knowledge of which container an element came from at the beginning, or else store a mapping of element to position in the list and use that for sorting. (Or else save memory and do a ton of searching.)
If you're going to rearrange all lists, then it is more efficient to store the knowledge of which container each element comes from and proceed as you suggest. If you're going to only rearrange SOME lists (or rearrange future lists), then it may make more sense to store the mapping of element to position in the list and sort based on that. Which you can do either with a comparison function that goes through that lookup, or with a Schwartzian transform.
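For what it's worth, here is a minimal C++ sketch of the element-to-position mapping with a comparison function that goes through that lookup; the function name is mine, and it assumes elements are unique strings.

// Sketch of the "element -> position in the user's ordering" approach.
// Reorders each original list in place; assumes elements are unique strings.
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

void reorder_lists(std::vector<std::vector<std::string>>& lists,
                   const std::vector<std::string>& user_order) {
    // Map each element to its position in the user's ordering.
    std::unordered_map<std::string, std::size_t> pos;
    for (std::size_t i = 0; i < user_order.size(); ++i)
        pos[user_order[i]] = i;

    // Sort every original list by that position.
    for (auto& list : lists)
        std::sort(list.begin(), list.end(),
                  [&pos](const std::string& a, const std::string& b) {
                      return pos.at(a) < pos.at(b);
                  });
}

With the example above, the three lists come back as C, B, A, D; 1, 3, 2; and X, Y, Z, W, V.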
BTW have you thought about how to handle repeated elements?
I've been reading through The Art of Computer Programming, and though it has its moments of higher maths that I just can't get, some exercises have been fun to do.
After I've done one of them I go over to the answer to see whether I did better or worse than what the book suggests (usually worse), but I don't get at all what the answer to the one I'm currently on is trying to convey.
The book's question and proposed solution can be found here
What I've understood is that t may be the number of 'missing' elements or may be a general constant, but what I really don't understand is the seemingly arbitrary instruction to sort the copies based on their components, which to me looks like spinning your wheels in place since at first glance it doesn't get you closer to the original order. Nor do I understand the decision (among others) to replace one part of the paired names with a number (file G contains all pairs (i, x_i) for n−t < i ≤ n).
So my question is, simply, How do I extract an algorithm from this answer?
Bit of a clarification:
I understand what it aims to do, and how I would go about translating it into C++. What I do not understand is why I am supposed to sort the copies of the input file and, if so, which criteria I should sort by, as well as the reason for changing one side of the pairs to a number.
It's assumed that names are sortable, and that there are a sufficient number of tape drives to solve the problem. Define a pair as (name, next_name), where next_name is the name of the person to the west. A copy of the file of pairs is made to another tape. The first file is sorted by name, the second file by next_name. Tape sorts are bottom-up merge sorts, or a more complex variation called polyphase merge sort, but for this problem a standard bottom-up merge sort is good enough. For C++, you could use std::stable_sort() to emulate a tape sort, using a lambda function for the compare: sort by name for the first file and by next_name for the second.
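To make that first step concrete, here is a small C++ sketch of producing the two sorted copies with std::stable_sort and lambda compares; the Pair struct and the function name are just illustrative.

// Two copies of the (name, next_name) pairs, one sorted by name and one
// sorted by next_name, emulating the two tape sorts described above.
#include <algorithm>
#include <string>
#include <vector>

struct Pair {
    std::string name;       // this person's name
    std::string next_name;  // name of the person to the west
};

void sort_copies(const std::vector<Pair>& pairs,
                 std::vector<Pair>& by_name,
                 std::vector<Pair>& by_next) {
    by_name = pairs;        // "copy the file to another tape"
    by_next = pairs;
    std::stable_sort(by_name.begin(), by_name.end(),
                     [](const Pair& a, const Pair& b) { return a.name < b.name; });
    std::stable_sort(by_next.begin(), by_next.end(),
                     [](const Pair& a, const Pair& b) { return a.next_name < b.next_name; });
}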
The terminology for indexing uses name[1] to represent the eastern most name, and name[n] to represent the western most name.
After the initial sorting of the two files of pairs, the solution states that "passing over the files" is done to identify the next-to-last name, name[n-1], but doesn't specify how. In the process, I assume name[n] is also identified. The files are compared in sequence, comparing name from the first file with next_name from the second file. A mismatch indicates either the first name, name[1], or the last name, name[n], or, in a very rare circumstance, both, and the next pairs from each file have to be checked to determine what the mismatch indicates. At the point where the last name, name[n], is identified, the name from the second file's pair will be the next-to-last name, name[n-1].
Once name[n-1] and name[n] are known, a merge-like operation using both files is performed, skipping name[n-1] and name[n], to create F with pairs (name[i], name[i+2]) for i = 1 to n-2 (in name order), and G with the two pairs (n-1, x[n-1]) and (n, x[n]), also in name order (G and G' stay in name order until the last step).
F is copied to H, and an iterative process is performed as described in the algorithm, with t doubling each time, 2, 4, 8, ... . After each pass, F' contains pairs (x[i], x[i+t]) for i = 1 to n-t, then G' is sorted and merged with G back into G', resulting in a G' that contains the pairs (i, x[i]) for i = n-t to n, in name order. Eventually all the pairs end up in G (i, x[i]) for i = 1 to n, in name order, and then G is sorted by index (left part of pair), resulting in the names in sorted order.
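The doubling passes are easier to see in memory than on tape. The following C++ sketch is only an in-memory analogue (list ranking by pointer jumping, on made-up data): after each round every node knows the node 2^k steps to its west and how many steps that jump spans, which is what the (x[i], x[i+t]) pairs in F' record; the tape version performs the same composition with sorts and merges because it has no random access.

// In-memory analogue of the doubling passes: list ranking by pointer jumping.
// next[i] is the index of the person one step to the west, or -1 at the western end.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> next = {2, -1, 4, 1, 3};   // toy chain: 0 -> 2 -> 4 -> 3 -> 1
    const int n = static_cast<int>(next.size());

    std::vector<int> jump = next;               // after round k: node 2^k steps west
    std::vector<int> dist(n);                   // how many steps that jump spans
    for (int i = 0; i < n; ++i)
        dist[i] = (next[i] == -1) ? 0 : 1;

    bool active = true;
    while (active) {                            // O(log n) doubling rounds
        active = false;
        std::vector<int> jump2 = jump;
        std::vector<int> dist2 = dist;
        for (int i = 0; i < n; ++i) {
            if (jump[i] != -1) {                // compose the jump with itself
                dist2[i] = dist[i] + dist[jump[i]];
                jump2[i] = jump[jump[i]];
                active = true;
            }
        }
        jump.swap(jump2);
        dist.swap(dist2);
    }

    // dist[i] is now the distance from node i to the western end,
    // so node i sits in east-to-west position n - dist[i].
    for (int i = 0; i < n; ++i)
        std::printf("node %d is in position %d\n", i, n - dist[i]);
    return 0;
}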
I have:
1 million university student names and
3 million bank customer names
I managed to convert the strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how I can determine the correlation between these two sets, to see if the values pair up at least 60%?
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hashes; use the original strings instead.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)

// First, match the entities that match very well, and REMOVE them.
for a in List A
    for b in List B
        if compare(a,b) >= MATCH_THRESHOLD   // This may be 90% etc
            add (a,b) to matchedList
            remove a from List A
            remove b from List B

// Now, match the entities that match well, and run bipartite matching.
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side.
for a in List A
    for b in List B
        compute compare(a,b)
        set edge(a,b) = compare(a,b)
        if compare(a,b) < THRESHOLD          // This seems to be 60%
            set edge(a,b) = 0

// Now, run the bipartite matcher and take the results.
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend on your specific entity-resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding sublists. But it may very well be that the last name "Nuth" is supposed to match "Knuth", etc. So some local knowledge of what your name-comparison function is can help you divide and conquer this problem better.
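Assuming the original strings are available, here is a hedged C++ sketch of that divide-and-conquer idea: bucket both lists by a short blocking key (here simply the first two characters of the string, purely as an illustration; in practice you would key on the last name) and only run the pairwise comparison within matching buckets. The compare() below is a stand-in for whatever similarity function you actually use, and, as noted above, blocking like this will miss pairs such as "Nuth"/"Knuth" whose keys differ.

// Sketch of blocking so the O(n1*n2) comparison only runs inside buckets.
// The blocking key and the placeholder compare() are assumptions.
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in similarity in [0, 1]; substitute your real compare() here.
double compare(const std::string& a, const std::string& b) {
    return a == b ? 1.0 : 0.0;
}

// Blocking key: here just the first two characters of the string.
std::string block_key(const std::string& name) {
    return name.substr(0, std::min<std::size_t>(2, name.size()));
}

std::vector<std::pair<std::string, std::string>>
match_lists(const std::vector<std::string>& list_a,    // e.g. the 1M students
            const std::vector<std::string>& list_b,    // e.g. the 3M customers
            double threshold) {                        // e.g. 0.6 for the 60% cut
    std::unordered_map<std::string, std::vector<std::string>> buckets;
    for (const auto& b : list_b)
        buckets[block_key(b)].push_back(b);

    std::vector<std::pair<std::string, std::string>> matches;
    for (const auto& a : list_a) {
        auto it = buckets.find(block_key(a));
        if (it == buckets.end())
            continue;                                  // no candidate shares the key
        for (const auto& b : it->second)
            if (compare(a, b) >= threshold)
                matches.emplace_back(a, b);
    }
    return matches;
}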
I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.
I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time I would go for a simpler solution.
Create an adjacency matrix for the graph using the parent data (should be easy)
Perform a topological sort using this matrix. (or even use tsort if pressed for time)
Now that you have a topological sort, create an array level, one element for each node.
For each node, in topological order:
If the node has no parents, set its level to 0.
Otherwise set it to the maximum of its parents' levels + 1.
Find the level containing the most nodes; that count is the width.
The question is as Keith Randall asked, is this the right measurement you need?
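Here is a hedged C++ sketch of steps 2 through 4, assuming you already have each node's parent indices and a topological order (the Node layout is made up): give each node a level one past its deepest parent, then count nodes per level. The maximum level assigned is also the critical path length from the earlier answer, so dividing the node count by it gives the average parallelism suggested there.

// Sketch of level assignment and width counting over a topological order.
// The Node layout and the precomputed topological order are assumptions.
#include <algorithm>
#include <vector>

struct Node {
    std::vector<int> parents;   // indices of the nodes this node depends on
    int level = 0;
};

int dag_width(std::vector<Node>& nodes, const std::vector<int>& topo_order) {
    int max_level = 0;          // this is also the critical path length
    for (int idx : topo_order) {                 // parents come before children
        Node& n = nodes[idx];
        n.level = 0;
        for (int p : n.parents)
            n.level = std::max(n.level, nodes[p].level + 1);
        max_level = std::max(max_level, n.level);
    }

    // Count the nodes on each level; the width is the largest count.
    std::vector<int> level_count(max_level + 1, 0);
    for (const Node& n : nodes)
        ++level_count[n.level];
    return *std::max_element(level_count.begin(), level_count.end());
}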
Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine the leaves by adding nodes with children == 0 to a list.
for l in L:
    l.children = 0

for l in L:
    l.level = 0
    for p in l.parents:
        ++p.children

Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally by depth, I mean topologically sort and assign depth=0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth=0 corresponding to leaves. Also, we want to make sure that no node is added to the queue without all its children "looking at it" first (to determine its proper "depth level").
max_level = 0
while !Leaves.empty():
    l = Leaves.pop()
    for p in l.parents:
        --p.children_left
        if p.children_left == 0:
            /* we only want to append parents with for sure correct level */
            Leaves.append(p)
        p.level = Max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = new int[max_level+1]
for l in L:
    ++level_count[l.level]
width = Max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it's got like five or six linear scans and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure-- without actually changing the underlying code beyond node augmentation.
Any thoughts?
I am trying to enumerate a number of failure cases for a system I am working on to make writing test cases easier. Basically, I have a group of "points" which communicate with an arbitrary number of other points through data "paths". I want to come up with failure cases in the following three sets...
Set 1 - Break each path individually (trivial)
Set 2 - For each point P in the system, break paths so that P is completely cut off from the rest of the system (also trivial)
Set 3 - For each point P in the system, break paths so that the system is divided into two groups of points (A and B, excluding point P) so that the only way to get from group A to group B is through point P (i.e., I want to force all data traffic in the system through point P to ensure that it can keep up). If this is not possible for a particular point, then it should be skipped.
Set 3 is what I am having trouble with. In practice, the systems I am dealing with are small and simple enough that I could probably "brute force" a solution (generally I have about 12 points, with each point connected to 1-4 other points). However, I would be interested in finding a more general algorithm for this type of problem, if anyone has any suggestions or ideas about where to start.
Here's some pseudocode, substituting the common graph-theory terms "nodes" for "points" and "edges" for "paths", assuming a path connects two points.
for each P in nodes:
    for each subset A of nodes - {P}:
        B = nodes - A - {P}
        for each node in A:
            for each edge out of that node:
                if the other end is in B:
                    break edge
        run test
        replace edges if necessary
Unless I'm misunderstanding something, the problem seems relatively simple as long as you have a method of generating the subsets of nodes-{P}. This will test each partition [A,B] twice unless you put some other check in there.
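For the subset generation, here is a small C++ sketch, assuming the node count stays around the dozen mentioned in the question so a bitmask fits in an int. Fixing one element of nodes - {P} in A makes each unordered partition [A, B] appear exactly once, which also avoids testing it twice.

// Enumerate the partitions [A, B] of nodes - {P} with a bitmask.
// Assumes the node count is small (the question mentions roughly 12 points).
#include <vector>

void enumerate_partitions(int node_count, int p,
                          std::vector<std::vector<int>>& as,
                          std::vector<std::vector<int>>& bs) {
    std::vector<int> rest;                       // nodes - {P}
    for (int i = 0; i < node_count; ++i)
        if (i != p)
            rest.push_back(i);

    const int m = static_cast<int>(rest.size());
    if (m < 2)
        return;                                  // cannot split into two non-empty groups

    // Fix rest[0] in A so each unordered partition shows up exactly once.
    for (int mask = 0; mask < (1 << (m - 1)); ++mask) {
        std::vector<int> a{rest[0]}, b;
        for (int i = 1; i < m; ++i) {
            if ((mask >> (i - 1)) & 1)
                a.push_back(rest[i]);
            else
                b.push_back(rest[i]);
        }
        if (b.empty())
            continue;                            // both groups must be non-empty
        as.push_back(a);
        bs.push_back(b);
    }
}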
There are general algorithms for 'colouring' (or 'coloring', depending on whether you want the UK or US articles) networks. However, this is overkill for the relatively simple problem you describe.
Simply divide the nodes between two sets, then in pseudo-code:
foreach Node n in a.Nodes
    foreach Edge e in n.Edges
        if e.otherEnd in b then
            e.break()
            broken.add(e)

broken.get(rand(broken.size())).reinstate()
Either use rand to choose a broken link to reinstate, or systematically reinstate one at a time.
Repeat for b (or structure your edges such that a break in one direction affects the other)