How do I extract an algorithm from these instructions? - sorting

I've been reading through The Art of Computer Programming, and though it has its moments of higher maths that I just can't get, some exercises have been fun to do.
After I've done one of them I go over to the answer, see if I did better or worse than what the book suggests (usually worse), But I don't get what the answer for the current one I'm on is trying to convey at all.
The book's question and proposed solution can be found here
What I've understood is that t may be the number of 'missing' elements or may be a general constant, but what I really don't understand is the seemingly arbitrary instruction to sort them based on their components, which to me looks like spinning your wheels in place since at first glance it doesn't get you closer to the original order. And the decision (among others) to replace one part of the paired names with a number ( file G contains all pairs (i,xi) for n−t < i ≤ n).
So my question is, simply, How do I extract an algorithm from this answer?
Bit of a clarification:
I understand what it aims to do, and how I would go into translating it into C++. What I do not understand is why I am supposed to sort the copies of the input file, and if so which criteria do I sort by, as well as the reasons to changing one side of the pairs to a number.

It's assumed that names are sortable, and that there are a sufficient number of tape drives to solve the problem. Define a pair as (name, next_name), where next_name is the name of person to the west. A copy of the file of pairs is made to another tape. The first file is sorted by name, the second file is sorted by next_name. Tape sorts are bottom up merge sort or a more complex variation called polyphase merge sort, but for this problem, standard bottom up merge sort is good enough. For C++, you could use std::stable_sort() to emulate a tape sort, using a lambda function for the compare, sorting by name for the first file and sorting by next_name for the second file.
The terminology for indexing uses name[1] to represent the eastern most name, and name[n] to represent the western most name.
After the initial sorting of the two files of pairs, the solution states that "passing over the files" is done to identify the next to last name, name[n-1], but doesn't specify how. In the process, I assume name[n] is also identified. The files are compared in sequence, comparing name from first file with next_name from second file. A mismatch indicates either the first name, name[1], or the last name, name[n], or in a vary rare circumstance, both, and the next pairs from each file have to be checked to determine what the mismatch indicates. At the time that the last name, name[n] is identified, then name from the second file pair will be the next to last name, name[n-1].
Once name[n-1] and name[n] are known, a merge like operation using both files is performed, skipping name[n-1] and name[n] to create F with pairs (name[i], name[i+2]) for i = 1 to n-2 (in name order), and G with two pairs (n-1, x[n-1]), and (n, x[n]), also in name order (G and G' are in name order until the last step).
F is copied to H, and an iterative process is performed as described in the algorithm, with t doubling each time, 2, 4, 8, ... . After each pass, F' contains pairs (x[i], x[i+t]) for i = 1 to n-t, then G' is sorted and merged with G back into G', resulting in a G' that contains the pairs (i, x[i]) for i = n-t to n, in name order. Eventually all the pairs end up in G (i, x[i]) for i = 1 to n, in name order, and then G is sorted by index (left part of pair), resulting in the names in sorted order.

Related

Algorithm for predicting most likely items from lists of data

Lets say I have N lists which are known. Each list has items, which may repeat (Not a set)
eg:
{A,A,B,C}, {A,B,C}, {B,B,B,C,C}
I need some algorithm (Some machine-learning one maybe?) which answers the following question:
Given a new & unknown partial list of items, for example, {A,B}, what is the probability that C will appear in the list based on the what I know from the previous lists. If possible, I would like a more fine-grained probability of: given some partial list L, what is the probability that C will appear in the list once, probability it will appear twice, etc... Order doesn't matter. The probability of C appearing twice in {A,B} should equal it appearing twice in {B,A}
Any algorithms which can do this?
This is just pure mathematics, no actual "algorithms", simply estimate all the probabilities from your dataset (literally count the occurences). In particular you can do very simple data structure to achieve your goal. Represent each "list" as bag of letters, thus:
{A,A,B,C} -> {A:2, B:1, C:1}
{A,B} -> {A:1, B:1}
etc. and create basic reverse indexing of some sort, for example keep indexes for each letter separately, sorted by their counts.
Now, when a query comes, like {A,B} + C all you do is you search for your data that contains at least 1 A and 1 B (using your indexes), and then estimate probability by computing the fraction of retrived results containing C (or exactly one C) vs. all retrived results (this is a valid probability estimate assuming that your data is a bunch of independent samples from some underlying data-generating distribution).
Alternatively, if your alphabet is very small you can actually precompute all the values P(C|{A,B}) etc. for all combinations of letters.

Weighted unordered string edit distance

I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. Like in the Levenshtein distance, which only works for sequences, I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.
Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything with those search terms, so I'm interested to learn if the problem is known by another name?
To clarify, the problem would be solved by
def unordered_edit_distance(target, source):
return min(edit_distance(target, source_perm)
for source_perm in permuations(source))
So for instance, the unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows large very quickly and is not practical even for moderately sized inputs.
EDIT Make it clearer that operations are associated with different costs.
Sort them (not necessary), then remove items which are same (and in equal numbers!) in both sets.
Then if the sets are equal in size, you need that numer of substitutions; if one is greater, then you also need some insertions or deletions. Anyway you need the number of operations equal the size of the greater set remaining after the first phase.
Although your observation is kind of correct, but you are actually make a simple problem more complex.
Since source can be any permutation of the original source, you first need check the difference in character level.
Have two map each map count the number of individual characters in your target and source string:
for example:
a: 2
c: 1
d: 100
Now compare two map, if you missing any character of course you need to insert it, and if you have extra character you delete it. Thats it.
Let's ignore substitutions for a moment.
Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:
Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:
The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched
Which can be solved with the Gale-Shapley algorithm:
The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.
Now the above algorithm guarantees that all insertion or deletions gets paired - we don't necessarily want this, but there's an easy fix - just create a bunch of "stay-as-is" elements (on both the insertion and deletion side) - any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would result in it remaining an insertion or deletion and nothing would happen for two "stay-as-is" elements ending up paired.
Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.
So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.
Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to the second unordered string (t).
Take X, the sum of s[c] - t[c] for all c such that s[c] > t[c], and you will get the number of characters you have to remove from s in total. Take Y, the sum of t[c] - s[c] for all c such that s[c] < t[c], and you will get the number of characters you have to remove from t in total.
Now, let Z = min (X, Y). We can have Z substitutions, and what's left is X - Z insertions and Y - Z deletions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min (X, Y).

Algorithm to match sequential subset from a list

I am trying to remember the right algorithm to find a subset within a set that matches an element of a list of possible subsets. For example, given the input:
aehfaqptpzzy
and the subset list:
{ happy, sad, indifferent }
we can see that the word "happy" is a match because it is inside the input:
a e h f a q p t p z z y
I am pretty sure there is a specific algorithm to find all such matches, but I cannot remember what it is called.
UPDATE
The above example is not very good because it has letter repetitions, in fact in my problem both the dictionary entries and the input string are sortable sets. For example,
input: acegimnrqvy
dictionary:
{ cgn,
dfr,
lmr,
mnqv,
eg }
So in this example the algorithm would return cgn, mnqv and eg as matches. Also, I would like to find the best set of complementary matches where "best" means longest. So, in the example above the "best" answer would be "cgn mnqv", eg would not be a match because it conflicts with cgn which is a longer match.
I realize that the problem can be done by brute force scan, but that is undesirable because there could be thousands of entries in the dictionary and thousands of values in the input string. If we are trying to find the best set of matches, computability will become an issue.
You can use the Aho - Corrasick algorithm with more than one current states. For each of the input letters, you either stay (skip the letter) or move using the appropriate edge. If two or more "actors" meet at the same place, just merge them to one (if you're interested just in the presence and not counts).
About the complexity - this could be as slow as the naive O(MN) approach, because there can be up to size of dictionary actors. However, in practice, we can make a good use of the fact that many words are substrings of others, because there never won't be more than size of the trie actors, which - compared to the size of the dictionary - tends to be much smaller.

Word Prediction algorithm

I'm sure there is a post on this, but I couldn't find one asking this exact question. Consider the following:
We have a word dictionary available
We are fed many paragraphs of words, and I wish to be able to predict the next word in a sentence given this input.
Say we have a few sentences such as "Hello my name is Tom", "His name is jerry", "He goes where there is no water". We check a hash table if a word exists. If it does not, we assign it a unique id and put it in the hash table. This way, instead of storing a "chain" of words as a bunch of strings, we can just have a list of uniqueID's.
Above, we would have for instance (0, 1, 2, 3, 4), (5, 2, 3, 6), and (7, 8, 9, 10, 3, 11, 12). Note that 3 is "is" and we added new unique id's as we discovered new words. So say we are given a sentence "her name is", this would be (13, 2, 3). We want to know, given this context, what the next word should be. This is the algorithm I thought of, but I dont think its efficient:
We have a list of N chains (observed sentences) where a chain may be ex. 3,6,2,7,8.
Each chain is on average size M, where M is the average sentence length
We are given a new chain of size S, ex. 13, 2, 3, and we wish to know what is the most probable next word?
Algorithm:
First scan the entire list of chains for those who contain the full S input(13,2,3, in this example). Since we have to scan N chains, each of length M, and compare S letters at a time, its O(N*M*S).
If there are no chains in our scan which have the full S, next scan by removing the least significant word (ie. the first one, so remove 13). Now, scan for (2,3) as in 1 in worst case O(N*M*S) which is really S-1.
Continue scanning this way until we get results > 0 (if ever).
Tally the next words in all of the remaining chains we have gathered. We can use a hash table which counts every time we add, and keeps track of the most added word. O(N) worst case build, O(1) to find max word.
The max word found is the the most likely, so return it.
Each scan takes O(M*N*S) worst case. This is because there are N chains, each chain has M numbers, and we must check S numbers for overlaying a match. We scan S times worst case (13,2,3,then 2,3, then 3 for 3 scans = S). Thus, the total complexity is O(S^2 * M * N).
So if we have 100,000 chains and an average sentence length of 10 words, we're looking at 1,000,000*S^2 to get the optimal word. Clearly, N >> M, since sentence length does not scale with number of observed sentences in general, so M can be a constant. We can then reduce the complexity to O(S^2 * N). O(S^2 * M * N) may be more helpful for analysis though, since M can be a sizeable "constant".
This could be the complete wrong approach to take for this type of problem, but I wanted to share my thoughts instead of just blatantly asking for assitance. The reason im scanning the way I do is because I only want to scan as much as I have to. If nothing has the full S, just keep pruning S until some chains match. If they never match, we have no idea what to predict as the next word! Any suggestions on a less time/space complex solution? Thanks!
This is the problem of language modeling. For a baseline approach, The only thing you need is a hash table mapping fixed-length chains of words, say of length k, to the most probable following word.(*)
At training time, you break the input into (k+1)-grams using a sliding window. So if you encounter
The wrath sing, goddess, of Peleus' son, Achilles
you generate, for k=2,
START START the
START the wrath
the wrath sing
wrath sing goddess
goddess of peleus
of peleus son
peleus son achilles
This can be done in linear time. For each 3-gram, tally (in a hash table) how often the third word follows the first two.
Finally, loop through the hash table and for each key (2-gram) keep only the most commonly occurring third word. Linear time.
At prediction time, look only at the k (2) last words and predict the next word. This takes only constant time since it's just a hash table lookup.
If you're wondering why you should keep only short subchains instead of full chains, then look into the theory of Markov windows. If your model were to remember all the chains of words that it has seen in its input, then it would badly overfit its training data and only reproduce its input at prediction time. How badly depends on the training set (more data is better), but for k>4 you'd really need smoothing in your model.
(*) Or to a probability distribution, but this is not needed for your simple example use case.
Yeh Whye Teh also has some recent interesting work that addresses this problem. The "Sequence Memoizer" extends the traditional prediction-by-partial-matching scheme to take into account arbitrarily long histories.
Here is a link the original paper: http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
It is also worth reading some of the background work, which can be found in the paper "A Bayesian Interpretation of Interpolated Kneser-Ney"

How to determine correspondence between two lists of names?

I have:
1 million university student names and
3 million bank customer names
I manage to convert strings into numerical values based on hashing (similar strings have similar hash values). I would like to know how can I determine correlation between these two sets to see if values are pairing up at least 60%?
Can I achieve this using ICC? How does ICC 2-way random work?
Please kindly answer ASAP as I need this urgently.
This kind of entity resolution etc is normally easy, but I am surprised by the hashing approach here. Hashing loses information that is critical to entity resolution. So, if possible, you shouldn't use hash, rather the original strings.
Assuming using original strings is an option, then you would want to do something like this:
List A (1M), List B (3M)
// First, match the entities that match very well, and REMOVE them.
for a in List A
for b in List B
if compare(a,b) >= MATCH_THRESHOLD // This may be 90% etc
add (a,b) to matchedList
remove a from List A
remove b from List B
// Now, match the entities that match well, and run bipartite matching
// Bipartite matching is required because each entity can match "acceptably well"
// with more than one entity on the other side
for a in List A
for b in List B
compute compare(a,b)
set edge(a,b) = compare(a,b)
If compare(a,b) < THRESHOLD // This seems to be 60%
set edge(a,b) = 0
// Now, run bipartite matcher and take results
The time complexity of this algorithm is O(n1 * n2), which is not very good. There are ways to avoid this cost, but they depend upon your specific entity resolution function. For example, if the last name has to match (to make the 60% cut), then you can simply create sublists in A and B that are partitioned by the first couple of characters of the last name, and just run this algorithm between corresponding list. But it may very well be that last name "Nuth" is supposed to match "Knuth", etc. So, some local knowledge of what your name comparison function is can help you divide and conquer this problem better.

Resources