What is non-crossing matching between two strings in diff3 algorithm? - algorithm

I'm reading through this article https://www.cis.upenn.edu/~bcpierce/papers/diff3-short.pdf and can't understand what this paragraph means:
Below is the copy-pasted text from the paper and a screenshot of the same paragraph for better readability
The first step of diff3 is to call a two-way comparison subroutine on (O, A) and (O, B) to compute a non-crossing matching M_a between the indices of O
and A — that is, a boolean function on pairs of indices from O and A such that if M_a[i, j] = true then (a) O[i] = A[j], (b) M_a[i′, j] = false and M_a[i, j′] = false whenever i′ != i and j′ != j, and (c) MA[i′, j′] = false whenever either i′ < i and j′ > j or i′ > i and j′ < j — and a non-crossing matching M_b between the indices of O and B. We treat this algorithm as a black box, simply assuming (a) that it is deterministic, and (b) that it always yields maximum matchings. For the counterexamples in the next section, we have verified that the matchings we use correspond to the ones actually chosen by GNU diff3
Where O is a list of characters of the base string, A - is the list of a string, and B - is the list of characters in string b (if I understood beginning of the article correctly)
Q: What is non-crossing matching between two strings? Can you show a visual example of this?

Related

Ukkonen's suffix tree algorithm: procedure 'test and split' unclear

ukkonen's on line construction algorithm
i got a problem trying to understand the 'test and split' procedure,which is as follows:
procedure test–and–split(s, (k, p), t):
>1. if k ≤ p then
>2. let g'(s,(k',p'))=s' be the tk-transition from s
>3. if t=t(k'+p-k+1) then return (true,s)
my problem is that what exactly does the 2nd line mean,how can g'(s,(k',p'))be still a tk-transition if it starts from s and followed by t(k') instead of t(k)??
Probably you already figured it out and you don't need an answer anymore, but since I had the same problem in trying to understand it, and maybe it'll be useful for someone else in the future, the answer I think is the following one.
In Ukkonen's on line construction algorithm, on page 7 you can read that:
...
The string w spelled out by the transition path in STrie(T) between two explicit states s and r is represented in STree(T) as generalized transition g′(s,w) = r. To save space the string w is actually represented as a pair (k,p) of pointers (the left pointer k and the right pointer p) to T such that tk . . . tp = w. In this way the generalized transition gets form g′(s, (k, p)) = r.
Such pointers exist because there must be a suffix Ti such that the transition path for Ti in STrie(T) goes through s and r. We could select the smallest such i, and let k and p point to the substring of this Ti that is spelled out by the transition path from s to r. A transition g′(s, (k, p)) = r is called an a–transition if tk = a. Each s can have at most one a–transition for each a ∈ Σ.
...
This means that we are looking for the smallest indexes k and p such that tk . . . tp = w in T
=> if there is more than one occurrence of w in T, with k and p we always reference the first one.
Now, procedure test–and–split(s,(k,p),t) tests whether or not a state with canonical reference pair (s,(k,p)) is the endpoint, that is, a state that in STrie(T i−1) would have a ti –transition. Symbol ti is given as input parameter t.
The first lines of the algorithm are the following:
procedure test–and–split(s,(k,p),t):
1. if k ≤ p then
2. let g′(s,(k′,p′)) = s′ be the t(k)–transition from s;
3. if t = t(k′+p−k+1) then return(true,s)
4. else ...
On line 1 we check if the state is implicit (that is when k <= p).
If so, then on line 2 we want to find the transition from s that starts with the character we find in pos k of T (that is tk). Note that tk must be equal to tk' but indexes k and k' can be different because we always point to the first occurrence of a string w in T (remember also that from one state there can be at most one transition that starts with character tk => so that's the correct and the only one).
Then on line 3 we check if the state referenced by the canonical reference pair (s,(k,p)) is the endpoint, that is if it has a ti -transition. The state (s,(k,p)) is the one (implicit or not) that we can reach from state s, following the tk' -transition (that is the tk-transition because k' = k) for (p - k) characters. This explains the tk′+p−k+1, where the +1 is for the next character, the one that we are checking if it is equal to t (where t = ti). In that case we reached the endpoint and we return true.
Else, starting from line 4, we split the transition g′(s,(k′,p′)) = s′ to make explicit the state (s,(k,p)) and return the new explicit state.

Given a dictionary and a list of letters find all valid words that can be built with the letters

The brute force way can solve the problem in O(n!), basically calculating all the permutations and checking the results in a dictionary. I am looking for ways to improve the complexity. I can think of building a tree out of the dictionary but still checking all letters permutations is O(n!). Are there better ways to solve this problem?
Letters can have duplicates.
The api for the function looks like this:
List<String> findValidWords(Dict dict, char letters[])
Assume that letters only contains letters from a to z.
Use an integer array to count the number of occurrence of a character in letters.
For each word in the dictionary, check if there is a specific character in the word that appears more than allowed, if not, add this word into result.
List<String> findValidWords(List<String> dict, char letters[]){
int []avail = new int[26];
for(char c : letters){
int index = c - 'a';
avail[index]++;
}
List<String> result = new ArrayList();
for(String word: dict){
int []count = new int[26];
boolean ok = true;
for(char c : word.toCharArray()){
int index = c - 'a';
count[index]++;
if(count[index] > avail[index]){
ok = false;
break;
}
}
if(ok){
result.add(word);
}
}
return result;
}
So we can see that the time complexity is O(m*k) with m is number of word in the dictionary and k is the maximum total of characters in a word
You can sort each word in your dictionary so that the letters appear in the same order as they do in the alphabet, and then build a trie out of your sorted words. (where each node contains a list of all words that can be made out of the letters). (linear time in total letter length of dictionary) Then, given a set of query letters, sort the letters the same way and proceed through the trie using depth first search in all possible directions that use a subset of your letters from left to right. Any time you reach a node in the trie that contains words, output those words. Each path you explore can be charged to at least one word in the dictionary, so the worst case complexity to find all nodes that contain words you can make is O(kn) where n is the number of words in the dictionary and k is the maximum number of letters in a word. However for somewhat restricted sets of query letters, the running time should be much faster per query.
Here is the algorithm that will find all words that can be formed from a set of letters in O(1). We will represent words with their spectra and store them in a prefix tree (aka trie).
General Description
The spectrum of a word W is an array S of size N, such that S(i) is the number of occurrences (aka frequency) of an A(i) letter in the word W, where A(i) is the i-th letter of a chosen alphabet and N is its size.
For example, in the English alphabet, A(0) is A, A(1) is B, ... , A(25) is Z. A spectrum of the word aha is <2,0,0,0,0,0,0,1,0,...,0>.
We will store the dictionary in a prefix trie, using spectrum as a key. The first token of a key is the frequency of letter A, the second is the frequency of letter B and so on. (From here and below we will use the English alphabet as an example).
Once formed, our dictionary will be a tree with the height 26 and width that varies with each level, depending on a popularity of the letter. Basically, each layer will have a number of subtrees that is equal to the maximum word frequency of this letter in the provided dictionary.
Since our task is not only to decide whether we can build a word from the provided set of characters but also to find these words (a search problem), then we need to attach the words to their spectra (as spectral transformation is not invertible, consider spectra of words read and dear). We will attach a word to the end of each path that represents its spectrum.
To find whether we can build a word from a provided set we will build a spectrum of the set, and find all paths in the prefix trie with the frequencies bounded by the corresponding frequencies of the set's spectrum. (Note, we are not forcing to use all letters from the set, so if a word uses fewer letters, then we can build it. Basically, our requirement is that for all letters in the word the frequency of a letter should be less than or equal than a frequency of the same letter in the provided set).
The complexity of the search procedure doesn't depend on the length of the dictionary or the length of the provided set. On average, it is equal to 26 times the average frequency of a letter. Given the English alphabet, it is a quite small constant factor. For other alphabets, it might not be the case.
Reference implementation
I will provide a reference implementation of an algorithm in OCaml.
The dictionary data type is recursive:
type t = {
dict : t Int.Map.t;
data : string list;
}
(Note: it is not the best representation, probably it is better to represent it is a sum type, e.g., type t = Dict of t Int.Map.t | Data of string list, but I found it easier to implement it with the above representation).
We can generalize the algorithm by a spectrum function, either using a functor, or by just storing the spectrum function in the dictionary, but for the simplicity, we will just hardcode the English alphabet in the ASCII representation,
let spectrum word =
let index c = Char.(to_int (uppercase c) - to_int 'A') in
let letters = Char.(to_int 'Z' - to_int 'A' + 1) in
Array.init letters ~f:(fun i ->
String.count word ~f:(fun c -> index c = i))
Next, we will define the add_word function of type dict -> string -> dict, that will add a new path to our dictionary, by decomposing a word to its spectrum, and adding each constituent. Each addition will require exactly 26 iterations, not including the spectrum computation. Note, the implementation is purely functional, and doesn't use any imperative features. Every time the function add_word returns a new data structure.
let add_word dict word =
let count = spectrum word in
let rec add {dict; data} i =
if i < Array.length count then {
data;
dict = Map.update dict count.(i) ~f:(function
| None -> add empty (i+1)
| Some sub -> add sub (i+1))
} else {empty with data = word :: data} in
add dict 0
We are using the following definition of the empty value in the add function:
let empty = {dict = Int.Map.empty; data=[]}
Now let's define the is_buildable function of type dict -> string -> bool that will decide whether the given set of characters can be used to build any word in the dictionary. Although we can express it via the search, by checking the size of the found set, we would still prefer to have a specialized implementation, as it is more efficient and easier to understand. The definition of the function follows closely the general description provided above. Basically, for every character in the alphabet, we check whether there is an entry in the dictionary with the frequency that is less or equal than the frequency in the building set. If we checked all letters, then we proved, that we can build at least one word with the given set.
let is_buildable dict set =
let count = spectrum set in
let rec find {dict} i =
i >= Array.length count ||
Sequence.range 0 count.(i) ~stop:`inclusive |>
Sequence.exists ~f:(fun cnt -> match Map.find dict cnt with
| None -> false
| Some dict -> find dict (i+1)) in
find dict 0
Now, let's actually find the set of all words, that are buildable from the provided set:
let build dict set =
let count = spectrum set in
let rec find {dict; data} i =
if i < Array.length count then
Sequence.range 0 count.(i) ~stop:`inclusive |>
Sequence.concat_map ~f:(fun cnt -> match Map.find dict cnt with
| None -> Sequence.empty
| Some dict -> find dict (i+1))
else Sequence.of_list data in
find dict 0
We will basically follow the structure of the is_buildable function, except that instead of proving that such a frequency exists for each letter, we will collect all the proofs by reaching the end of the path and grabbing the set of word attached to it.
Testing and example
For the sake of completeness, we will test it by creating a small program, that will read a dictionary, with each word on a separate line, and interact with a user, by asking for a set and printing the resultion set of words, that can be built from it.
module Test = struct
let run () =
let dict =
In_channel.(with_file Sys.argv.(1)
~f:(fold_lines ~init:empty ~f:add_word)) in
let prompt () =
printf "Enter characters and hit enter (or Ctrl-D to stop): %!" in
prompt ();
In_channel.iter_lines stdin ~f:(fun set ->
build dict set |> Sequence.iter ~f:print_endline;
prompt ())
end
Here comes and example of interaction, that uses /usr/share/dict/american-english dictionary available on my machine (Ubunty Trusty).
./scrabble.native /usr/share/dict/american-english
Enter characters and hit enter (or Ctrl-D to stop): read
r
R
e
E
re
Re
Er
d
D
Rd
Dr
Ed
red
Red
a
A
Ra
Ar
era
ear
are
Rae
ad
read
dear
dare
Dare
Enter characters and hit enter (or Ctrl-D to stop):
(Yep, the dictionary contains words, that like r and d that are probably not true English words. In fact, for each letter the dictionary has a word, so, we can basically build a word from each non-empty set of alphabet letters).
The full implementation along with the building instructions can be found on Gist
A better way to do this is to loop through all the words in the dictionary and see if the word can be built with the letters in the array.
"Sign" the letters available by sorting them in order; that's O(m log m), where m is the number of letters.
"Sign" each word in the dictionary by sorting the letters of the word in order; that's O(k log k), where k is the length of the word.
Compare the letter signature to each word signature; that's O(min(m, k) * n), where n is the number of words in the dictionary. Output any word that matches.
Assuming an English word list of approximately a quarter-million words, and no more than about half a dozen, that should be nearly instantaneous.
I was recently asked the same question in BankBazaar interview. I was given the option to (he said that in a very subtle manner) pre-process the dictionary in any way I want.
My first thought was to arrange the dictionary in a trie or ternary search tree, and make all the words from the letters given. In any optimization way, that would take n! + n-1! + n-2! n-3! + ..... + n word checks(n being the number of letters) in worst case, which was not acceptable.
The other way could be to check all the dictionary words whether they can be made from the given letters. This again in any optimized way would take noOfDictionaryWords(m) * average size of dictionary words(k) at worst case, which was again not acceptable.
Now I have n! + n-1! + n-2! + .... + N words, which I have to check in the dictionary, and I don't want to check them all, so what are the situations that I have to check only a subset of them, and how to group them.
If I have to check only combination and not permutation, the result gets to 2^n.
so I have to pre-process the dictionary words in such a way that if I pass a combination, all the anagrams would be printed.
A ds something like this : http://1.bp.blogspot.com/-9Usl9unQJpY/Vg6IIO3gpsI/AAAAAAAAAbM/oTuhRDWelhQ/s1600/hashmapArrayForthElement.png
A hashvalue made by the letters(irrespective of its positions and permutation), pointing to list containing all the words made by those letters, then we only need to check that hashvalue.
I gave the answer to make the hash value by assigning a prime value to all the alphabets and while calculating the hash value of a word, multiply all the assigned values. This will create a problem of having really big hash values given that 26th prime is 101, and many null values in the map taking space. We could optimize it a bit by rather than starting lexicographically with a = 2, b = 3, c = 5, d = 7.... z = 101, we search for the most used alphabets and assign them small values, like vowels, and 's', 't' etc.
The interviewer accepted it, but was not expecting the answer, so there is definitely another answer, for better or worse but there is.
Swift 4
func findValidWords(in dictionary: [String], with letters: [Character]) -> [String] {
var validWords = [String]()
for word in dictionary {
var temp = word
for char in letters {
temp = temp.filter({ $0 != char })
if temp.isEmpty {
validWords.append(word)
break
}
}
}
return validWords
}
print(findValidWords(in: ["ape", "apples", "orange", "elapse", "lap", "soap", "bar", "sole"], with: ["a","p","l","e","s","o"]))
Output => ["ape", "apples", "elapse", "lap", "soap", "sole"]
My English is not good so try to understand.
My approach is using bit/bitwise to increase speed. Still bruteforce, though.
FIRST STEP
We only consider distinct character in each word and mark its existence. English has 26 characters, so we need 26 bits. Integer is 32 bits. That's enough.
Now encode each words in dictionary to an integer number.
abcdddffg -> 123444667 -> 123467 (only distinct characters) -> 1111011 (bits) -> 123 (decimal number)
So 2,000,000 words will be converted into 2,000,000 integer numbers.
Now let say you have this set of letters: a,b,c,d,e
abcde -> 12345 -> 1111100 (bits)
Do AND operation and we have:
1111100 (abcde)
&
1111011 (abcdddffg, no e)
=
1111000 (result) => result != abcdddffg => word cannot be created
Other example with a,b,c,d,e,f,g,h:
11111111 (abcdefgh)
&
11110110 (abcdddffg, no e and h)
=
11110110 (result) => result == abcdddffg => word can be created
SECOND STEP
While converting word to number, store the letter count also. If we found a match in first step, we continue to check if the number of letters is enough too.
Depend on the requirement, you might not need this second step.
COMPLEXITY
O(n) to convert word to number and store letters count. Only need to do this once.
O(n) for each search query.
Following is more efficient way :-
1.Use count sort to count all letters appearing in the a word in dictionary.
2.Do count sort on the collection of letter that you are given.
3.Compare if the counts are same then the word can be made.
4. Do this for all words in dictionary.
This will be inefficient for multiple such queries so you can do following :-
1. make a tupple for each word using count sort.
2. put the tupple in a Tree or hashmap with count entries.
3. When query is given do count sort and lookup tupple in hashmap
.
Time complexity :-
The above method gives O(1) time complexity for a query and O(N) time complexity for hash table construction where N is no of words in dictionary
(cf. anagram search, e.g. using primes looks cleaner for a signature based approach - collect for all non-equivalent "substrings of letters"])
Given the incentive, I'd (pre)order Dict by (set of characters that make up each word, increasing length) and loop over the subsets from letters checking validity of each word until too long.
Alternatively, finding the set of words from dict out of chars from letters can be considered a multi-dimensional range query: with "eeaspl" specifying letters, valid words have zero to two "e", one or none of a, s, p, l, and no other characters at all - bounds on word length (no longer than letters, lower bound to taste) blend in nicely.
Then again, data structures like k-d-trees do well with few, selective dimensions.
(Would-be comment: you do not mention alphabet cardinality, whether "valid" depends on capitalisation or diacritics, "complexity" includes programmer effort or preprocessing of dict - the latter may be difficult to amortise if dict is immutable.)
Swift 3
func findValidWords(wordsList: [String] , string: String) -> [String]{
let charCountsDictInTextPassed = getCharactersCountIn(string: string)
var wordsArrayResult: [String] = []
for word in wordsList {
var canBeProduced = true
let currentWordCharsCount = getCharactersCountIn(string: word)
for (char, count) in currentWordCharsCount {
if let charCountInTextPassed = charCountsDictInTextPassed[char], charCountInTextPassed >= count {
continue
}else{
canBeProduced = false
break
}
}// end for
if canBeProduced {
wordsArrayResult.append(word)
}//end if
}//end for
return wordsArrayResult
}
// Get the count of each character in the string
func getCharactersCountIn(string: String) -> [String: Int]{
var charDictCount:[String: Int] = [:]
for char in string.characters {
if let count = charDictCount[String(char)] {
charDictCount[String(char)] = count + 1
}else{
charDictCount[String(char)] = 1
}
}//end for
return charDictCount
}
If letters can be repeated, that means that a word can be infinitely long. You would obviously cap this at the length of the longest word in the dictionary, but there are still too many words to check. Like nmore suggested, you'd rather iterate over the dictionary to do this.
List<String> findAllValidWords(Set<String> dict, char[] letters) {
List<String> result = new LinkedList<>();
Set<Character> charSet = new HashSet<>();
for (char letter : letters) {
charSet.add(letter);
}
for (String word : dict) {
if (isPossible(word, charSet)) {
result.add(word);
}
}
return result;
}
boolean isPossible(String word, Set<Character> charSet) {
// A word is possible if all its letters are contained in the given letter set
for (int i = 0; i < word.length(); i++) {
if (!charSet.contains(word.charAt(i))) {
return false;
}
}
return true;
}

A Dynamic programming problem

Can anyone help me find an optimal Dynamic programming algorithm for this problem
On the way to dinner, the CCC competitors are lining up for their delicious curly fries. The N (1 ≤ N ≤ 100) competitors have lined up single-file to enter the cafeteria.
Doctor V, who runs the CCC, realized at the last minute that programmers simply hate standing in line next to programmers who use a different language. Thankfully, only two languages are allowed at the CCC: Gnold and Helpfile. Furthermore, the competitors have decided that they will only enter the cafeteria if they are in a group of at least K (1 ≤ K ≤ 6) competitors.
Doctor V decided to iterate the following scheme:
* He will find a group of K or more competitors who use the same language standing next to each other in line and send them to dinner.
* The remaining competitors will close the gap, potentially putting similar-language competitors together.
So Doctor V recorded the sequence of competitors for you. Can all the competitors dine? If so, what is the minimum number of groups of competitors to be sent to dinner?
Input
The first line contains two integers N and K.
The second line contains N characters that are the sequence of competitors in line (H represents Helpfile, G represents Gnold)
Output
Output, on one line, the single number that is the minimum number of groups that are formed for dinner. If not all programmers can dine, output -1.
I'd prefer not to solve an SPOJ problem in a practical manner for you, so take the following as an existence proof of a poly-time DP.
For K fixed, the set of strings that can dine is context-free. I'm going to use g and h instead of G and H. For example, for K = 3, one grammar looks like
S -> ε | g S g S g S G | h S h S h S H
G -> ε | g S G
H -> ε | h S H
The idea is that either there are no diners, or the first diner dines with at least K - 1 others, between any two of which (and the last and the end) there is a string that can dine.
Now use the weighted variant of CYK to find the minimum-weight parse, where nonempty S productions have weight 1, and all others have weight 0. For K fixed, the running time of CYK is O(N3).
The subproblem is the minimal groups needed to get everyone to dine for a given state of the line. There are a whole lot of possible line states but only a few that will actually be seen, so your memoization should probably be done with a hash map.
Here's some pseudocode.
int dine(string line){
if(hashmap.contains(line)){
return hashmap.get(line);
}
if(line.length == 0){
return 0;
}
best = N+1;
for(i=0;i<line.length;i=j){
type = string[i];
j = i+1;
while(type == string[j]){
j++;
}
if(j-i >= K){
result = dine(string.substring(0,i-1) + string.substring(j,string.length));
if(result > 0 && result < best){
best = result;
}
}
}
if(best == N+1){
hashmap.insert(line, -1);
return -1;
}
else {
hashmap.insert(line, best+1);
return best+1;
}
}
If you already found answer for this line, return that answer. If theres no people in the line, you are already done and you don't need to form any more groups.
Assume you can't form any groups. Then try to prove this wrong by trying out all the contiguous groups of like-minded programmers in the line. If the group is big enough to get picked, see how many more moves it would take to get everyone in the resturaunt after removing this group. Keep track of the least moves needed.
If you couldn't find a way to remove all the groups, return -1. Otherwise, return the least moves needed after removing a group plus one to account for the move you made on this step.
How about divide and conquer? Take a (removable) group somewhere near the middle, and the two groups either side of it, say ...HHHGGGGGHHHHH.... - now there's two possibilities. Either those 2 sets of H's dine in the same group, or they don't. If they dine in the same group, then those G's between them must be removed as a group of precisely those G's (so you may as well try that as your first move). If they don't, then you can solve for the left and right sublists separately. Either case, you've got a shorter list to recurse on.

Algorithm to merge two lists lacking comparison between them

I am looking for an algorithm to merge two sorted lists,
but they lack a comparison operator between elements of one list and elements of the other.
The resulting merged list may not be unique, but any result which satisfies the relative sort order of each list will do.
More precisely:
Given:
Lists A = {a_1, ..., a_m}, and B = {b_1, ..., b_n}. (They may be considered sets, as well).
A precedence operator < defined among elements of each list such that
a_i < a_{i+1}, and b_j < b_{j+1} for 1 <= i <= m and 1 <= j <= n.
The precedence operator is undefined between elements of A and B:
a_i < b_j is not defined for any valid i and j.
An equality operator = defined among all elements of either A or B
(it is defined between an element from A and an element from B).
No two elements from list A are equal, and the same holds for list B.
Produce:
A list C = {c_1, ..., c_r} such that:
C = union(A, B); the elements of C are the union of elements from A and B.
If c_p = a_i, c_q = a_j, and a_i < a_j, then c_p < c_q. (The order of elements
of the sublists of C corresponding to sets A and B should be preserved.
There exist no i and j such that c_i = c_j.
(all duplicated elements between A and B are removed).
I hope this question makes sense and that I'm not asking something either terribly obvious,
or something for which there is no solution.
Context:
A constructible number can be represented exactly in finitely many quadratic extensions to the field of rational numbers (using a binary tree of height equal to the number of field extensions).
A representation of a constructible number must therefore "know" the field it is represented in.
Lists A and B represent successive quadratic extensions of the rational numbers.
Elements of A and B themselves are constructible numbers, which are defined in the context
of previous smaller fields (hence the precedence operator). When adding/multiplying constructible numbers,
the quadratically extended fields must first be merged so that the binary arithmetic
operations can be performed; the resulting list C is the quadratically extended field which
can represent numbers representable by both fields A and B.
(If anyone has a better idea of how to programmatically work with constructible numbers, let me know. A question concerning constructible numbers has arisen before, and also here are some interesting responses about their representation.)
Before anyone asks, no, this question does not belong on mathoverflow; they hate algorithm (and generally non-graduate-level math) questions.
In practice, lists A and B are linked lists (stored in reverse order, actually).
I will also need to keep track of which elements of C corresponded to which in A and B, but that is a minor detail.
The algorithm I seek is not the merge operation in mergesort,
because the precedence operator is not defined between elements of the two lists being merged.
Everything will eventually be implemented in C++ (I just want the operator overloading).
This is not homework, and will eventually be open sourced, FWIW.
I don't think you can do it better than O(N*M), although I'd be happy to be wrong.
That being the case, I'd do this:
Take the first (remaining) element of A.
Look for it in (what's left of) B.
If you don't find it in B, move it to the output
If you do find it in B, move everything from B up to and including the match, and drop the copy from A.
Repeat the above until A is empty
Move anything left in B to the output
If you want to detect incompatible orderings of A and B, then remove "(what's left of)" from step 2. Search the whole of B, and raise an error if you find it "too early".
The problem is that given a general element of A, there is no way to look for it in B in better than linear time (in the size of B), because all we have is an equality test. But clearly we need to find the matches somehow and (this is where I wave my hands a bit, I can't immediately prove it) therefore we have to check each element of A for containment in B. We can avoid a bunch of comparisons because the orders of the two sets are consistent (at least, I assume they are, and if not there's no solution).
So, in the worst case the intersection of the lists is empty, and no elements of A are order-comparable with any elements of B. This requires N*M equality tests to establish, hence the worst-case bound.
For your example problem A = (1, 2, c, 4, 5, f), B = (a, b, c, d, e, f), this gives the result (1,2,a,b,c,4,5,d,e,f), which seems good to me. It performs 24 equality tests in the process (unless I can't count): 6 + 6 + 3 + 3 + 3 + 3. Merging with A and B the other way around would yield (a,b,1,2,c,d,e,4,5,f), in this case with the same number of comparisons, since the matching elements just so happen to be at equal indices in the two lists.
As can be seen from the example, the operation can't be repeated. merge(A,B) results in a list with an order inconsistent with that of merge(B,A). Hence merge((merge(A,B),merge(B,A)) is undefined. In general, the output of a merge is arbitrary, and if you go around using arbitrary orders as the basis of new complete orders, you will generate mutually incompatible orders.
This sounds like it would use a degenerate form of topological sorting.
EDIT 2:
Now with a combined routine:
import itertools
list1 = [1, 2, 'c', 4, 5, 'f', 7]
list2 = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
ibase = 0
result = []
for n1, i1 in enumerate(list1):
for n2, i2 in enumerate(itertools.islice(list2, ibase, None, 1)):
if i1 == i2:
result.extend(itertools.islice(list2, ibase, ibase + n2))
result.append(i2)
ibase = n2 + ibase + 1
break
else:
result.append(i1)
result.extend(itertools.islice(list2, ibase, None, 1))
print result
Would concatenating the two lists be sufficient? It does preserve the relative sortedness of elements from a and elements from b.
Then it's just a matter of removing duplicates.
EDIT: Alright, after the comment discussion (and given the additional condition that a_i=b_i & a_j=b_j & a_i<a_j => b_i<b-J), here's a reasonable solution:
Identify entries common to both lists. This is O(n2) for the naive algorithm - you might be able to improve it.
(optional) verify that the common entries are in the same order in both lists.
Construct the result list: All elements of a that are before the first shared element, followed by all elements of b before the first shared element, followed by the first shared element, and so on.
Given the problem as you have expressed it, I have a feeling that the problem may have no solution. Suppose that you have two pairs of elements {a_1, b_1} and {a_2, b_2} where a_1 < a_2 in the ordering of A, and b_1 > b_2 in the ordering of B. Now suppose that a_1 = b_1 and a_2 = b_2 according to the equality operator for A and B. In this scenario, I don't think you can create a combined list that satisfies the sublist ordering requirement.
Anyway, there's an algorithm that should do the trick. (Coded in Java-ish ...)
List<A> alist = ...
List<B> blist = ...
List<Object> mergedList = new SomeList<Object>(alist);
int mergePos = 0;
for (B b : blist) {
boolean found = false;
for (int i = mergePos; i < mergedList.size(); i++) {
if (equals(mergedList.get(i), b)) {
found = true; break;
}
}
if (!found) {
mergedList.insertBefore(b, mergePos);
mergePos++;
}
}
This algorithm is O(N**2) in the worst case, and O(N) in the best case. (I'm skating over some Java implementation details ... like combining list iteration and insertion without a major complexity penalty ... but I think it can be done in this case.)
The algorithm neglects the pathology I mentioned in the first paragraph and other pathologies; e.g. that an element of B might be "equal to" multiple elements of A, or vice versa. To deal with these, the algorithm needs to check each b against all elements of the mergedList that are not instances of B. That makes the algorithm O(N**2) in the best case.
If the elements are hashable, this can be done in O(N) time where N is the total number of elements in A and B.
def merge(A, B):
# Walk A and build a hash table mapping its values to indices.
amap = {}
for i, a in enumerate(A):
amap[a] = i
# Now walk B building C.
C = []
ai = 0
bi = 0
for i, b in enumerate(B):
if b in amap:
# b is in both lists.
new_ai = amap[b]
assert new_ai >= ai # check for consistent input
C += A[ai:new_ai] # add non-shared elements from A
C += B[bi:i] # add non-shared elements from B
C.append(b) # add the shared element b
ai = new_ai + 1
bi = i + 1
C += A[ai:] # add remaining non-shared elements from A
C += B[bi:] # from B
return C
A = [1, 2, 'c', 4, 5, 'f', 7]
B = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print merge(A, B)
(This is just an implementation of Anon's algorithm. Note that you can check for inconsistent input lists without hurting performance and that random access into the lists is not necessary.)

Algorithm to find matching pairs in a list

I will phrase the problem in the precise form that I want below:
Given:
Two floating point lists N and D of the same length k (k is multiple of 2).
It is known that for all i=0,...,k-1, there exists j != i such that D[j]*D[i] == N[i]*N[j]. (I'm using zero-based indexing)
Return:
A (length k/2) list of pairs (i,j) such that D[j]*D[i] == N[i]*N[j].
The pairs returned may not be unique (any valid list of pairs is okay)
The application for this algorithm is to find reciprocal pairs of eigenvalues of a generalized palindromic eigenvalue problem.
The equality condition is equivalent to N[i]/D[i] == D[j]/N[j], but also works when denominators are zero (which is a definite possibility). Degeneracies in the eigenvalue problem cause the pairs to be non-unique.
More generally, the algorithm is equivalent to:
Given:
A list X of length k (k is multiple of 2).
It is known that for all i=0,...,k-1, there exists j != i such that IsMatch(X[i],X[j]) returns true, where IsMatch is a boolean matching function which is guaranteed to return true for at least one j != i for all i.
Return:
A (length k/2) list of pairs (i,j) such that IsMatch(i,j) == true for all pairs in the list.
The pairs returned may not be unique (any valid list of pairs is okay)
Obviously, my first problem can be formulated in terms of the second with IsMatch(u,v) := { (u - 1/v) == 0 }. Now, due to limitations of floating point precision, there will never be exact equality, so I want the solution which minimizes the match error. In other words, assume that IsMatch(u,v) returns the value u - 1/v and I want the algorithm to return a list for which IsMatch returns the minimal set of errors. This is a combinatorial optimization problem. I was thinking I can first naively compute the match error between all possible pairs of indexes i and j, but then I would need to select the set of minimum errors, and I don't know how I would do that.
Clarification
The IsMatch function is reflexive (IsMatch(a,b) implies IsMatch(b,a)), but not transitive. It is, however, 3-transitive: IsMatch(a,b) && IsMatch(b,c) && IsMatch(c,d) implies IsMatch(a,d).
Addendum
This problem is apparently identically the minimum weight perfect matching problem in graph theory. However, in my case I know that there should be a "good" perfect matching, so the distribution of edge weights is not totally random. I feel that this information should be used somehow. The question now is if there is a good implementation to the min-weight-perfect-matching problem that uses my prior knowledge to arrive at a solution early in the search. I'm also open to pointers towards a simple implementation of any such algorithm.
I hope I got your problem.
Well, if IsMatch(i, j) and IsMatch(j, l) then IsMatch(i, l). More generally, the IsMatch relation is transitive, commutative and reflexive, ie. its an equivalence relation. The algorithm translates to which element appears the most times in the list (use IsMatch instead of =).
(If I understand the problem...)
Here is one way to match each pair of products in the two lists.
Multiply each pair N and save it to a structure with the product, and the subscripts of the elements making up the product.
Multiply each pair D and save it to a second instance of the structure with the product, and the subscripts of the elements making up the product.
Sort both structions on the product.
Make a merge-type pass through both sorted structure arrays. Each time you find a product from one array that is close enough to the other, you can record the two subscripts from each sorted list for a match.
You can also use one sorted list for an ismatch function, doing a binary search on the product.
well。。Multiply each pair D and save it to a second instance of the structure with the product, and the subscripts of the elements making up the product.
I just asked my CS friend, and he came up with the algorithm below. He doesn't have an account here (and apparently unwilling to create one), but I think his answer is worth sharing.
// We will find the best match in the minimax sense; we will minimize
// the maximum matching error among all pairs. Alpha maintains a
// lower bound on the maximum matching error. We will raise Alpha until
// we find a solution. We assume MatchError returns an L_1 error.
// This first part finds the set of all possible alphas (which are
// the pairwise errors between all elements larger than maxi-min
// error.
Alpha = 0
For all i:
min = Infinity
For all j > i:
AlphaSet.Insert(MatchError(i,j))
if MatchError(i,j) < min
min = MatchError(i,j)
If min > Alpha
Alpha = min
Remove all elements of AlphaSet smaller than Alpha
// This next part increases Alpha until we find a solution
While !AlphaSet.Empty()
Alpha = AlphaSet.RemoveSmallest()
sol = GetBoundedErrorSolution(Alpha)
If sol != nil
Return sol
// This is the definition of the helper function. It returns
// a solution with maximum matching error <= Alpha or nil if
// no such solution exists.
GetBoundedErrorSolution(Alpha) :=
MaxAssignments = 0
For all i:
ValidAssignments[i] = empty set;
For all j > i:
if MatchError <= Alpha
ValidAssignments[i].Insert(j)
ValidAssignments[j].Insert(i)
// ValidAssignments[i].Size() > 0 due to our choice of Alpha
// in the outer loop
If ValidAssignments[i].Size() > MaxAssignments
MaxAssignments = ValidAssignments[i].Size()
If MaxAssignments = 1
return ValidAssignments
Else
G = graph(ValidAssignments)
// G is an undirected graph whose vertices are all values of i
// and edges between vertices if they have match error less
// than or equal to Alpha
If G has a perfect matching
// Note that this part is NP-complete.
Return the matching
Else
Return nil
It relies on being able to compute a perfect matching of a graph, which is NP-complete, but at least it is reduced to a known problem. It is expected that the solution be NP-complete, but this is OK since in practice the size of the given lists are quite small. I'll wait around for a better answer for a few days, or for someone to expand on how to find the perfect matching in a reasonable way.
You want to find j such that D(i)*D(j) = N(i)*N(j) {I assumed * is ordinary real multiplication}
assuming all N(i) are nonzero, let
Z(i) = D(i)/N(i).
Problem: find j, such that Z(i) = 1/Z(j).
Split set into positives and negatives and process separately.
take logs for clarity. z(i) = log Z(i).
Sort indirectly. Then in the sorted view you should have something like -5 -3 -1 +1 +3 +5, for example. Read off +/- pairs and that should give you the original indices.
Am I missing something, or is the problem easy?
Okay, I ended up using this ported Fortran code, where I simply specify the dense upper triangular distance matrix using:
complex_t num = N[i]*N[j] - D[i]*D[j];
complex_t den1 = N[j]*D[i];
complex_t den2 = N[i]*D[j];
if(std::abs(den1) < std::abs(den2)){
costs[j*(j-1)/2+i] = std::abs(-num/den2);
}else if(std::abs(den1) == 0){
costs[j*(j-1)/2+i] = std::sqrt(std::numeric_limits<double>::max());
}else{
costs[j*(j-1)/2+i] = std::abs(num/den1);
}
This works great and is fast enough for my purposes.
You should be able to sort the (D[i],N[i]) pairs. You don't need to divide by zero -- you can just multiply out, as follows:
bool order(i,j) {
float ni= N[i]; float di= D[i];
if(di<0) { di*=-1; ni*=-1; }
float nj= N[j]; float dj= D[j];
if(dj<0) { dj*=-1; nj*=-1; }
return ni*dj < nj*di;
}
Then, scan the sorted list to find two separation points: (N == D) and (N == -D); you can start matching reciprocal pairs from there, using:
abs(D[i]*D[j]-N[i]*N[j])<epsilon
as a validity check. Leave the (N == 0) and (D == 0) points for last; it doesn't matter whether you consider them negative or positive, as they will all match with each other.
edit: alternately, you could just handle (N==0) and (D==0) cases separately, removing them from the list. Then, you can use (N[i]/D[i]) to sort the rest of the indices. You still might want to start at 1.0 and -1.0, to make sure you can match near-zero cases with exactly-zero cases.

Resources