I'm looking for an algorithm that can find/assign order and overlap, given a list of ordered element lists and a list of unordered element lists (between which overlap may or may not exist).
For this example I'll use integers, but they could just as well be people's names, ID codes, etc. That is, the numeric values can't be used to solve the real problem; I only used the ordered set (1,2,3,4,5,6,7,8,9,10) as the holy-grail answer to help explain the problem.
Input:
Ordered List of Lists: (1,2,3,4), (8,9,10), (3,4,5)
Unordered List of Lists: (3,4,2), (6,4,5,7), (10,9)
My thought process for doing this in my head:
The lists 3,4,5 and 1,2,3,4 are both ordered and have 3,4 in common, therefore the two ordered lists overlap to form 1,2,3,4,5, in that order.
The unordered list 3,4,2 is a subset of the ordered list 1,2,3,4,5, therefore it could be reordered as 2,3,4 and said to overlap the ordered list 1,2,3,4,5.
The same idea (as step 2) applies to the ordered list 8,9,10 compared with the unordered 10,9: it should be 9,10, overlapped with 8,9,10.
Now comparing the ordered list 1,2,3,4,5 and the unordered 6,4,5,7: their intersection is 4,5, so you could conclude the result is 1,2,3,4,5,(6,7|7,6), where (6,7|7,6) means either a 6 followed by a 7 or a 7 followed by a 6 (but it is unknown which is correct).
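For illustration, a minimal Python sketch of just that first merge step (the head/tail splice of two ordered lists; the function name is made up, and the unordered-subset cases are not handled):

def splice(a, b):
    # Merge two ordered lists whose tail/head overlap, e.g.
    # splice((1, 2, 3, 4), (3, 4, 5)) -> (1, 2, 3, 4, 5).
    # Returns None when no such overlap exists.
    a, b = tuple(a), tuple(b)
    for first, second in ((a, b), (b, a)):
        for i in range(len(first)):
            tail = first[i:]
            if second[:len(tail)] == tail:
                return first + second[len(tail):]
    return None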
Output:
I would like to be able to parse a matrix/tree/whatever kind of data structure to see what overlapped where and in what order,
and a summarized list containing sets of partially known order:
set1: 1,2,3,4,5,(6,7|7,6)
set2: 8,9,10
Does anyone know of a similar problem or algorithm I could use? Ideally it would be in Perl, but pseudocode or algorithms from another language would be fine.
Thanks
If I understand this right, you need one ordered list from a set of ordered and unordered lists. A possible solution would be to iterate over all the values in all the sets and add them to a hash table structure. In implementation terms that could be a C++ map, a Java HashMap, a Python dictionary, etc. That would look like:
for i over all sets S            // ordered and unordered
    for j over all values in S[i]
        H.insert(S[i][j])        // H is the hash table
Now iterate over the hash table entries to get the required ordered list. This is quite practical and optimal.
A not-so-practical solution, but worth mentioning for its niceness, is this:
Assign every unique number a corresponding prime number. For example, in your example case, map the numbers as follows:
p[1] = 2, p[2]=3, p[3]=5, p[4]=7, p[5]=11, p[6]= 13, p[7]=17, p[8]=19, p[9]=23, p[10]=29;
Now, each set Si can be represented by a value Vi, the product of the corresponding primes. So a set Si=(1,2,3) (or, for that matter, (2,1,3)) would have the value Vi = p[1]*p[2]*p[3].
Find the LCM of all the Vi's. Call this V:
V = LCM(V1, V2, ..., Vn)
Factorise V into its prime factors. Each prime number represents an element in your final ordered list.
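Here is a rough Python sketch of the prime-encoding idea for the example data (it recovers set membership only; math.lcm needs Python 3.9+):

from math import lcm, prod

p = {1: 2, 2: 3, 3: 5, 4: 7, 5: 11, 6: 13, 7: 17, 8: 19, 9: 23, 10: 29}
sets = [(1, 2, 3, 4), (8, 9, 10), (3, 4, 5), (3, 4, 2), (6, 4, 5, 7), (10, 9)]

values = [prod(p[x] for x in s) for s in sets]   # Vi = product of the set's primes
v = lcm(*values)                                 # V = LCM of all the Vi

# every element whose prime divides V appears somewhere in the input
combined = sorted(x for x, prime in p.items() if v % prime == 0)
print(combined)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]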
This second solution is neat but breaks down for practical purposes because we enter bignum space very quickly.
Hope at least one of these work for you!
Let's say I have N lists which are known. Each list has items which may repeat (so it is not a set), e.g.:
{A,A,B,C}, {A,B,C}, {B,B,B,C,C}
I need some algorithm (maybe a machine-learning one?) that answers the following question:
Given a new & unknown partial list of items, for example, {A,B}, what is the probability that C will appear in the list based on the what I know from the previous lists. If possible, I would like a more fine-grained probability of: given some partial list L, what is the probability that C will appear in the list once, probability it will appear twice, etc... Order doesn't matter. The probability of C appearing twice in {A,B} should equal it appearing twice in {B,A}
Any algorithms which can do this?
This is just pure mathematics, no actual "algorithms": simply estimate all the probabilities from your dataset (literally, count the occurrences). In particular, you can use a very simple data structure to achieve your goal. Represent each "list" as a bag of letters, thus:
{A,A,B,C} -> {A:2, B:1, C:1}
{A,B} -> {A:1, B:1}
etc., and create a basic reverse index of some sort, for example keeping an index for each letter separately, sorted by their counts.
Now, when a query like {A,B} + C comes in, all you do is search your data for lists that contain at least one A and one B (using your indexes), and then estimate the probability by computing the fraction of retrieved results containing C (or exactly one C) versus all retrieved results (this is a valid probability estimate, assuming your data is a bunch of independent samples from some underlying data-generating distribution).
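In Python that estimate could look roughly like this (the reverse index is skipped here and the stored bags are simply scanned; names are illustrative):

from collections import Counter

data = [Counter("AABC"), Counter("ABC"), Counter("BBBCC")]   # the known lists as bags

def probability(partial, letter, bags, times=None):
    # P(letter appears [exactly `times` times] | list contains the partial bag)
    query = Counter(partial)
    matches = [bag for bag in bags if all(bag[l] >= c for l, c in query.items())]
    if not matches:
        return 0.0
    if times is None:
        hits = sum(1 for bag in matches if bag[letter] > 0)
    else:
        hits = sum(1 for bag in matches if bag[letter] == times)
    return hits / len(matches)

print(probability("AB", "C", data))            # fraction of {A,B}-containing lists with a C
print(probability("AB", "C", data, times=2))   # ...with exactly two C's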
Alternatively, if your alphabet is very small you can actually precompute all the values P(C|{A,B}) etc. for all combinations of letters.
I have a list which contains random numbers such that each number >= 0. Now I have to divide the list into two equal parts (assume the list contains an even number of elements) such that all the numbers contained in the first list are less than or equal to the numbers in the second list. This can easily be done by any sorting mechanism in O(n log n). But I don't need the data in the two equal-length lists to be sorted; the only condition is that all elements in the first list <= all elements in the second list.
So is there a way, or a hack, to reduce the complexity, since we don't require sorted data here?
If the problem is actually solvable (the data is right), you can find the median using a selection algorithm. Once you have that, just create two equally sized arrays and iterate over the original list element by element, putting each element into one of the new lists depending on whether it's bigger or smaller than the median. This should run in linear time.
Edit: as gen-y-s pointed out, if you write the selection algorithm yourself or use a proper library, it might already partition the input list, so there is no need for the second pass.
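A Python sketch of the idea; the quickselect here is just a stand-in for any linear-time selection routine (numpy.partition would also do), and ties at the median are split so both halves stay the same size:

import random

def quickselect(xs, k):
    # k-th smallest element (0-based), expected linear time
    pivot = random.choice(xs)
    lows   = [x for x in xs if x < pivot]
    pivots = [x for x in xs if x == pivot]
    highs  = [x for x in xs if x > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))

def split_in_half(values):
    n = len(values)                              # assumed even
    median = quickselect(values, n // 2 - 1)     # largest value of the lower half
    first  = [v for v in values if v < median]
    second = [v for v in values if v > median]
    ties   = [v for v in values if v == median]
    need = n // 2 - len(first)                   # top up the first half with median copies
    return first + ties[:need], ties[need:] + second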
I have an array with, for example, 1,000,000,000,000 elements (integers). What is the best approach to pick, for example, only 3 random and unique elements from this array? The elements must be unique within the whole array, not just within the list of N (3 in my example) picked elements.
I read about reservoir sampling, but it only provides a way to pick random elements, which can be non-unique.
If the odds of hitting a non-unique value are low, your best bet will be to select 3 random numbers from the array, then check each against the entire array to ensure it is unique - if not, choose another random sample to replace it and repeat the test.
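A minimal Python sketch of that rejection approach (the retry limit is an illustrative safeguard):

import random

def pick_three_unique(arr, max_tries=1000):
    # draw candidates at random, then scan the whole array to confirm
    # each candidate value appears exactly once
    chosen = []
    tries = 0
    while len(chosen) < 3 and tries < max_tries:
        tries += 1
        candidate = random.choice(arr)
        if candidate in chosen:
            continue
        if sum(1 for x in arr if x == candidate) == 1:   # full scan for uniqueness
            chosen.append(candidate)
    return chosen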
If the odds of hitting a non-unique value are high, this increases the number of times you'll need to scan the array looking for uniqueness and makes the simple solution non-optimal. In that case you'll want to split the task of ensuring unique numbers from the task of making a random selection.
Sorting the array is the easiest way to find duplicates. Most sorting algorithms are O(n log n), but since your keys are integers Radix sort can potentially be faster.
Another possibility is to use a hash table to find duplicates, but that will require significant space. You can use a smaller hash table or Bloom filter to identify potential duplicates, then use another method to go through that smaller list.
import random

counts = [0] * (MAXINT - MININT + 1)    # one counter per possible value
for value in Elements:
    counts[value - MININT] += 1
uniques = [MININT + i for i, c in enumerate(counts) if c == 1]
result = random.sample(uniques, 3)
I assume that you have a reasonable idea what fraction of the array values are likely to be unique. So you would know, for instance, that if you picked 1000 random array values, the odds are good that one is unique.
Step 1. Pick 3 random hash algorithms. They can all be the same algorithm, except that you add different integers to each as a first step.
Step 2. Scan the array. Hash each integer all three ways, and for each hash algorithm, keep track of the X lowest hash codes you get (you can use a priority queue for this), and keep a hash table of how many times each of those integers occurs.
Step 3. For each hash algorithm, look for a unique element in that bucket. If it is already picked in another bucket, find another. (Should be a rare boundary case.)
That is your set of three random unique elements. Every unique triple should have even odds of being picked.
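A rough Python sketch of this (the bucket size, the crc32-based salted hash, and the dict with linear eviction in place of a real priority queue are all illustrative choices):

import random
import zlib

def pick_unique_minhash(elements, picks=3, bucket_size=1000):
    # per salt, keep the bucket_size values with the lowest hash codes
    # together with how often each occurred, then pick the lowest-hashing
    # value that occurred exactly once
    salts = [random.getrandbits(32) for _ in range(picks)]
    buckets = [{} for _ in range(picks)]            # value -> [hash, count]
    for x in elements:
        for salt, bucket in zip(salts, buckets):
            h = zlib.crc32(f"{salt}:{x}".encode())
            if x in bucket:
                bucket[x][1] += 1
            elif len(bucket) < bucket_size:
                bucket[x] = [h, 1]
            else:
                worst = max(bucket, key=lambda v: bucket[v][0])
                if h < bucket[worst][0]:            # evict the largest hash
                    del bucket[worst]
                    bucket[x] = [h, 1]
    chosen = []
    for bucket in buckets:
        for x, (h, count) in sorted(bucket.items(), key=lambda kv: kv[1][0]):
            if count == 1 and x not in chosen:      # unique in the array, not already picked
                chosen.append(x)
                break
    return chosen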
(Note: For many purposes it would be fine to just use one hash algorithm and find 3 things from its list...)
This algorithm will succeed with high likelihood in one pass through the array. What is better yet is that the intermediate data structure that it uses is fairly small and is amenable to merging. Therefore this can be parallelized across machines for a very large data set.
I have a problem which I am trying to solve with genetic algorithms. The problem is selecting some subset (say 4) of 100 integers (these integers are just ids that represent something else). Order does not matter, the solution to the problem is a SET of integers not an ordered list. I have a good fitness function but am having trouble with the crossover function.
I want to be able to mate the following two chromosomes:
[1 2 3 4] and
[3 4 5 6] into something useful. Clearly I cannot use the typical crossover function, because I could end up with duplicates in my children, which would represent invalid solutions. What is the best crossover method in this case?
Just ignore any element that occurs in both of the sets (i.e. in their intersection); that is, leave such elements unchanged in both sets.
The rest of the elements form two disjoint sets, to which you can apply pretty much any random transformation (e.g. swapping some pairs randomly) without getting duplicates.
This can be thought of as ordering and aligning both sets so that matching elements face each other and applying one of the standard crossover algorithms.
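Something like this Python sketch (the parents are plain sets, and the "random transformation" here is simply shuffling the disjoint elements and redistributing them):

import random

def set_crossover(parent_a, parent_b):
    common = parent_a & parent_b                 # the intersection stays in both children
    rest = list((parent_a | parent_b) - common)  # disjoint elements get redistributed
    random.shuffle(rest)
    k = len(parent_a) - len(common)              # free slots in the first child
    child_a = common | set(rest[:k])
    child_b = common | set(rest[k:])
    return child_a, child_b

print(set_crossover({1, 2, 3, 4}, {3, 4, 5, 6}))   # e.g. ({1, 3, 4, 6}, {2, 3, 4, 5})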
Sometimes it is beneficial to let your solution go "out of bounds" so that your search will converge more quickly. Rather than making a set of 4 unique integers a requirement for your chromosome, make the number of integers (and their uniqueness) part of the fitness function.
Since order doesn't matter, just collect all the numbers into an array, sort the array, throw out the duplicates (by disconnecting them from a linked list, or setting them to a negative number, or whatever). Shuffle the array and take the first 4 numbers.
I don't really know what you mean by "typical crossover", but I think you could use a crossover similar to what is often used for permutations:
take m ints from the first parent (m < n, where n is the number of ints in your sets)
scan the second and fill your subset from it with (n-m) ints that are free (not in the subset already).
This way you will have n ints from the first and n-m ints from the second parent, without duplications.
Sounds like a valid crossover for me :-).
I guess it might be beneficial not to do either step on ordered sets (or using an iterator where the order of the returned elements correlates somehow with the natural ordering of the ints); otherwise either smaller or higher numbers would get a higher chance to be in the child, making your search biased.
Whether it is the best method depends on the problem you want to solve...
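As a rough Python sketch of those two steps (m is drawn at random, and both parents are sampled in random order to avoid the bias mentioned above):

import random

def subset_crossover(parent_a, parent_b, n=4):
    m = random.randint(1, n - 1)                       # genes taken from the first parent
    child = set(random.sample(sorted(parent_a), m))
    for gene in random.sample(sorted(parent_b), len(parent_b)):
        if len(child) == n:
            break
        child.add(gene)    # adding an element already present leaves the set unchanged
    return child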
In order to combine sets A and B, you could choose the resulting set S probabilistically so that the probability that x is in S is (number of sets out of A, B, which contain x) / 2. This will be guaranteed to contain the intersection and be contained in the union, and will have expected cardinality 4.
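A tiny sketch of that, assuming the parents are Python sets of equal size:

import random

def probabilistic_union_crossover(a, b):
    # each element of the union is kept with probability (#parents containing it) / 2
    return {x for x in a | b if random.random() < ((x in a) + (x in b)) / 2}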
I have n sorted lists (5 < n < 300). These lists are quite long (300000+ tuples). Selecting the top k of the individual lists is of course trivial - they are right at the head of the lists.
Example for k = 2:
top2 (L1: [ 'a': 10, 'b': 4, 'c':3 ]) = ['a':10 'b':4]
top2 (L2: [ 'c': 5, 'b': 2, 'a':0 ]) = ['c':5 'b':2]
Where it gets more interesting is when I want the combined top k across all the sorted lists.
top2(L1+L2) = ['a':10, 'c':8]
Just combining the top k of the individual lists would not necessarily give the correct result:
top2(top2(L1)+top2(L2)) = ['a':10, 'b':6]
The goal is to reduce the required space and keep the sorted lists small.
top2(topX(L1)+topX(L2)) = ['a':10, 'c':8]
The question is whether there is an algorithm to calculate the combined top k with the correct order while cutting off the long tail of the lists at a certain position. And if there is: how does one find the limit X where it is safe to cut?
Note: Correct counts are not important. Only the order is.
top2(magic([L1,L2])) = ['a', 'c']
This algorithm uses O(U) memory, where U is the number of unique keys. I doubt a lower memory bound can be achieved, because it is impossible to tell which keys can be discarded until all the keys have been summed.
Make a master list of (key:total_count) tuples. Simply run through each list one item at a time, keeping a tally of how many times each key has been seen.
Use any top-k selection algorithm on the master list that does not use additional memory. One simple solution is to sort the list in place.
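A short Python sketch of this approach, assuming each list is a sequence of (key, count) tuples:

from collections import Counter

def top_k(lists, k):
    totals = Counter()                 # the master list of key -> total_count
    for lst in lists:
        for key, count in lst:
            totals[key] += count
    return totals.most_common(k)       # any top-k selection works here

print(top_k([[('a', 10), ('b', 4), ('c', 3)],
             [('c', 5), ('b', 2), ('a', 0)]], 2))   # [('a', 10), ('c', 8)]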
If I understand your question correctly, the correct output is the top 10 items, irrespective of the list from which each came. If that's correct, then starting with the first 10 items in each list will allow you to generate the correct output (if you only want unique items in the output, but the inputs might contain duplicates, then you need the first 10 unique items from each list).
In the most extreme case, all the top items come from one list, and all items from the other lists are ignored. In this case, having 10 items in the one list will be sufficient to produce the correct result.
Associate an index with each of your n lists. Set it to point to the first element in each case.
Create a list-of-lists, and sort it by the indexed elements.
The indexed item on the top list in your list-of-lists is your first element.
Increment the index for the topmost list and remove that list from the list-of-lists and re-insert it based on the new value of its indexed element.
The indexed item on the top list in your list-of-lists is your next element.
Go to step 4 and repeat until done.
You didn't specify how many lists you have. If n is small, then step 4 can be done very simply (just re-sort the lists). As n grows, you may want to think about more efficient ways to re-sort an almost-sorted list-of-lists.
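A Python sketch of this merge, using a heap in place of re-sorting the list-of-lists (the heap keeps the lists ordered by their currently indexed element):

import heapq

def merge_sorted_lists(lists):
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)                         # list-of-lists sorted by indexed element
    while heap:
        value, i, j = heapq.heappop(heap)       # the top list's indexed item is next
        yield value
        if j + 1 < len(lists[i]):               # advance the index and re-insert the list
            heapq.heappush(heap, (lists[i][j + 1], i, j + 1))

print(list(merge_sorted_lists([[1, 4, 7], [2, 5], [3, 6, 8]])))   # [1, 2, 3, 4, 5, 6, 7, 8]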
I had not understood that if an 'a' appears in two lists, their counts must be combined. Here is a new, memory-efficient algorithm:
(New) Algorithm:
(Re-)sort each list by ID (not by count). To release memory, the list can be written back to disk. Only enough memory for the longest list is required.
Get the next lowest unprocessed ID and find the total count across all lists.
Insert the ID into a priority queue of k nodes. Use the total count as the node's priority (not the ID). This priority queue drops the lowest node if more than k nodes are inserted.
Go to step 2 until all ID's have been exhausted.
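A Python sketch of steps 2-4, assuming each list has already been re-sorted by ID and is a list of (id, count) tuples (a real implementation would stream them from disk):

import heapq

def top_k_by_id_merge(id_sorted_lists, k):
    positions = [0] * len(id_sorted_lists)
    winners = []                                   # min-heap of (total_count, id), at most k entries
    while True:
        # step 2: the next lowest unprocessed ID across all lists
        heads = [lst[p][0] for lst, p in zip(id_sorted_lists, positions) if p < len(lst)]
        if not heads:
            break
        current = min(heads)
        total = 0
        for i, lst in enumerate(id_sorted_lists):
            if positions[i] < len(lst) and lst[positions[i]][0] == current:
                total += lst[positions[i]][1]
                positions[i] += 1
        # step 3: k-node priority queue keyed on the total count
        heapq.heappush(winners, (total, current))
        if len(winners) > k:
            heapq.heappop(winners)                 # drop the lowest total
    return sorted(winners, reverse=True)           # best first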
Analysis: This algorithm can be implemented using only O(k) additional memory to store the min-heap. It makes several trade-offs to accomplish this:
The lists are sorted by ID in place; the original orderings by counts are lost. Otherwise, O(U) additional memory is required to make a master list of ID:total_count tuples, where U is the number of unique IDs.
The next lowest ID is found in O(n) time by checking the first tuple of each list. This is repeated U times, where U is the number of unique IDs. It might be improved by using a min-heap to track the next lowest ID, which would require O(n) additional memory (and may not be faster in all cases).
Note: This algorithm assumes IDs can be compared quickly. String comparisons are not trivial, so I suggest hashing string IDs to integers. The hashes do not have to be unique, but collisions must be checked so all IDs are properly sorted/compared. Of course, this adds to the memory/time complexity.
The perfect solution requires all tuples to be inspected at least once.
However, it is possible to get close to the perfect solution without inspecting every tuple. Discarding the "long tail" introduces a margin of error. You can use some type of heuristic to calculate when the margin of error is acceptable.
For example, if there are n=100 sorted lists and you have inspected down each list until the count is 2, the most the total count for a key could increase by is 200.
I suggest taking an iterative approach:
Tally each list until a certain lower count threshold L is reached.
Lower L to include more tuples.
Add the new tuples to the counts tallied so far.
Go to step 2 until lowering L does not change the top k counts by more than a certain percentage.
This algorithm assumes the counts for the top k keys will approach a certain value the further the long tail is traversed. You can use other heuristics instead of the fixed percentage, like the number of new keys in the top k, how much the top k keys were shuffled, etc...
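A heuristic Python sketch of this loop; the lists are (key, count) tuples sorted by count descending, and the halving schedule for L plus the "top-k keys stopped changing" stopping rule are illustrative choices:

from collections import Counter

def iterative_top_k(lists, k, start_threshold):
    totals = Counter()
    positions = [0] * len(lists)
    threshold = start_threshold
    previous_top = None
    while True:
        # tally each list down to the current threshold L
        for i, lst in enumerate(lists):
            while positions[i] < len(lst) and lst[positions[i]][1] >= threshold:
                key, count = lst[positions[i]]
                totals[key] += count
                positions[i] += 1
        top = [key for key, _ in totals.most_common(k)]
        exhausted = all(p == len(lst) for p, lst in zip(positions, lists))
        if top == previous_top or exhausted:
            return totals.most_common(k)
        previous_top = top
        threshold //= 2                    # lower L to pull in more of the tail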
There is a sane way to implement this through mapreduce:
http://www.yourdailygeekery.com/2011/05/16/top-k-with-mapreduce.html
In general, I think you are in trouble. Imagine the following lists:
['a':100, 'b':99, ...]
['c':90, 'd':89, ..., 'b':2]
and you have k=1 (i.e. you want only the top one). 'b' is the right answer, but you need to look all the way down to the end of the second list to realize that 'b' beats 'a'.
Edit:
If you have the right distribution (long, low count tails), you might be able to do better. Let's keep with k=1 for now to make our lives easier.
The basic algorithm is to keep a hash map of the keys you've seen so far and their associated totals. Walk down the lists processing elements and updating your map.
The key observation is that a key can gain in count by at most the sum of the counts at the current processing point of each list (call that sum S). So at each step, you can prune from your hash map any keys whose total is more than S below your current maximum count element. (I'm not sure what data structure you would need in order to prune efficiently, since you have to look up keys given a range of counts - maybe a priority queue?)
When your hash map has only one element in it, and its count is at least S, then you can stop processing the lists and return that element as the answer. If your count distribution plays nice, this early exit may actually trigger, so you don't have to process all of the lists.
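A Python sketch of this k = 1 pruning idea (the dict rebuild stands in for the range-lookup structure mentioned above; the lists are (key, count) tuples sorted by count descending):

def top_one_with_pruning(lists):
    totals = {}
    positions = [0] * len(lists)
    while any(p < len(lst) for p, lst in zip(positions, lists)):
        # advance every list by one element
        for i, lst in enumerate(lists):
            if positions[i] < len(lst):
                key, count = lst[positions[i]]
                totals[key] = totals.get(key, 0) + count
                positions[i] += 1
        # S: the most any key could still gain from the unread tails
        s = sum(lst[p][1] for lst, p in zip(lists, positions) if p < len(lst))
        best = max(totals.values())
        totals = {key: v for key, v in totals.items() if v + s >= best}   # prune hopeless keys
        if len(totals) == 1 and best >= s:
            break                                                         # early exit
    return max(totals, key=totals.get)

print(top_one_with_pruning([[('a', 100), ('b', 99)],
                            [('c', 90), ('d', 89), ('b', 2)]]))   # 'b'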