Which algorithm can rank a list based on history?

There are N unique items.
There are K sorted lists, each consisting of a small subset of the items; no list contains the same item more than once.
The input is an unsorted list of items.
The algorithm should sort the list based on the K sorted lists.
Here is an example:
There are 100 items: Item1, Item2, ..., Item100
There are some ranked lists available: List1: Item1>Item2>Item12, List2: Item12>Item93>Item7, List3: Item1>Item3>Item97, List4: Item1>Item7>Item2
The input is: Item1, Item2, Item7 and Item98. The algorithm should sort the input based on those lists.
In machine-learning terms, I am looking for an algorithm that can predict the 'right' order of a list of items (the 'active list') based on a training set of many partially ordered lists of items, where each partially ordered list may contain items that the active list does not.

Construct a directed acyclic graph (DAG) with the input elements as nodes and an edge from Itemi to Itemj if and only if Itemi appears immediately before Itemj in some list. Then you can obtain the desired order by doing a topological sort on the DAG.
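A minimal Python sketch of this, assuming Python 3.9+ for graphlib and using the lists from the question:

from graphlib import TopologicalSorter

ranked_lists = [
    ['Item1', 'Item2', 'Item12'],
    ['Item12', 'Item93', 'Item7'],
    ['Item1', 'Item3', 'Item97'],
    ['Item1', 'Item7', 'Item2'],
]
active = {'Item1', 'Item2', 'Item7', 'Item98'}

# predecessors[j] = items that appear immediately before j in some list
predecessors = {}
for lst in ranked_lists:
    for before, after in zip(lst, lst[1:]):
        predecessors.setdefault(after, set()).add(before)

# Restrict the graph to the active items and topologically sort it.
graph = {item: {p for p in predecessors.get(item, ()) if p in active}
         for item in active}
print(list(TopologicalSorter(graph).static_order()))
# e.g. ['Item1', 'Item98', 'Item7', 'Item2'] -- incomparable items land arbitrarily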

I think what you mean is that the sorted lists define a partial ordering, yes? I.e. if Item1 appears before Item2 in one of the lists, it should be considered "bigger".
If this is correct, then the way to go is to first represent this in a more convenient form, e.g. a matrix M such that M[1][2] == 1 if Item1 precedes Item2 in one of the lists. Then we have a simple comparator function:
def compare(X, Y):
    if M[X][Y] == 1:
        return 1    # X > Y
    elif M[Y][X] == 1:
        return -1   # Y > X
    else:
        return 0    # the elements are not comparable
We can now sort the output according to this comparator.
You might want to run the transitive closure (Warshall's algorithm) on this matrix before sorting, in case there are, for example, lists Item1>Item3 and Item3>Item2 but no list in which Item2 appears together with Item1. The transitive closure lets you deduce from those two lists that Item1 should precede Item2.
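A hedged sketch of this approach (a dict of pairs stands in for the N×N matrix, and the comparator's sign is flipped so that a plain ascending sort puts earlier items first):

from functools import cmp_to_key

ranked_lists = [['Item1', 'Item2', 'Item12'], ['Item12', 'Item93', 'Item7'],
                ['Item1', 'Item3', 'Item97'], ['Item1', 'Item7', 'Item2']]

# M[(x, y)] == 1 means x precedes y in one of the lists.
M, items = {}, set()
for lst in ranked_lists:
    items.update(lst)
    for i, x in enumerate(lst):
        for y in lst[i + 1:]:
            M[(x, y)] = 1

# Warshall's transitive closure: x < k and k < y imply x < y.
for k in items:
    for x in items:
        for y in items:
            if M.get((x, k)) and M.get((k, y)):
                M[(x, y)] = 1

def compare(x, y):
    if M.get((x, y)): return -1  # x precedes y
    if M.get((y, x)): return 1
    return 0                     # not comparable

print(sorted(['Item1', 'Item2', 'Item7', 'Item98'], key=cmp_to_key(compare)))

One caveat: the question's example lists are inconsistent once the closure runs (Item2 precedes Item7 via List1 and List2, but Item7 precedes Item2 via List4), so the comparator's answer for that pair depends on which direction it checks first; real data may need cycle detection or weighting.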

I would compose a weighted graph from the input (the number of lists asserting A>B being the weight of edge A→B), put that into an N*N matrix, and perform power iteration (GIYF) on the matrix.
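A rough numpy sketch of that idea; note that the all-pairs weighting and the smoothing constant are my own assumptions, not part of the answer:

import numpy as np

ranked_lists = [['Item1', 'Item2', 'Item12'], ['Item12', 'Item93', 'Item7'],
                ['Item1', 'Item3', 'Item97'], ['Item1', 'Item7', 'Item2']]
items = sorted({x for lst in ranked_lists for x in lst})
index = {item: i for i, item in enumerate(items)}

n = len(items)
W = np.zeros((n, n))
for lst in ranked_lists:
    for i, hi in enumerate(lst):
        for lo in lst[i + 1:]:
            W[index[hi], index[lo]] += 1  # hi outranks lo

# Power iteration with light smoothing so the iteration cannot get stuck.
v = np.ones(n) / n
for _ in range(100):
    v = (W + 0.01) @ v
    v /= v.sum()

active = ['Item1', 'Item2', 'Item7', 'Item98']
scores = {item: v[index[item]] if item in index else 0.0 for item in active}
print(sorted(active, key=scores.get, reverse=True))
# items never seen in training (Item98 here) score 0 and sort last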

Related

Grouping a list of tuples into sets with coordinate-wise matching

Suppose that there is a list of n l-tuples. One is interested in grouping this list into sets of m tuples each, such that the tuples within a set match coordinate-wise as much as possible. For example:
input: {(1,2,3), (1,2,4), (2,3,1), (2,3,1), (4,3,1), (2,1,4)}, m = 3
output: {(1,2,3), (1,2,4), (2,1,4)}, {(2,3,1), (2,3,1), (4,3,1)}
It is important to note that, for certain values of n and m, some cases will result in a set with fewer elements than the others, or a set with more than m elements (tuples).
Questions: What is the name of this problem in the literature? Does an algorithm exist that performs this task? And what about leaving the number of tuples in each partition, m, unfixed and determining the optimal such m under the given restriction, if that makes sense?
Thank you.

How to assign many subsets to their largest supersets?

My data has a large number of sets (a few million). Each set's size ranges from a few members to several tens of thousands of integers. Many of those sets are subsets of larger sets (and there are many such supersets). I'm trying to assign each subset to its largest superset.
Can anyone recommend an algorithm for this type of task?
There are many algorithms for generating all possible subsets of a set, but that type of approach is time-prohibitive given my data size (e.g. this paper or SO question).
Example of my data-set:
A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}
Expected answer: B and A are subsets of F (it's not important that B is also a subset of A); C is a subset of D; E remains unassigned.
Here's an idea that might work:
Build a table that maps each number to a sorted list of sets, sorted first by size with the largest first and then, within each size, arbitrarily but in some canonical order (say, alphabetically by set name). So in your example you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E], and so on. This can be implemented to take O(n log n) time, where n is the total size of the input.
For each set in the input:
fetch the lists associated with each entry in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of lookups issued over the runtime of the whole algorithm is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
Now walk down all the fetched lists simultaneously. If a set is the first entry in every list, then it's the largest set that contains the input set; output that association and continue with the next input set. If not, discard the smallest item among the heads of the lists and try again. Implementing this last bit is tricky, but you can store the heads of all lists in a heap and get (IIRC) something like O(n log k) overall runtime, where k is the maximum size of any individual set, so you can bound it by O(n log n) in the worst case.
So if I got everything straight, the overall runtime of the algorithm is O(n log n), which is probably as good as you're going to get for this problem.
Here is a Python implementation of the algorithm:
from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
    '''Computes, for each item in the input, the largest superset in the same input.

    setlists: A list of lists, each of which represents a set of items. Items must be hashable.
    '''
    # First, build a table that maps each element in any input setlist to a list of records
    # of the form (-size of setlist, index of setlist), one for each setlist that contains
    # the corresponding element
    element_to_entries = defaultdict(list)
    for idx, setlist in enumerate(setlists):
        entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
        for element in setlist:
            element_to_entries[element].append(entry)

    # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
    # the set's index
    for entries in element_to_entries.values():
        entries.sort()

    # Now build up the output by going over each setlist and walking over the entries list for
    # each element in the setlist. Since the entries list for each element is sorted largest to
    # smallest, the first entry we find that is in every entry set we pulled will be the largest
    # element of the input that contains each item in this setlist. We are guaranteed to eventually
    # find such an element because, at the very least, the item we're iterating on itself is in
    # each entries list.
    output = []
    for idx, setlist in enumerate(setlists):
        num_elements = len(setlist)
        buckets = [element_to_entries[element] for element in setlist]

        # We implement the search for an item that appears in every list by maintaining a heap and
        # a queue. We have the invariants that:
        #   1. The queue contains the num_elements smallest items across all the buckets, in order
        #   2. The heap contains the smallest item from each bucket that has not already passed through
        #      the queue.
        smallest_entries_heap = []
        smallest_entries_deque = deque([], num_elements)
        for bucket_idx, bucket in enumerate(buckets):
            smallest_entries_heap.append((bucket[0], bucket_idx, 0))
        heapq.heapify(smallest_entries_heap)

        while (len(smallest_entries_deque) < num_elements or
               smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
            # First extract the next smallest entry in the queue ...
            (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
            smallest_entries_deque.append(smallest_entry)

            # ... then add the next-smallest item from the bucket that we just removed an element from
            if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
                new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
                heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

        output.append((idx, smallest_entries_deque[0][1]))
    return output
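For example, on the data set from the question (indices 0-5 standing for A-F), this should report F as the largest superset of A and B, D for C, and each maximal set as its own largest superset:

print(LargestSupersets([[1, 2, 3], [1, 3], [2, 4], [2, 4, 9], [3, 5], [1, 2, 3, 7]]))
# [(0, 5), (1, 5), (2, 3), (3, 3), (4, 4), (5, 5)]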
Note: don't trust my writeup too much here. I just thought of this algorithm right now, I haven't proved it correct or anything.
So you have millions of sets, with thousands of elements each. Just representing that dataset takes billions of integers, and with pairwise comparisons you'll quickly get to trillions of operations without even breaking a sweat.
Therefore I'll assume that you need a solution that distributes across a lot of machines, which means I'll think in terms of https://en.wikipedia.org/wiki/MapReduce (a series of them, in fact).
Read the sets in, mapping them to k:v pairs of i: s where i is an element of the set s.
Receive an integer key i, along with the list of sets that contain it. Map them off to pairs (s1, s2): i where s1 <= s2 are both sets that include i. Don't forget to map each set to be paired with itself!
For each pair (s1, s2), count the size k of the intersection and send off the pairs s1: k and s2: k. (Only send the second if s1 and s2 are different.)
For each set s, receive the set of supersets. If s is maximal, send off s: s. Otherwise send off t: s for every t that is a strict superset of s.
For each set s, receive the set of subsets, with s in the list only if it is maximal. If s is maximal, send off t: s for every t that is a subset of s.
For each set, receive the set of maximal sets that it is a subset of. (There may be many.)
There are a lot of steps here, but at its heart this requires repeated comparisons between pairs of sets with a common element, once per common element. Potentially that is O(n * n * m), where n is the number of sets and m is the number of distinct elements that appear in many sets.
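As a single-machine sanity check of the pipeline (a hedged sketch: an inverted index stands in for the shuffle, and the function name is mine):

from collections import defaultdict
from itertools import combinations

def largest_supersets(named_sets):
    # Stage 1: invert -- map each element to the sets containing it.
    index = defaultdict(list)
    for name, s in named_sets.items():
        for element in s:
            index[element].append(name)

    # Stages 2-3: for every pair of sets sharing an element, record a
    # superset relation whenever one strictly contains the other.
    supersets = defaultdict(set)  # name -> names of strict supersets
    for names in index.values():
        for a, b in combinations(names, 2):
            if named_sets[a] < named_sets[b]:
                supersets[a].add(b)
            elif named_sets[b] < named_sets[a]:
                supersets[b].add(a)

    # Stages 4-6: keep, for each set, only its maximal supersets.
    maximal = {name for name in named_sets if not supersets[name]}
    return {name: supersets[name] & maximal for name in named_sets}

sets = {'A': {1, 2, 3}, 'B': {1, 3}, 'C': {2, 4},
        'D': {2, 4, 9}, 'E': {3, 5}, 'F': {1, 2, 3, 7}}
print(largest_supersets(sets))
# {'A': {'F'}, 'B': {'F'}, 'C': {'D'}, 'D': set(), 'E': set(), 'F': set()}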
Here is a simple suggestion for an algorithm that might give better results based on your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members, and many super/subset relations). Of course it depends a lot on your data, and generally speaking the complexity is much worse than for the other proposed algorithms. You could process only the sets with fewer than X members (say, X = 1000) this way and use the other proposed methods for the rest; a sketch follows the steps.
1. Sort the sets by their size.
2. Remove the first (smallest) set and start comparing it against the others from behind (largest set first).
3. Stop as soon as you find a superset and record the relation; simply remove the set if no superset is found.
4. Repeat steps 2 and 3 for all but the last set.
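A quick Python sketch of these steps (the function name is mine; the sets are the ones from the earlier example):

def assign_supersets(named_sets):
    items = sorted(named_sets.items(), key=lambda kv: len(kv[1]))
    relations = {}
    for i, (name, s) in enumerate(items[:-1]):      # all but the last set
        # compare against the remaining sets from behind (largest first)
        for other_name, other in reversed(items[i + 1:]):
            if s <= other:
                relations[name] = other_name        # first superset found
                break
    return relations

sets = {'A': {1, 2, 3}, 'B': {1, 3}, 'C': {2, 4},
        'D': {2, 4, 9}, 'E': {3, 5}, 'F': {1, 2, 3, 7}}
print(assign_supersets(sets))  # {'B': 'F', 'C': 'D', 'A': 'F'}, order may vary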
If you're using Excel, you could structure it as follows:
1) Create a Cartesian plot as a two-way table with all your data sets as titles on both the side and the top.
2) In a separate tab, create a row for each data set in the first column, along with a second column that counts the number of entries (e.g. F has 4), and then stack FIND(",") and MID formulas across the sheet to split out all the entries within each data set. Use the counter in the second column to do COUNTIF(">0"). Each variable you find can be the starting point of a subsequent FIND until it runs out of variables and just returns a blank.
3) Go back to your Cartesian plot and bring over the separate entries you just generated as your column titles (e.g. F is 1,2,3,7). Use an AND statement to check that each entry in your left-hand column is in your top-row data set, using an OFFSET to your separate area and your counter as the width for the OFFSET.

Looking for an algorithm to match up objects from 2 lists depending on distance

So I have 2 lists of objects, each object with a position. I would like to match every object from the first list with an object from the second list.
Once an object from the second list is selected for a match, we remove it from the list (so it cannot be matched with another one). Most importantly, the total sum of distances between the matched objects should be as small as possible.
For example:
list1 { A, B, C } list2 { X, Y, Z }
So if I match up A->X (dist: 3 meters), B->Z (dist: 2 meters), C->Y (dist: 4 meters):
Total sum = 3 + 2 + 4 = 9 meters
We could have another match-up with A->Y (4 meters), B->X (1 meter), C->Z (3 meters):
Total sum = 4 + 1 + 3 = 8 meters <======= Better solution
Thank you for your help.
Extra: the lists could have different lengths.
This problem is known as the Assignment Problem (a minimum-weight matching in a bipartite graph).
An algorithm which solves it is the Hungarian algorithm; at the bottom of the Wikipedia article there is also a list of implementations.
If your data has special properties, for example if your two sets are 2D points and the weight of an edge is the Euclidean distance, then there are better algorithms for this.
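For example, with SciPy (the coordinates below are invented; scipy.optimize.linear_sum_assignment solves exactly this problem and also accepts rectangular cost matrices, which covers lists of different lengths):

import numpy as np
from scipy.optimize import linear_sum_assignment

list1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # A, B, C
list2 = np.array([[3.0, 0.0], [0.0, 4.0], [2.0, 2.0]])  # X, Y, Z

# Pairwise Euclidean distances form the cost matrix.
cost = np.linalg.norm(list1[:, None, :] - list2[None, :, :], axis=2)

rows, cols = linear_sum_assignment(cost)  # minimum-cost matching
print(list(zip(rows, cols)), cost[rows, cols].sum())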

Check if a collection of sets is pairwise disjoint

What is the most efficient way to determine whether a collection of sets is pairwise disjoint, i.e. to verify that the intersection of every pair of sets is empty? How efficiently can this be done?
The sets in a collection are pairwise disjoint if, and only if, the size of their union equals the sum of their sizes (this statement applies to finite sets):
def pairwise_disjoint(sets) -> bool:
    union = set().union(*sets)
    return len(union) == sum(map(len, sets))
This could be a one-liner, but readability counts.
Expected linear time O(total number of elements):
def all_disjoint(sets):
    union = set()
    for s in sets:
        for x in s:
            if x in union:
                return False
            union.add(x)
    return True
This is optimal under the assumption that your input is a collection of sets represented as some kind of unordered data structure (a hash table, say), because then you need to look at every element at least once.
You can do much better by using a different representation for your sets. For example, by maintaining a global hash table that stores, for each element, the number of sets it occurs in, you can do all the set operations optimally and also check for disjointness in O(1).
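A small sketch of that bookkeeping (the class and its interface are my own illustration):

from collections import Counter

class DisjointTracker:
    def __init__(self):
        self.counts = Counter()  # element -> number of sets containing it
        self.duplicated = 0      # elements currently in >= 2 sets

    def add_set(self, s):
        for x in s:
            self.counts[x] += 1
            if self.counts[x] == 2:
                self.duplicated += 1

    def all_disjoint(self):
        return self.duplicated == 0  # O(1) query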
Using Python as pseudo-code, the following tests the intersection of each pair of sets only once:
def all_disjoint(sets):
    S = list(sets)
    while S:
        s = S.pop()  # remove an element
        # loop over the remaining ones
        for t in S:
            # test for intersection
            if not s.isdisjoint(t):
                return False
    return True
The number of intersection tests is the same as the number of edges in a complete graph with as many vertices as there are sets. It also exits early if any pair is found not to be disjoint.

Find if any set is covered by member sets

[Please let me know if this maps to a known problem]
I have n sets of varying sizes. Each element in a set is unique, and each element can occur in at most two different sets.
I want to perform an operation on these sets while neither duplicating nor missing any element.
Problem: find out which of these n sets should be removed because they are covered by the other sets.
E.g. [a,b,c]; [a]; [b]. Remove [a] and [b], since both are covered by the first set.
E.g. [a,b,c]; [a]; [b]; [c,d]. Remove [a,b,c], since all three of its elements are covered by the remaining sets.
Note: here [a],[b] alone is not a valid answer, since 'c' would be duplicated. Similarly, [a],[b],[c,d] is not a valid answer, since 'd' would be missed if they were removed.
I think that this is the Exact Cover problem. The last constraint—that each element is in at most two sets—doesn't seem to me to fundamentally change the problem (although I could easily be wrong about this). The Wikipedia web page contains a good summary of various algorithmic approaches. The algorithm of choice seems to be Dancing Links.
I think this is a case of the 2-SAT problem, which can be solved in linear time using a method based on Tarjan's strongly-connected-components algorithm.
Make a variable Ai for each set i; Ai is true if and only if set i is to be included.
For each element that appears in a single set i, add the unit clause Ai.
For each element that appears in two sets i and j, add the clauses (Ai || Aj) and (~Ai || ~Aj); together they force exactly one of Ai and Aj to be included.
You can now run a standard 2-SAT algorithm to determine whether the constraints are satisfiable and, if so, to find a satisfying assignment.
For a case with V sets and N elements you will have V variables and up to 2N clauses, so Tarjan's algorithm will have complexity O(V + 2N) = O(V + N).
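A hedged sketch of this construction (the solver below uses Kosaraju's two-pass SCC algorithm rather than Tarjan's single-pass one, and encodes the question's second example):

def solve_2sat(n_vars, clauses):
    # Literals are +i / -i for variable i (1-based); node() maps a literal
    # to a vertex of the implication graph.
    def node(lit):
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    N = 2 * n_vars
    graph = [[] for _ in range(N)]
    rgraph = [[] for _ in range(N)]
    for x, y in clauses:  # (x or y) yields implications ~x -> y and ~y -> x
        graph[node(-x)].append(node(y))
        rgraph[node(y)].append(node(-x))
        graph[node(-y)].append(node(x))
        rgraph[node(x)].append(node(-y))

    order, seen = [], [False] * N  # first pass: record finishing order
    def dfs1(u):
        seen[u] = True
        for v in graph[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)
    for u in range(N):
        if not seen[u]:
            dfs1(u)

    comp = [-1] * N  # second pass: label SCCs in topological order
    def dfs2(u, c):
        comp[u] = c
        for v in rgraph[u]:
            if comp[v] == -1:
                dfs2(v, c)
    c = 0
    for u in reversed(order):
        if comp[u] == -1:
            dfs2(u, c)
            c += 1

    # A variable is true iff its SCC comes later than its negation's.
    result = []
    for i in range(n_vars):
        if comp[2 * i] == comp[2 * i + 1]:
            return None  # variable equivalent to its negation: unsatisfiable
        result.append(comp[2 * i] > comp[2 * i + 1])
    return result

# Second example: set 1 = [a,b,c], 2 = [a], 3 = [b], 4 = [c,d].
clauses = [(1, 2), (-1, -2),   # a appears in sets 1 and 2
           (1, 3), (-1, -3),   # b appears in sets 1 and 3
           (1, 4), (-1, -4),   # c appears in sets 1 and 4
           (4, 4)]             # d appears only in set 4
print(solve_2sat(4, clauses))  # [False, True, True, True]: remove [a,b,c]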
Since an element can appear in no more than two sets, the connections between sets are fairly simple and can be drawn as a graph, with an edge between any two sets that share an element.
Viewed this way, the sets can be divided into three groups:
Sets where all elements appear twice. These sets could potentially be removed and/or the sets that contain those elements could be removed.
Sets where one or more elements appear twice. The elements that appear twice could potentially link to sets that could be removed.
Sets where no elements appear twice. These sets can be ignored.
It's not really clear what happens if all of the sets are in group 1 or group 3. However, there seems to be a fairly simple criterion that allows for quickly removing sets, and the pseudocode looks like this:
for each set in group 2:
    for each element that appears twice in that set:
        if the other set that contains that element is in group 1:
            remove the other set
The performance is then linear in the number of elements.
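A runnable Python take on this criterion (the function name and grouping code are mine; both examples from the question check out):

from collections import defaultdict

def removable_sets(sets):
    # Map each element to the indices of the (at most two) sets containing it.
    where = defaultdict(list)
    for i, s in enumerate(sets):
        for x in s:
            where[x].append(i)

    twice = [{x for x in s if len(where[x]) == 2} for s in sets]
    group1 = {i for i, s in enumerate(sets) if twice[i] == s}
    group2 = {i for i, s in enumerate(sets) if twice[i] and twice[i] != s}

    removed = set()
    for i in group2:
        for x in twice[i]:
            other = next(j for j in where[x] if j != i)
            if other in group1:
                removed.add(other)
    return removed

print(removable_sets([{'a','b','c'}, {'a'}, {'b'}]))             # {1, 2}
print(removable_sets([{'a','b','c'}, {'a'}, {'b'}, {'c','d'}]))  # {0}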
I tried to find which sets to include rather than remove. Something like this?
(1) List the elements and the indexes of the sets they are in.
(2) Prime the answer list with the indexes of sets that have elements that appear only in them.
(3) Comb the map from (1), and if an element's set-index is not in the answer list, add to the answer the index of the smallest set that element is in.
Haskell code:
import Data.List (nub, minimumBy, intersect)

sets = [["a","b","c","e"],["a","b","d"],["a","c","e"]]
lengths = map length sets

--List elements and the indexes of sets they are in
mapped = foldr map [] (nub . concat $ sets) where
  map a b = comb (a,[]) sets 0 : b
  comb result [] _ = result
  comb (a,list) (x:xs) index | elem a x  = comb (a,index:list) xs (index + 1)
                             | otherwise = comb (a,list) xs (index + 1)

--List indexes of sets that have elements that appear only in them
haveUnique = map (head . snd)
           . filter (\(element,list) -> null . drop 1 $ list)
           $ mapped

--Comb the map and if an element's set-index is not in the answer list,
--add to the answer the index of the smallest set that element is in.
answer = foldr comb haveUnique mapped where
  comb (a,list) b
    | not . null . intersect list $ b = b
    | otherwise =
        minimumBy (\setIndexA setIndexB ->
          compare (lengths!!setIndexA) (lengths!!setIndexB)) list : b
OUTPUT:
*Main> sets
[["a","b","c","e"],["a","b","d"],["a","c","e"]]
*Main> mapped
[("a",[2,1,0]),("b",[1,0]),("c",[2,0]),("e",[2,0]),("d",[1])]
*Main> haveUnique
[1]
*Main> answer
[2,1]
