Find mode of a multiset in given time bound (most multiplicity) - algorithm

The given problem:
A multiset is a set in which some of the elements occur more than once (e.g., {a, f, b, b, e, c, b, g, a, i, b} is a multiset). The elements are drawn from a totally ordered set. Present an algorithm that, when presented with a multiset as input, finds an element that has the most occurrences in the multiset (e.g., in {a, f, b, b, e, c, b, g, a, c, b}, b has the most occurrences). The algorithm should run in O(n lg(n/M) + n) time, where n is the number of elements in the multiset and M is the highest number of occurrences of any element in the multiset. Note that you do not know the value of M.
[Hint: Use a divide-and-conquer strategy based on the median of the list. The subproblems generated by the divide-and-conquer strategy cannot be smaller than a ‘certain’ size in order
to achieve the given time bound.]
Our initial solution:
Our idea was to use the Boyer–Moore majority-vote algorithm to determine whether the multiset contains a majority element (e.g., {a, b, b} has a majority, b). If so, we output it; otherwise we find the median of the list using a given linear-time selection algorithm (known as Select) and split the list into three sublists (elements less than, equal to, and greater than the median). We would then check each of the sublists for a majority element in the same way; whenever one is found, that is the result. (A sketch of the majority check follows the example below.)
For example, given the multiset {a, b, c, d, d, e, f}
Step 1: check for majority. None found, split the list based on the median.
Step 2: L1 = {a, b, c, d, d}, L2 = {e, f} Find the majority of each. None found, split the lists again.
Step 3: L11 = {a, b, c}, L12 = {d, d}, L21 = {e}, L22 = {f}. Check each for majority elements. L12 returns d. In this case, d is the most occurring element in the original multiset, and thus the answer.
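For reference, the majority check in each step can be done in linear time with the Boyer–Moore vote plus one verification pass. A minimal sketch (our own helper, not part of the original problem):

def majority(piece):
    # Boyer-Moore vote: the surviving candidate is the only possible majority.
    candidate, votes = None, 0
    for x in piece:
        if votes == 0:
            candidate = x
        votes += 1 if x == candidate else -1
    # One verification pass confirms (or rejects) the candidate.
    if piece.count(candidate) * 2 > len(piece):
        return candidate
    return None

print(majority(['a', 'b', 'b']))       # 'b'
print(majority(['a', 'b', 'c', 'd']))  # None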
The issues we're having are whether this type of algorithm is fast enough, as well as whether this can be done recursively or if a loop that terminates is required. Like the hint says, the sub-problems cannot be smaller than a 'certain' size, which we believe to be M (the most occurrences).

If you use recursion in the most straightforward way, as described in your post, it will not have the desired time complexity. Why? Assume the answer element is the largest one. Then it is always located in the rightmost branch of the recursion. But we call the left branch first, and that branch can go much deeper if all its elements are distinct (producing pieces of size 1, while we never want pieces smaller than M).
Here is a correct solution:
Let's always split the array into three parts at each step, as described in your question. Now let's step aside and look at what we have: the recursive calls form a tree. To get the desired time complexity, we should never descend deeper than the level where the answer is located. To achieve this, traverse the tree breadth-first with a queue instead of depth-first. That's it.
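One way to flesh this out in Python (a sketch, with two liberties: sorted() stands in for a worst-case linear-time Select, which the real bound requires, and the size-based pruning rule is our addition). Note that splitting by the median keeps all copies of a value in one piece, so a verified count is the element's true multiplicity:

from collections import deque

def multiset_mode(a):
    best_elem, best_count = None, 0
    queue = deque([list(a)])          # FIFO queue = breadth-first traversal
    while queue:
        piece = queue.popleft()
        if len(piece) <= best_count:  # nothing inside can beat the best so far
            continue
        # Boyer-Moore vote, then verify the candidate with an exact count.
        cand, votes = None, 0
        for x in piece:
            if votes == 0:
                cand = x
            votes += 1 if x == cand else -1
        count = piece.count(cand)
        if 2 * count > len(piece):    # majority found: stop splitting this piece
            if count > best_count:
                best_elem, best_count = cand, count
            continue
        # No majority: split into <, ==, > parts around the median.
        med = sorted(piece)[len(piece) // 2]   # stand-in for linear-time Select
        eq_count = piece.count(med)            # the == part is one value: exact count
        if eq_count > best_count:
            best_elem, best_count = med, eq_count
        queue.append([x for x in piece if x < med])
        queue.append([x for x in piece if x > med])
    return best_elem, best_count

print(multiset_mode(['a', 'f', 'b', 'b', 'e', 'c', 'b', 'g', 'a', 'i', 'b']))  # ('b', 4)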

If you want to do this in real life, it is worth considering using a hash table to track the counts. This can have amortized O(1) complexity per hash table access, so the overall complexity of the following Python code is O(n).
import collections
C = collections.Counter(['a','f','b','b','e','c','b','g','a','i','b'])
most_common_element, highest_count = C.most_common(1)[0]

Related

Given values v1, v2, ..., which get updated one by one, maintain the maximum over subsets (v_i1, v_i2, ...)

To set some notation, we have an array of size N consisting of non-negative floats V = [v1, v2, ..., vN], as well as M subsets S1, S2, ..., SM of {1, 2, ..., N} (the subsets will overlap). We are interested in the quantities w_j = max(v_i for i in Sj). The problem is to devise a data structure which can maintain w_j as efficiently as possible, while the values in the array V get updated one by one. We should assume that M >> N.
One idea is to construct the "inverse" of the subsets S: subsets T1, T2, ..., TN of {1, 2, ..., M} such that i is in Sj if and only if j is in Ti. Then, when vi is updated, scan every j in Ti and recalculate w_j from scratch. This takes O(TN) time per update, where T is the maximum size of any Ti.
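A direct sketch of that idea (class and method names are ours):

class SubsetMaxima:
    def __init__(self, V, S):
        self.V = list(V)                    # the value array
        self.S = [list(s) for s in S]       # subsets of indexes into V
        # inverse map: T[i] lists the subsets that contain index i
        self.T = [[] for _ in range(len(V))]
        for j, s in enumerate(self.S):
            for i in s:
                self.T[i].append(j)
        self.w = [max(self.V[i] for i in s) for s in self.S]

    def update(self, i, value):
        # O(|T_i| * max|S_j|) per update: recompute every affected maximum.
        self.V[i] = value
        for j in self.T[i]:
            self.w[j] = max(self.V[k] for k in self.S[j])

m = SubsetMaxima([1.0, 5.0, 2.0], [[0, 1], [1, 2], [0, 2]])
m.update(1, 0.5)
print(m.w)  # [1.0, 2.0, 2.0]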
I believe I see a way to maintain these in O(T log N) time, but the algorithm involves a rather convoluted structure of copies of binary search trees and lookup tables. Is there a simpler data structure to use, or a simple known solution to this problem? Does this problem have a name?
As well, since we have M >> N, it would be ideal to reduce the complexity from O(M), but is this even possible?
Edit: The goal is to construct some data structure which allows efficiently maintaining the maximums when the V array is updated. You cannot construct this data structure in less than O(M) time, but it may be possible to update it in less than that whenever a single entry of the V array changes.
According to my comment: we have M sets that may overlap. On the other hand, each set contains at least one number, so we must read all M sets, each of size at least 1, at least once. As a result, Ω(M) is a lower bound for this problem.

Establishing chronology of a list based on a directed graph tree

So, in a personal project I've been working on, I came across the following problem, and I've been struggling to come up with a solution since my maths skills are not terribly great.
Let's say you have the following tree of numbers a, b, c, d, e, f, g, h:
      a
     / \
    b   c
   / \   \
  g   d   f
  |   |
  h   e
Each step down the tree means that the next number is bigger than the previous one. So a < b, d < e, a < c. However, it is impossible to determine whether b > c or b < c; we can only tell that both numbers are bigger than a.
Let's say we have an ordered list of numbers, for instance [a, b, c, d, e]. How do we write an algorithm that checks whether the order of the numbers in the list (assuming that L[i] < L[i+1]) is, in fact, consistent with the information we have according to this tree?
E.g., both [a, c, b, d, e] and [a, b, d, c, e] are correct, but [c, a, b, d, e] is not (since we know that c > a but nothing else about how c relates to the other numbers).
For the sake of the algorithm, let's assume that our access to the tree is a function provably_greater(X, Y) which returns true if the tree knows that one number is greater than another, e.g. provably_greater(a, d) = True, but provably_greater(d, f) = False. Naturally, if a number is provably not greater, it also returns false.
This is not a homework question, I have abstracted the problem quite a lot to make it more clear, but solving this problem is quite crucial for what I'm trying to do. I've made several attempts at cracking it myself, but everything that I come up with ends up being insufficient for some edge case I find out about later.
Thanks in advance.
Your statement "everything that I come up with ends up being insufficient for some edge case I find out about later" makes it seem that you have no solution at all. Here is a brute-force algorithm that should work in all cases. I can think of several possible ways to improve the speed, but this is a start.
First, set up a data structure that allows a quick evaluation of provably_greater(X, Y) based on the tree. This structure can be a set or hash table, which will take a lot of memory but allow fast access. For each leaf of the tree, take the path up to the root. At each node on that path, add an ordered pair to the set recording the less-than relation between that node and each node below it on the path. In your example tree, if you start at node h you move up to node g and add (g,h) to the set, then move up to node b and add the pairs (b,h) and (b,g), then move up to node a and add the pairs (a,h), (a,g), and (a,b). Do the same for leaf nodes e and f. The pair (a,b) will be added twice, due to the leaf nodes h and e, but a set structure handles this easily.
The function provably_greater(X, Y) is now quick and easy: the result is True if the pair (Y,X) is in the set and False otherwise.
You now look at all pairs of numbers in your list--for the list [a,b,c,d,e] you would look at the pairs (a,b), (a,c), (b,c), etc. If provably_greater(X, Y) is true for any of those pairs, the list is out of order. Otherwise, the list is in order.
This should be very easy to implement in a language like Python. Let me know if you would like some Python 3 code.
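Here is such a sketch, following this answer's convention that provably_greater(X, Y) means X is provably greater than Y (it builds the pair set top-down from the root rather than leaf by leaf, which produces the same set; the tree encoding is ours):

# The example tree, as parent -> children.
tree = {'a': ['b', 'c'], 'b': ['g', 'd'], 'c': ['f'],
        'g': ['h'], 'd': ['e'], 'f': [], 'h': [], 'e': []}

less_than = set()   # (x, y) in the set means x is provably less than y

def collect(node):
    # Returns all descendants of node, recording (ancestor, descendant) pairs.
    descendants = []
    for child in tree[node]:
        descendants += [child] + collect(child)
    for d in descendants:
        less_than.add((node, d))
    return descendants

collect('a')

def provably_greater(x, y):
    return (y, x) in less_than   # x > y iff y < x is recorded

def list_in_order(lst):
    # Out of order iff some earlier element is provably greater than a later one.
    return not any(provably_greater(lst[i], lst[j])
                   for i in range(len(lst)) for j in range(i + 1, len(lst)))

print(list_in_order(['a', 'c', 'b', 'd', 'e']))  # True
print(list_in_order(['c', 'a', 'b', 'd', 'e']))  # False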
I'm going to ignore your provably_greater function and assume access to the tree so that I can provide an efficient algorithm.
First, perform an Euler tour of the tree, remembering the start and end indexes for every node. If you use the same tree to check a lot of lists, you only have to do this once. See https://www.geeksforgeeks.org/euler-tour-tree/
Create an initially empty binary search tree of indexes
Iterate through the list. For each node, check whether the search tree contains any index between the node's start and end Euler tour indexes. If it does, the list is out of order. If it does not, insert the node's start index into the search tree. This prevents any provably lesser node from appearing later in the list.
That's it: O(N log N) altogether, for each list.
A TreeSet in Java or std::set in C++ can be used for the binary search tree.
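A sketch of this answer in Python, with a sorted list via bisect standing in for the balanced search tree (insort is O(N) per insert, so use a real BST such as sortedcontainers.SortedList to get the stated O(log N) per operation):

import bisect

tree = {'a': ['b', 'c'], 'b': ['g', 'd'], 'c': ['f'],
        'g': ['h'], 'd': ['e'], 'f': [], 'h': [], 'e': []}

start, end, clock = {}, {}, 0

def euler(node):
    # Assign each node a half-open interval [start, end) covering its subtree.
    global clock
    start[node] = clock
    clock += 1
    for child in tree[node]:
        euler(child)
    end[node] = clock

euler('a')

def in_order(lst):
    seen = []   # sorted start indexes of nodes already seen
    for node in lst:
        # A seen index strictly inside this node's interval belongs to a
        # proper descendant, i.e. a provably greater node seen earlier.
        i = bisect.bisect_right(seen, start[node])
        if i < len(seen) and seen[i] < end[node]:
            return False
        bisect.insort(seen, start[node])
    return True

print(in_order(['a', 'c', 'b', 'd', 'e']))  # True
print(in_order(['c', 'a', 'b', 'd', 'e']))  # False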

Sorting sequences where the binary sorting function return is undefined for some pairs

I'm doing some computational mathematics work where I'm trying to sort a sequence using a complex mathematical sorting predicate that isn't always defined between two elements of the sequence. I'm trying to learn more about sorting algorithms that gracefully handle element-wise comparisons that cannot be made, as I've only managed a very rudimentary approach so far.
My apologies if this question is some classical problem and it takes me some time to define it, algorithmic design isn't my strong suit.
Defining the problem
Suppose I have a sequence A = {a, b, c, d, e}. Let's define f(x,y) to be a binary function which returns 0 if x < y and 1 if y <= x, by applying some complex sorting criteria.
Under normal conditions, this would provide enough detail for us to sort A. However, f can also return -1 if the sorting criterion is not well-defined for that particular pair of inputs. Undefinedness is symmetric: f(q,r) is undefined if and only if f(r,q) is undefined.
I want to sort the sequence A, if possible, using the sorting criteria that are well defined.
For instance let's suppose that
f(a,d) = f(d,a) is undefined.
All other input pairs to f are well defined.
Then despite not knowing the inequality relation between a and d, we will be able to sort A based on the well-defined sorting criteria as long as a and d are not adjacent to one another in the resulting "sorted" sequence.
For instance, suppose we first determined the relative sorting of A - {d} to be {c, a, b, e}, as all of those pairs passed to f are well-defined. This could invoke any sorting algorithm, really.
Then we might call f(d,c), and
if d < c we are done - the sorted sequence is indeed {d, c, a, b, e}.
Else, we move to the next element in the sequence, and try to call f(a, d). This is undefined, so we cannot establish d's position from this angle.
We then call f(d, e), and move from right to left element-wise.
If we find some element x where d > x, we are done.
If we end up back at comparing f(a, d) once again, we have established that we cannot sort our sequence based on the well-defined sorting criterion we have.
The question
Is there a classification for these kinds of sorting algorithms, which handle undefined comparison pairs?
Better yet, although not expected: is there a well-known "efficient" approach? I have defined my own extremely rudimentary brute-force algorithm which solves this problem, but I am certain it is not ideal.
It effectively throws out each sequence element that cannot be compared when it is encountered, sorts the remaining subsequence if any elements remain, and then exhaustively attempts to place each of the discarded elements into the sorted subsequence.
Simply a path on which to do further research into this topic would be great - I lack experience with algorithms and consequently have struggled to find out where I should be looking for some more background on these sorts of problems.
This is very close to topological sorting, with your binary relation providing the edges. In particular, this is just extending a partial order into a total order. Naively, if you consider all pairs and run a topological sort (which is O(V+E)), you get a worst-case O(n^2) algorithm (more precisely O(n+p), with n the number of elements and p the number of comparable pairs).
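A sketch of that naive approach using Kahn's algorithm (the example criteria are invented: a total order c < a < b < d < e with the pair (a, d) undefined). It returns one linear extension of the partial order; leftover nodes with nonzero in-degree would signal a genuine cycle:

from collections import defaultdict, deque

def sort_partial(items, f):
    # Build edges only for the defined comparisons.
    succ = defaultdict(set)
    indeg = {x: 0 for x in items}
    for i, x in enumerate(items):
        for y in items[i + 1:]:
            r = f(x, y)
            if r == 0:
                succ[x].add(y); indeg[y] += 1
            elif r == 1:
                succ[y].add(x); indeg[x] += 1
    queue = deque(x for x in items if indeg[x] == 0)
    out = []
    while queue:
        x = queue.popleft()
        out.append(x)
        for y in succ[x]:
            indeg[y] -= 1
            if indeg[y] == 0:
                queue.append(y)
    return out   # len(out) < len(items) would mean the relation had a cycle

rank = {'c': 0, 'a': 1, 'b': 2, 'd': 3, 'e': 4}   # hypothetical criteria
def f(x, y):
    if {x, y} == {'a', 'd'}:
        return -1                      # the undefined pair
    return 0 if rank[x] < rank[y] else 1

print(sort_partial(list('abcde'), f))  # ['c', 'a', 'b', 'd', 'e']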

Is it possible to find the list of attributes which would yield the greatest sum without brute forcing?

I have about 2M records stored in a table.
Each record has a number and about 5K boolean attributes.
So the table looks something like this.
3, T, F, T, F, T, T, ...
29, F, F, T, F, T, T, ...
...
-87, T, F, T, F, T, T, ...
98, F, F, T, F, F, T, ...
And I defined SUM(A, B) as the sum of the numbers where the Ath and Bth attributes are true.
For example, from the sample data above: SUM(1, 3) = 3 + ... + (-87) because the 1st and the 3rd attributes are T for 3 and -87
3, (T), F, (T), F, T, T, ...
29, (F), F, (T), F, T, T, ...
...
-87, (T), F, (T), F, T, T, ...
98, (F), F, (T), F, F, T, ...
And SUM() can take any number of parameters: SUM(1) and SUM(5, 7, ..., 3455) are all possible.
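In code, the definition reads as follows (the records layout, value first and then a tuple of attribute flags, is our assumption):

def SUM(records, *attrs):
    # Sum of values of records whose listed attributes (1-indexed) are all true.
    return sum(v for v, a in records if all(a[i - 1] for i in attrs))

records = [(3,   (True,  False, True, False, True,  True)),
           (29,  (False, False, True, False, True,  True)),
           (-87, (True,  False, True, False, True,  True)),
           (98,  (False, False, True, False, False, True))]
print(SUM(records, 1, 3))  # 3 + (-87) = -84 over just these sample rows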
Are there smart algorithms for finding a list of attributes L where SUM(L) yields the maximum result?
Obviously, brute forcing is not feasible for this large data set.
It would be awesome if there is a way to find not only the maximum but top N lists.
EDIT
It seems like it is not possible to find THE answer without brute forcing. If I changed the question to find a "good estimation", would there be a good way to do it?
Or, what if I said the cardinality of L is fixed to something like 10, would there be a way to calculate the L?
I would be happy with any.
Unfortunately, this problem is NP-complete. Your options are limited to finding a good but non-maximal solution with an approximation algorithm, or using branch-and-bound and hoping that you don't hit exponential runtime.
Proof of NP-completeness
To prove that your problem is NP-complete, we reduce the set cover problem to your problem. Suppose we have a set U of N elements, and a set S of M subsets of U, where the union of all sets in S is U. The set cover problem asks for the smallest subset T of S such that every element of U is contained in an element of T. If we had a polynomial-time algorithm to solve your problem, we could solve the set cover problem as follows:
First, construct a table with M+N rows and M attributes. The first N rows are "element" rows, each corresponding to an element of U. These have value "negative enough"; -M-1 should be enough. For element row i, the jth attribute is true if the corresponding element is not in the jth set in S.
The last M rows are "set" rows, each corresponding to a set in S. These have value 1. For set row N+i, the ith attribute is false and all others are true.
The values of the element rows are small enough that any choice of attributes that excludes all element rows beats any choice that includes even one element row. Since the union of all sets in S is U, picking all attributes excludes all element rows, so the best choice of attributes is the one that includes the most set rows while including no element row. By the construction of the table, a choice of attributes excludes all element rows exactly when the union of the corresponding sets is U, and in that case its score is better the fewer attributes it includes. Thus, the best choice of attributes corresponds directly to a minimum cover of U by sets from S.
If we had a good algorithm to pick a choice of attributes that produces the maximal sum, we could apply it to this table to generate the minimum cover of an arbitrary S. Thus, your problem is as hard as the NP-complete set cover problem, and you should not waste your time trying to come up with an efficient algorithm to generate the perfect choice of attributes.
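For concreteness, here is the construction from the proof as a small function (a sketch; the representation is ours):

def reduction_table(U, S):
    # U: list of elements; S: list of sets whose union is U.
    # Returns (value, attribute-tuple) rows as described above.
    M = len(S)
    rows = []
    for u in U:                        # element rows: value "negative enough"
        rows.append((-M - 1, tuple(u not in s for s in S)))
    for i in range(M):                 # set rows: value 1, own attribute false
        rows.append((1, tuple(j != i for j in range(M))))
    return rows

# Tiny instance: U = {1, 2, 3}, S = [{1, 2}, {2, 3}, {3}]. The best attribute
# choice is {0, 1}, matching the minimum cover {1, 2}, {2, 3}.
for row in reduction_table([1, 2, 3], [{1, 2}, {2, 3}, {3}]):
    print(row)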
You could try a genetic algorithm approach, starting out with a certain (large) number of random attribute combinations, letting the worst x% die and mutating a certain percentage of the remaining population by adding/removing attributes.
There is no guarantee that you will find the optimal answer, but a good chance to find a good one within reasonable time.
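A sketch of that approach (all parameters and the mutation scheme are arbitrary choices, not a tuned recipe):

import random

def score(records, attrs):
    return sum(v for v, a in records if all(a[i] for i in attrs))

def genetic_search(records, n_attrs, pop_size=100, keep=0.5, generations=50):
    # Start from random attribute combinations.
    population = [frozenset(random.sample(range(n_attrs), 2))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Let the worst half die.
        population.sort(key=lambda s: score(records, s), reverse=True)
        population = population[:int(pop_size * keep)]
        # Refill by mutating survivors: flip one random attribute in or out.
        while len(population) < pop_size:
            child = set(random.choice(population))
            i = random.randrange(n_attrs)
            if i in child and len(child) > 1:
                child.remove(i)
            else:
                child.add(i)
            population.append(frozenset(child))
    return max(population, key=lambda s: score(records, s))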
No polynomial algorithm to solve this problem comes to my mind. I can only suggest a greedy heuristic:
For each attribute, compute its expected_score, i.e. the addend it would bring to your SUM if selected alone. In the example, the expected score of attribute 1 is 3 - 87 = -84.
Sort the attributes by expected_score in non-increasing order.
Following that order, greedily add attributes to L. Call actual_score(a) the change that attribute a actually makes to your sum (it can be better or worse than expected_score, depending on the attributes already in L). If actual_score(a) is not strictly positive, discard a.
This will not give you the optimal L, but I think a "fairly good" one.
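One reading of this heuristic in code (the records layout is assumed as before):

def greedy_attributes(records):
    n = len(records[0][1])
    def score(chosen):
        return sum(v for v, a in records if all(a[i] for i in chosen))
    # expected_score: the sum the attribute would select on its own.
    expected = [sum(v for v, a in records if a[i]) for i in range(n)]
    order = sorted(range(n), key=lambda i: expected[i], reverse=True)
    chosen, best = [], score([])        # empty choice = sum of all records
    for i in order:
        s = score(chosen + [i])
        if s > best:                    # keep only strictly positive actual_score
            chosen.append(i)
            best = s
    return chosen, best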
Note: see below why this approach will not give the best results.
My first approach would be to start off with the special case L={} (which should give the sum of all integers) and add that to a list of solutions. From there add possible attributes as restrictions. In the first iteration, try each attribute in turn and remember those that gave a better result. After that iteration, put the remembered ones into a list of solutions.
In the second iteration, try to add another attribute to each of the remembered ones. Remember all those that improved the result. Remove duplicates from the remembered attribute combinations and add these to the list of solutions. Note that {m,n} is the same as {n,m}, so skip redundant combinations in order not to blow up your sets.
Repeat the second iteration until there are no more possible attributes that could be added to improve the final sum. If you then order the list of solutions by their sum, you get the requested solution.
Note that there are ~20G ways to select three attributes out of 5k, so you can't build a data structure containing those but you must absolutely generate them on demand. Still, the sheer amount can produce lots of temporary solutions, so you have to store those efficiently and perhaps even on disk. You can exploit the fact that you only need the previous iteration's solutions for the next iterations, not the ones before.
Another restriction here is that you can end up with less than N best solutions, because all those below L={} are not considered. In that case, I would accept all possible solutions until you have N solutions, and only once you have the N solutions discard those that don't give an improvement over the worst one.
Python code:
solutions = [frozenset()]          # the empty selection is always a candidate
remembered = [frozenset()]
while remembered:
    tmp = remembered
    remembered = []
    for s in tmp:                  # extend the previous iteration's solutions
        for xs in extensions(s):   # extensions(): all one-attribute supersets of s
            if score(xs) > score(s):
                remembered.append(xs)
    remembered = list(set(remembered))   # {m,n} equals {n,m}: drop duplicates
    solutions.extend(remembered)
Why this doesn't work:
Consider a temporary solution consisting of the three records
-2, T, F
-2, F, T
+3, F, F
The overall sum of these is -1. When I now select the first attribute, I discard the second and third record, giving a sum of -2. When selecting the second attribute, I discard the first and third, giving the same sum of -2. When selecting both the first and second attribute, I discard all three records, giving a sum of zero, which is an improvement.

Comparison Based Ranking Algorithm (Variation)

This question is a variation on a previous question:
Comparison-based ranking algorithm
The variation I would like to pose is: what if loops are resolved by discarding the earliest contradicting choices, so that a transitive algorithm can actually be used?
Here I have pasted the original question:
"I would like to rank or sort a collection of items (with size potentially greater than 100,000) where each item in the collection does not have an intrinsic (comparable) value, instead all I have is the comparisons between any two items which have been provided by users in a 'subjective' manner.
Example:
Consider a collection with elements [a, b, c, d]. And comparisons by users:
b > a, a > d, d > c
The correct order of this collection would be [b, a, d, c].
This example is simple however there could be more complicated cases:
Since the comparisons are subjective, a user could also say that c > b. In which case that would cause a conflict with the ordering above. Also you may not have comparisons that 'connects' all the items, ie:
b > a, d > c. In which case the ordering is ambiguous. It could be : [b, a, d, c] or [d, c, b, a]. In this case either ordering is acceptable.
...
The Question is:
Is there an algorithm which already exists that can solve the problem above, I would not like to spend effort trying to come up with one if that is the case. If there is no specific algorithm, is there perhaps certain types of algorithms or techniques which you can point me to?"
The simpler version where no "cycle" exists can be dealt with using topological sorting.
Now, for the more complex scenario, if for every "cycle" the order on which the elements appear in the final ranking does not matter, then you could try the following:
model the problem as a directed graph (i.e. the fact that a > b implies that there is an edge in the resulting graph starting in node "a" and ending in node "b").
calculate the strongly connected components (SCC) of the graph. In short, an SCC is a set of nodes with the property that you can get to any node in the set from any node in the set by following a list of edges (this corresponds to your "cycles" in the original problem).
transform the graph by "collapsing" each node into the SCC it belongs to, but preserve the edges that go between different SCCs.
it turns out the new graph obtained this way is a directed acyclic graph, so we can perform a topological sort on it.
Finally, we're ready. The topological sort tells you the right order in which to print nodes from different SCCs. For nodes within the same SCC, no matter what order you choose, there will always be "cycles", so a possibility might be printing them in a random order.
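A sketch of the whole pipeline using Kosaraju's SCC algorithm (recursive, so fine for small examples; use an iterative SCC routine for 100,000+ items):

from collections import defaultdict

def rank_with_cycles(nodes, greater_pairs):
    # greater_pairs: (x, y) means users said x > y; edge x -> y.
    graph = defaultdict(list)
    rev = defaultdict(list)
    for x, y in greater_pairs:
        graph[x].append(y)
        rev[y].append(x)

    # Pass 1: order nodes by DFS finish time on the original graph.
    visited, order = set(), []
    def dfs1(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                dfs1(v)
        order.append(u)
    for u in nodes:
        if u not in visited:
            dfs1(u)

    # Pass 2: explore the reverse graph in decreasing finish time; each
    # exploration yields one strongly connected component.
    comp = {}
    def dfs2(u, label):
        comp[u] = label
        for v in rev[u]:
            if v not in comp:
                dfs2(v, label)
    for u in reversed(order):
        if u not in comp:
            dfs2(u, u)

    # Decreasing finish time already topologically orders the condensed DAG,
    # so emit the SCCs in that order, grouping their members together.
    groups = defaultdict(list)
    for u in nodes:
        groups[comp[u]].append(u)
    seen, result = set(), []
    for u in reversed(order):
        if comp[u] not in seen:
            seen.add(comp[u])
            result.append(groups[comp[u]])
    return result   # order within each group is arbitrary (the "cycles")

print(rank_with_cycles(list('abcd'), [('b', 'a'), ('a', 'd'), ('d', 'c')]))
# [['b'], ['a'], ['d'], ['c']]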
