Find timestamp given K best candidates - algorithm

So I was asked a weird inversion of the K best candidates problem. The normal problem is as follows.
Given a list of 'votes' which are tuples of timestamps and candidates like below:
(111111, Clinton)
(111111, Bush)
...
Return the top K candidates with the most votes.
It's a typical problem. The solution is to build a hashmap of candidate -> vote count (restricted to votes within the timestamp bound) and a min-heap of size K, where the top of the heap is the candidate most vulnerable to being ejected from the K best.
In the end you return the heap.
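For reference, here's a minimal sketch of that forward direction in Python; the function name and exact interface are my own assumptions, not the interviewer's:

import heapq
from collections import Counter

def topKCandidates(votes, currTime, k):
    # Tally only the votes cast at or before currTime.
    tally = Counter(cand for ts, cand in votes if ts <= currTime)
    # nlargest maintains a size-k min-heap internally; ties on the
    # vote count fall back to comparing candidate names.
    best = heapq.nlargest(k, ((cnt, cand) for cand, cnt in tally.items()))
    return [cand for cnt, cand in best]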
But at the end I was asked: given a list of K candidates, return the timestamp at which these were the K best candidates. I'm not sure I'm recalling the question 100% correctly, because it would have to be either the first occurrence of these K candidates as the best, or I would have been given their vote tallies.

If I understand everything: votes is a list of vote tuples, each made up of a candidate being voted for and the timestamp of the vote taking place. currTime is a cutoff timestamp: only votes cast at that timestamp or before it count. topCandidates are the candidates with the highest vote counts at currTime.
Your first question gives you votes and currTime, and you are expected to return topCandidates. Your second question gives you votes and topCandidates, and you are expected to return currTime.
Focusing on the second question: I would make a map where the keys are timestamps and the values are all of the votes taking place at that moment, and another map where the key is a candidate and the value is the number of votes they have so far. I would traverse the first map in ascending timestamp order, get all of the votes cast at that timestamp, and increment the second map's counts by candidate (key). Before moving on to the next timestamp, I would build a list of the most-voted-for candidates from the data in the second map. If that list matches topCandidates, then the last timestamp you traversed is currTime.
To code this in Python:

from collections import Counter, defaultdict

def findCurrTime(votes, topCandidates):
    if not (votes and topCandidates):
        return -1
    votesAtTime = defaultdict(list)
    candidatePoll = Counter()
    k = len(topCandidates)
    for time, candidate in votes:  # votes = [(time0, candidate0), ...]
        votesAtTime[time].append(candidate)
    for ts in sorted(votesAtTime):  # ascending timestamp order
        candidatePoll += Counter(votesAtTime[ts])
        if [cand for cand, _ in candidatePoll.most_common(k)] == topCandidates:
            return ts
    # if topCandidates cannot be created from these votes:
    return -1
There are some assumptions I've made (which you hopefully asked your interviewer about). I assumed that the order of topCandidates matters, which Counter.most_common handles, although it won't handle ties between candidates with the same number of votes.
The time complexity is O(t * n * log(k)), with t being the number of timestamps, n the number of votes, and k the size of topCandidates. This is because Counter.most_common looks to be O(n * log(k)) and it can run t times. There are definitely more efficient answers, though.

Related

How do I find the right optimisation algorithm for my problem?

Disclaimer: I'm not a professional programmer or mathematician, and this is my first time encountering the field of optimisation problems. Now that that's out of the way, let's get to the problem at hand:
I've got several lists, each containing various items and a number called 'mandatoryAmount':
listA (mandatoryAmountA, itemA1, itemA2, itemA3, ...)
Each item has certain values (each value is a number >= 0):
itemA1 (M, E, P, C, Al, Ac, D, Ab, S)
I have to choose a certain number of items from each list determined by 'mandatoryAmount'.
Within each list I can choose every item multiple times.
Once I have all of the items from each list, I'll add up the values of each.
For example:
totalM = listA (itemA1 (M) + itemA1 (M) + itemA3 (M)) + listB (itemB1 (M) + itemB2 (M))
The goals are:
- To have certain values (totalAl, totalAc, totalAb, totalS) reach a certain cap while going over that cap as little as possible. Anything over the cap is wasted.
- To maximize the remaining values, each with a different weighting.
The output should be the best possible selection of items to meet the goals stated above. I imagine the evaluation function would just add up all non-wasted values times their respective weightings while subtracting all wasted stats times their respective weightings.
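To make that concrete, here's a rough sketch of the evaluation function I have in mind, in Python (the dict-based interface and names are just for illustration):

def evaluate(totals, caps, weights):
    # totals:  value name -> summed value over all chosen items
    # caps:    value name -> cap, for the capped values (Al, Ac, Ab, S)
    # weights: value name -> weighting for that value
    score = 0.0
    for name, total in totals.items():
        if name in caps:
            # anything over the cap is waste and is penalized
            score -= max(0.0, total - caps[name]) * weights.get(name, 1.0)
        else:
            # remaining values are maximized, each with its own weighting
            score += total * weights.get(name, 1.0)
    return score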
edit:
The total amount of items across all lists should be somewhere between 500 and 1000, the number of lists is around 10 and the mandatoryAmount for each list is between 0 and 14.
Here's some sample code that uses Python 3 and OR-Tools. Let's start by
defining the input representation and a random instance.
import collections
import random

Item = collections.namedtuple("Item", ["M", "E", "P", "C", "Al", "Ac", "D", "Ab", "S"])
List = collections.namedtuple("List", ["mandatoryAmount", "items"])

def RandomItem():
    return Item(
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
        random.random(),
    )

lists = [
    List(
        random.randrange(5, 10), [RandomItem() for j in range(random.randrange(5, 10))]
    )
    for i in range(random.randrange(5, 10))
]
Time to formulate the optimization as a mixed-integer program. Let's import
the solver library and initialize the solver object.
from ortools.linear_solver import pywraplp
solver = pywraplp.Solver.CreateSolver("SCIP")
Make constraints for the totals that must reach a certain cap.
AlCap = random.random()
totalAl = solver.Constraint(AlCap, solver.infinity())
AcCap = random.random()
totalAc = solver.Constraint(AcCap, solver.infinity())
AbCap = random.random()
totalAb = solver.Constraint(AbCap, solver.infinity())
SCap = random.random()
totalS = solver.Constraint(SCap, solver.infinity())
We want to maximize the other values subject to some weighting.
MWeight = random.random()
EWeight = random.random()
PWeight = random.random()
CWeight = random.random()
DWeight = random.random()
solver.Objective().SetMaximization()
Create variables and fill in the constraints. For each list there is an
equality constraint on the number of items.
associations = []
for list_ in lists:
    amount = solver.Constraint(list_.mandatoryAmount, list_.mandatoryAmount)
    for item in list_.items:
        x = solver.IntVar(0, solver.infinity(), "")
        amount.SetCoefficient(x, 1)
        totalAl.SetCoefficient(x, item.Al)
        totalAc.SetCoefficient(x, item.Ac)
        totalAb.SetCoefficient(x, item.Ab)
        totalS.SetCoefficient(x, item.S)
        solver.Objective().SetCoefficient(
            x,
            MWeight * item.M
            + EWeight * item.E
            + PWeight * item.P
            + CWeight * item.C
            + DWeight * item.D,
        )
        associations.append((item, x))
if solver.Solve() != solver.OPTIMAL:
    raise RuntimeError
solution = []
for item, x in associations:
    solution += [item] * round(x.solution_value())
print(solution)
I think David Eisenstat has the right idea with integer programming, but let's see if we can get some good solutions otherwise, and perhaps provide some initial optimization. However, I think that the fact that we can just choose all of one item in each list may make this easier to solve than it normally would be. Basically that turns it into more of a subset sum problem, especially with the cap.
There are two possibilities here:
There is no solution: no selection satisfies the requirements.
There is a solution that we need to optimize.
We really want to try to find a solution first; if we can find one (regardless of the amount of waste), that's already nice.
So let's reframe the problem: we aim to simply minimize waste while meeting the minimum requirements. In other words, let's try to put as much of the "waste" as possible into the values we actually need.
I'm going to propose an algorithm that should work "fairly well" and runs in polynomial time, though it could probably be optimized further. I'll be using K to mean mandatoryAmount, as it's a customary variable in this situation, N to mean the number of lists, and Z to represent the total number of items (across all lists).
1) Get the list of all items and sort them by the amount of each value they have (first the goal values, then the bonus values). If an item has 100A, 300C, 200B, 400D, 150E and the required goals are [B, D], the sort key would look like [400, 200, 300, 150, 100]. Repeat, but for one goal value at a time: using the same example, we would have [400, 300, 150, 100] for goal D and [200, 300, 150, 100] for goal B. Create a boolean flag for optimization mode (we start by seeking any solution; once we find one, we try to optimize it). Create a counter/hash for unassigned items. An item cannot be unassigned more than K times (to avoid infinite loops). This isn't strictly needed, but it works as an optimization for step 5, as it prioritizes goals you actually need.
2) For each list, keep a counter of the number of assignable slots, set each to K, as well as the number of total assignable slots, set to K * N. These will be adjusted as needed along the way. You want O(1) lookups for: a) which list a (sorted) item belongs to, b) how many available slots that item's list has, c) how many times the item has been unassigned, and d) where the item sits in the sorted list.
3) General assignment. While there are slots available (total slots), go through the sorted list from highest to lowest. If the list for that item has slots available, assign as many slots as possible to that item. Update the assignable and total slots. If the result is a valid solution, record it and trip the optimization-mode flag. If slots remain unassigned, revert the previous unassignment (but do not change the unassignment count).
4) Waste optimization. Find the most wasteful item that can still be unassigned (unassigned count < K) and unassign one slot of it. If in optimization mode, do not allow any of the goal values to go below their cap (skip the item if it would). Update the unassigned count for the item. Go to step 3, but start just after the wasteful item. If no assignment is made, reassign this item until its list has no remaining assignments, but do not update the unassigned count (otherwise we might end up in an invalid state).
5) Goal value optimization. Skip this step if the current state is a valid solution. Find the value furthest from its goal (i.e. A/B/C/D/E above) that can be unassigned, and unassign one slot for that item. Update the unassigned count. Go to step 3, beginning the search at the start of the list (unlike step 4), and stop searching the list once you go below the value of this item (not the item itself, as others may have the same value). If no assignment is made, reassign this item until its list has no remaining assignments, but do not update the unassigned count (otherwise we might end up in an invalid state).
6) No assignments remain. Return the current state as the "best solution found".
The algorithm should end with the "best" solution this approach can come up with. Increasing the max unassignment count may improve the solution; decreasing it will speed up the algorithm. The algorithm runs until it has maxed out its unassignment counts.
This is a bit of a greedy algorithm, so I'm not sure it's optimal (in the sense that it will always yield the best result), but it may give you some ideas as to how to approach it. It also feels like it should yield fairly good results, as it is basically trying to bound the results. Performance is something like O(Z^2 * K): each item is unassigned at most K times, and each unassignment can require O(Z) checks before the item is reassigned.
As an optimization, store the sorted lists in a data structure with O(log N) or better delete/next operations. That would make it practical to delete items from the assignment lists once their unassignment count reaches K (rendering them no longer assignable), giving O(Z * log(Z) * K) performance instead.
Edit:
Hmmm, the above only works within a single list (IE: Item removed can only be added to it's own list, as only that list has room). To avoid this, do step 4 (remove too heavy) then step 5 (remove too light) and then goto step 3 (using step 5's rules for searching, but also disallow adding back the too heavy ones).
So basically we remove the heaviest one then the lightest one then we try to assign something that is as heavy as possible to make up for the lightest one we removed.

Student Council Election

Student council elections work in an odd manner. Each candidate is assigned a unique identification number. The University is divided into five zones, and each zone proposes a list of candidates that it would like to nominate to the Council. Any candidate who is proposed by three or more zones is elected. There is no lower or upper limit on the size of the Council. Design an algorithm that takes the proposed lists of candidates from all five zones as input (in sorted order) and calculates how many candidates are elected to the Council. Illustrate your algorithm on the following example:
Suppose the candidates proposed by the five zones are:
Zone 1: [5,12,15,62,87]
Zone 2: [7,14,48,62,87,92]
Zone 3: [5,12,14,87]
Zone 4: [12,17,49,52,92,98]
Zone 5: [5,12,14,87,92]
I think the hint here is sorted order, but I couldn't find any way to approach this problem. If anyone comes up with a solution, please post it. Thank you.
I have a simple idea.
Initialize a HashMap(key, value), with the key representing the candidate id and the value representing the number of zones that proposed that candidate.
Loop over each element of each zone.
If the element has not yet appeared, insert a new (key, value) pair with value 1; otherwise increase the value by 1.
Finally, scan the hashmap: every key with a value equal to or greater than 3 is elected.
So, you can follow my Pseudocode
Hashmap map = new HashMap<int, int>()
count = 0
ForEach z in ZoneLine
    ForEach e in z
        If not map.containsKey(e)
            map[e] <- 1
        Else
            map[e] <- map[e] + 1
ForEach key in map
    If map[key] >= 3
        count <- count + 1
return count
Hope this helps.
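Here's a runnable Python version of the same idea (a minimal sketch, assuming each zone proposes a candidate at most once). On the example above it elects [5, 12, 14, 87, 92], i.e. 5 candidates:

from collections import Counter

def elected_candidates(zones):
    tally = Counter()
    for zone in zones:
        tally.update(zone)  # candidate ids are unique within a zone
    return sorted(cand for cand, count in tally.items() if count >= 3)

zones = [
    [5, 12, 15, 62, 87],
    [7, 14, 48, 62, 87, 92],
    [5, 12, 14, 87],
    [12, 17, 49, 52, 92, 98],
    [5, 12, 14, 87, 92],
]
winners = elected_candidates(zones)
print(winners, len(winners))  # [5, 12, 14, 87, 92] 5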

Algorithm for grouping train trips

Imagine you have a full calendar year in front of you. On some days you take the train, potentially even a few times in a single day, and each trip could be to a different location (i.e. the amount you pay for the ticket can be different for each trip).
So you would have data that looked like this:
Date: 2018-01-01, Amount: $5
Date: 2018-01-01, Amount: $6
Date: 2018-01-04, Amount: $2
Date: 2018-01-06, Amount: $4
...
Now you have to group this data into buckets. A bucket can span up to 31 consecutive days (no gaps) and cannot overlap another bucket.
If a bucket has fewer than 32 train trips it will be blue. If it has 32 or more train trips in it, it will be red. Each bucket also gets a value based on the sum of its ticket costs.
After you group all the trips the blue buckets get thrown out. And the value of all the red buckets gets summed up, we will call this the prize.
The goal, is to get the highest value for the prize.
This is the problem I have. I can't think of a good algorithm to do this. If anyone knows a good way to approach it, I would like to hear it. Or if you know of anywhere else that can help with designing algorithms like this.
This can be solved by dynamic programming.
First, sort the records by date, and consider them in that order.
Let day (1), day (2), ..., day (n) be the days where the tickets were bought.
Let cost (1), cost (2), ..., cost (n) be the respective ticket costs.
Let fun (k) be the best prize if we consider only the first k records.
Our dynamic programming solution will calculate fun (0), fun (1), fun (2), ..., fun (n-1), fun (n), using the previous values to calculate the next one.
Base:
fun (0) = 0.
Transition:
What is the optimal solution, fun (k), if we consider only the first k records?
There are two possibilities: either the k-th record is dropped, then the solution is the same as fun (k-1), or the k-th record is the last record of a bucket.
Let us then consider all possible buckets ending with the k-th record in a loop, as explained below.
Look at records k, k-1, k-2, ..., down to the very first record.
Let the current index be i.
If the records from i to k span more than 31 consecutive days, break from the loop.
Otherwise, if the number of records, k-i+1, is at least 32, we can solve the subproblem fun (i-1) and then add the records from i to k, getting a prize of cost (i) + cost (i+1) + ... + cost (k).
The value fun (k) is the maximum of these possibilities, along with the possibility to drop the k-th record.
Answer: it is just fun (n), the case where we considered all the records.
In pseudocode:
fun[0] = 0
for k = 1, 2, ..., n:
    fun[k] = fun[k-1]
    cost_i_to_k = 0
    for i = k, k-1, ..., 1:
        if day[k] - day[i] > 30:  # records i..k would span more than 31 consecutive days
            break
        cost_i_to_k += cost[i]
        if k-i+1 >= 32:
            fun[k] = max (fun[k], fun[i-1] + cost_i_to_k)
return fun[n]
It is not clear whether we are allowed to split records on a single day into different buckets.
If the answer is no, we will have to enforce it by not considering buckets starting or ending between records in a single day.
Technically, it can be done by a couple of if statements.
Another way is to consider days instead of records: instead of tickets which have day and cost, we will work with days.
Each day will have cost, the total cost of tickets on that day, and quantity, the number of tickets.
Edit: as per the comments, we indeed cannot split any single day.
Then, after some preprocessing to get days records instead of tickets records, we can go as follows, in pseudocode:
fun[0] = 0
for k = 1, 2, ..., n:
    fun[k] = fun[k-1]
    cost_i_to_k = 0
    quantity_i_to_k = 0
    for i = k, k-1, ..., 1:
        if k-i+1 > 31:
            break
        cost_i_to_k += cost[i]
        quantity_i_to_k += quantity[i]
        if quantity_i_to_k >= 32:
            fun[k] = max (fun[k], fun[i-1] + cost_i_to_k)
return fun[n]
Here, i and k are numbers of days.
Note that we consider all possible days in the range: if there are no tickets for a particular day, we just use zeroes as its cost and quantity values.
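For concreteness, here's a direct Python translation of the day-based pseudocode above, including the preprocessing step; the input encoding as (day_index, cost) pairs is an assumption for illustration:

from collections import defaultdict

def max_prize(trips):
    # trips: list of (day_index, cost), where day_index numbers the days
    # of the year so that consecutive days differ by 1
    day_cost = defaultdict(int)
    day_qty = defaultdict(int)
    for day, c in trips:
        day_cost[day] += c
        day_qty[day] += 1
    first, last = min(day_cost), max(day_cost)
    n = last - first + 1
    cost = [day_cost.get(first + d, 0) for d in range(n)]      # zeroes for empty days
    quantity = [day_qty.get(first + d, 0) for d in range(n)]
    fun = [0] * (n + 1)  # fun[k]: best prize over the first k days
    for k in range(1, n + 1):
        fun[k] = fun[k - 1]  # option: no bucket ends on day k
        c = q = 0
        for i in range(k, max(k - 31, 0), -1):  # buckets of at most 31 days
            c += cost[i - 1]
            q += quantity[i - 1]
            if q >= 32:  # 32 or more trips: a red bucket
                fun[k] = max(fun[k], fun[i - 1] + c)
    return fun[n]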
Edit2:
The above allows us to calculate the maximum total prize, but what about the actual configuration of buckets which gets us there?
The general method will be backtracking: at position k, we will want to know how we got fun (k), and transition to either k-1 if the optimal way was to skip k-th record, or from k to i-1 for such i that the equation fun[k] = fun[i-1] + cost_i_to_k holds.
We proceed until i goes down to zero.
One of the two usual implementation approaches is to store par (k), a "parent", along with fun (k), which encodes how exactly we got the maximum.
Say, if par (k) = -1, the optimal solution skips k-th record.
Otherwise, we store the optimal index i in par (k), so that the optimal solution takes a bucket of records i to k inclusive.
The other approach is to store nothing extra.
Rather, we run a slight modification code which calculates fun (k).
But instead of assigning things to fun (k), we compare the right part of the assignment to the final value fun (k) we already got.
As soon as they are equal, we found the right transition.
In pseudocode, using the second approach, and days instead of individual records:
k = n
while k > 0:
    k = prev (k)

function prev (k):
    if fun[k] == fun[k-1]:
        return k-1
    cost_i_to_k = 0
    quantity_i_to_k = 0
    for i = k, k-1, ..., 1:
        if k-i+1 > 31:
            break
        cost_i_to_k += cost[i]
        quantity_i_to_k += quantity[i]
        if quantity_i_to_k >= 32:
            if fun[k] == fun[i-1] + cost_i_to_k:
                writeln ("bucket from $ to $: cost $, quantity $",
                         i, k, cost_i_to_k, quantity_i_to_k)
                return i-1
    assert (false, "can't happen")
Simplify the challenge, but not too much, to get a small example that can be solved by hand.
That helps a lot in finding the right questions.
For example, take only 10 days and buckets with a maximum length of 3 days. For building buckets and colorizing them, we only need the ticket count per day, here 0, 1, 2 or 3.
On average, a red bucket needs more than one ticket per day; for example 2-0-2 is 4 tickets in 3 days. Or 1-1-3, 1-3, 1-3-1, 3-1-2, 1-2.
But we can only choose 2 red buckets: 2-0-2 and (1-1-3 or 1-3-1 or 3-1-2), since 1-2 at the end is only 3 tickets, but we need at least 4 (one more ticket than the maximum day span per bucket).
And while 3-1-2 is obviously more tickets than 1-1-3, the value of fewer tickets might be higher.
The blue-colored area is the less interesting one, because it doesn't feed itself by ticket count.

Divide a group of people into two disjoint subgroups (of arbitrary size) and find some values

As we know from programming, sometimes a slight change in a problem can significantly alter the form of its solution.
Firstly, I want to create a simple algorithm for solving the following problem and classify it using big-Theta notation:
Divide a group of people into two disjoint subgroups (of arbitrary size) such that the difference in the total ages of the members of the two subgroups is as large as possible.
Now I need to change the problem so that the desired difference is as small as possible, and classify my approach to the problem.
Well, first of all I need to create the initial algorithm. For that, should I do some kind of sorting in order to separate the teams, and how am I supposed to continue?
EDIT: for the first problem, we have ruled out the possibility of a set being empty. So all we have to do is a linear search to find the minimum age and put it in set B. Set A then has all the other ages except the minimum age held by set B. That gives the maximum possible difference between the total ages of the two sets.
The way you described the first problem, it is trivial: it only requires you to find the minimum element (in case each subgroup must contain at least 1 member); otherwise it is already solved.
The second problem can be solved recursively; the pseudocode would be:

// compute the sum of all elements of the array and store it in sum
min = sum;
globalVec = baseVec;

fun generate(baseVec, generatedVec, position, total)
    if (abs(sum - 2*total) < min) {  // check if this distribution is better
        min = abs(sum - 2*total);
        globalVec = generatedVec;
    }
    if (position >= baseVec.length()) return;
    else {
        // either put the element at position in the first group:
        generate(baseVec, generatedVec.pushback(baseVec[position]), position + 1, total + baseVec[position]);
        // or put the element at position in the second group:
        generate(baseVec, generatedVec, position + 1, total);
    }

And now just start the function with generate(baseVec, [], 0, 0), where [] stands for an empty vector.
The algorithm can be drastically improved by applying it to a sorted array and adding a test condition to stop branching early, but the idea stays the same.
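For concreteness, here's a compact Python translation of that recursion, tracking only the best difference. Checking at the leaves is enough, since checking a partial assignment is equivalent to sending all remaining elements to the second group:

def min_difference(ages):
    total_sum = sum(ages)
    best = total_sum  # worst case: everyone in one group

    def generate(position, total):
        # total is the sum of the first group so far
        nonlocal best
        if position == len(ages):
            best = min(best, abs(total_sum - 2 * total))
            return
        generate(position + 1, total + ages[position])  # first group
        generate(position + 1, total)                   # second group

    generate(0, 0)
    return best

print(min_difference([25, 30, 45, 18]))  # 8: {25, 30} vs {45, 18}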

Generating a unique ID with O(1) space?

We have a group of objects; let's call them Players. We can traverse this group only in random order, e.g. there is no such thing as Players[0].
Each Player has a unique ID, with ID < len(Players). Players can be added to and removed from the group. When a Player is removed, it frees its ID; when a Player is added, it acquires an ID.
If we want to add a new Player to Players, we have to generate a new unique ID. What is the fastest way to generate such an ID in O(1) space?
O(n log n) is possible with binary search. Start with a = 0 and b = n. The invariant is that there exists a free id in the interval [a, b). Repeat the following until b - a = 1: let m = a + floor((b - a) / 2), count the number of ids in [a, m) and in [m, b). If [a, m) has fewer than m - a ids, then set b = m. Otherwise, set a = m.
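In Python, that search might look like the following sketch, assuming each player exposes an integer id attribute:

def find_free_id(players, n):
    # Invariant: the interval [a, b) contains at least one unused id
    # (true initially whenever fewer than n ids are in use).
    a, b = 0, n
    while b - a > 1:
        m = a + (b - a) // 2
        used_left = sum(1 for p in players if a <= p.id < m)
        if used_left < m - a:  # [a, m) is not full, so a free id is there
            b = m
        else:                  # [a, m) is full, so the free id is in [m, b)
            a = m
    return a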
I think you can use a queue to hold the IDs that have been freed up. Once you have used up the highest possible ID, dequeue the queue to get a free ID. Dequeuing takes O(1) time.
int highestIndex = 0;
Adding Players
if (highestIndex < len(Players) - 1) {
    ID = ++highestIndex;
} else if (!queue.isEmpty()) {
    ID = queue.dequeue();
} else {
    // max players reached
}
Removing Players
queue.enqueue(ID);
Keep a boolean array (true meaning the ID is taken). Construct a binary tree over this array, such that the leaves are the values in the array and the parent of items i, i+1 is their logical AND (so a parent is 0 when at least one of its children is 0, i.e. its subtree contains a free slot). When you want to insert, traverse the tree from the root down to find the first empty slot (keep descending into a child that is 0). This gives the first empty slot in O(log(n)). You can get O(log(log(n))) if you take each sqrt(n)-sized group of bits and form an AND parent.
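Here's a small Python sketch of that tree, assuming the maximum number of IDs n is a power of two (the class and method names are mine):

class FreeIdTree:
    # tree[v] is True when every slot in v's subtree is taken;
    # leaves n..2n-1 mirror the boolean array for ids 0..n-1.
    def __init__(self, n):
        self.n = n
        self.tree = [False] * (2 * n)

    def acquire(self):
        if self.tree[1]:
            return -1  # every id is taken
        v = 1
        while v < self.n:  # descend into a child that still has a free slot
            v = 2 * v if not self.tree[2 * v] else 2 * v + 1
        self._set(v, True)
        return v - self.n

    def release(self, id_):
        self._set(id_ + self.n, False)

    def _set(self, v, value):
        self.tree[v] = value
        v //= 2
        while v >= 1:  # recompute the ANDs on the path to the root
            self.tree[v] = self.tree[2 * v] and self.tree[2 * v + 1]
            v //= 2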
Based on the question as first posed, with a fixed maximum number of Players:
1) Technically the size of Players is O(1). Build a boolean array of 1000 slots, one per player, with TRUE meaning "ID is assigned". When a player dies, set the bit for his ID to false. When a new player arrives, search the bit array for a "false" bit; assign that ID to the player and set the bit.
Time is O(1) too, with a big constant.
Based on the question as revised, with an arbitrary number N of players:
2) Expanding Holzer's idea: keep a small fixed-size array of size k << N as a cache of free IDs. Use it the way TMJ described. [TMJ deleted his answer; it said, in effect, "keep a stack of unused IDs, pop an unused one, push newly dead ones".] If the cache is empty when a new ID is needed, apply Holzer's scheme (one could even refill the small array while executing Holzer's scheme). [Sheesh, Holzer deleted his answer too; it said "try each ID in order and search the set; if nobody has that ID, use it" (O(N^2)).] If the number of players arrives at more or less a steady state, this would be pretty fast, because statistically there would always be some values in the fixed-size array.
You can combine TMJ's idea with Per's idea, but you can't refill the array during Per's scan, only with dead player IDs.
You could put the players in a (cyclic) linked list. Deleting a player cuts it out of the chain and inserts it into another list (the "free" list). Allocating a player cuts (a random) one out of the "free" list and inserts it into the "active" list.
UPDATE:
Since the array is fixed, you can use a watermark separating the allocated players from the free ones:
Initially: watermark = 0
Free:      swap [this] <--> [watermark - 1]; decrement watermark
Allocate:  increment watermark; yield watermark - 1
Voila!
Your question is ill-formed. The immediate answer is:
ID(newPlayer) = 1000
(You stated no requirement that the new player's ID has to be less than 1000.)
More seriously, since O(1000) == O(1), you can create an array id_seen[1000], mark all the IDs you've seen so far in it, then select one you have not seen.
To make your question interesting, you have to formulate it carefully, e.g. "there are N players with IDs < K. You can only traverse the collection in unknown order. Add a new player with ID < K, using O(1) space."
One (inefficient) answer: select a random number X < K and traverse the collection. If you see a player with ID == X, restart; if you don't, use X as the new ID.
Evaluating efficiency of this algorithm for a given N and K is left as an exercise to the reader ;-)
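For what it's worth, that randomized approach is only a few lines in Python (again assuming each player exposes an integer id):

import random

def random_free_id(players, k):
    # O(1) space: one candidate id at a time, re-scanning the whole
    # collection after every collision.
    while True:
        x = random.randrange(k)
        if all(p.id != x for p in players):
            return x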
