A greedy algorithm for k-limited resources

I am studying greedy algorithms and I am wondering about the solution to a different case.
For the interval selection problem we want to pick the maximum number of activities that do not clash with each other, so repeatedly selecting the job with the earliest finishing time works.
Another example: we are given n jobs and we want to buy as few resources as possible. Here, we can sort all the job endpoints from left to right; when we encounter a start point we increment a counter, and when we encounter an end point we decrement it. The largest value this counter reaches is the number of resources we need to buy.
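The counter sweep described above can be sketched in Python as follows. This is only an illustration: the function name min_resources and the (start, end) tuple format are assumptions, and so is the tie rule that a job ending at time t frees its resource for a job starting at the same time t.

```python
def min_resources(jobs):
    """Return the peak number of jobs running at the same time.

    jobs is a list of (start, end) pairs. Assumption: a job that ends at
    time t releases its resource before a job that starts at time t.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))   # a job begins: one more resource in use
        events.append((end, -1))    # a job ends: one resource is released
    events.sort()                   # at equal times, -1 sorts before +1

    in_use = peak = 0
    for _, delta in events:
        in_use += delta
        peak = max(peak, in_use)
    return peak
```

Sorting dominates the cost, so the sweep runs in O(n log n).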
But what if, for example, we have n tasks but only k resources? What if we cannot afford more than k resources? What would a greedy solution look like that removes as few tasks as possible to satisfy this?
Also if there is a specific name for the last problem I wrote, I would be happy to hear that.

This looks like a generalization of the version where we have only one resource.
Intuitively, it makes sense to still sort the jobs by end time and take them one by one in that order. Now, instead of the ending time of the last job, we keep track of the ending time of the last job accepted on each of our k resources. For each job, we check whether its starting time is later than the ending time of the last job on any one of our resources. If no such resource is found, we skip that job and move ahead. If exactly one resource is found, we assign the job to that resource and update its ending time. If more than one resource is able to take on the job, it makes sense to assign it to the resource with the latest end time.
I don't really have a proof of this greedy strategy, so it may well be wrong. But I cannot think of a case where changing the choice might enable us to fit more jobs.
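A minimal sketch of the strategy described above, in Python. The function name max_jobs_k_resources and the (start, end) tuple format are assumptions made for illustration. A job is also allowed to start exactly when the previous job on that resource ends, which is an assumption too; use bisect_left instead of bisect_right if the start must be strictly later.

```python
from bisect import bisect_right, insort

def max_jobs_k_resources(jobs, k):
    """Greedily accept jobs, sorted by end time, onto k resources.

    jobs is a list of (start, end) pairs. Returns the accepted jobs.
    """
    # Sorted list holding the end time of the last job on each resource.
    last_ends = [float("-inf")] * k
    accepted = []
    for start, end in sorted(jobs, key=lambda job: job[1]):
        # Number of resources that are free by the time this job starts.
        free = bisect_right(last_ends, start)
        if free == 0:
            continue                    # no resource can take it: skip the job
        last_ends.pop(free - 1)         # free resource with the latest end time
        insort(last_ends, end)          # that resource is now busy until `end`
        accepted.append((start, end))
    return accepted
```

With jobs [(0, 2), (1, 3), (2, 4), (3, 5)] this accepts two jobs for k = 1 and all four for k = 2.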


Map Reduce Job to find the popular items in a time window

I was asked this question in an interview, and I'm not sure if I gave the proper answer, so I would like some insights.
The problem: There is a stream of users and items. Each minute, I receive a list of tuples (user, item), each representing that a user u consumed item i. I need to find the top 100 popular items in the past hour, i.e., calculate how many users consumed each item and sort them. The trick here is that, within the past hour, if an item is consumed by the same user more than once, only one consumption is counted. The interviewer said that I should think big and that there would be millions of consumptions per hour. So he suggested that I use a map-reduce job or something else that can deal with this large amount of data per minute.
The solution I came up with: I said that I could maintain a list (or a matrix if you prefer) of the consumed user-item-timestamp tuples, as if there was a time-window shifting. Something like:
u2,i2,t2... and so on.
At each minute, when I receive the stream of user-items consumption for this minute, I first make a map-reduce job to update the time-window matrix, with the current timestamp. This map-reduce job could be done by two mappers (one for the stream and the other for the time-window list), and the reducer would simply get the maximum for each pair. A pseudo-code for what I did:
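# mapper over the stored time-window list
mapperWindow(line):
    user, item, timestamp = line.split(" ")
    context.write(key=(user,item), value=timestamp)
# mapper over the stream received this minute
mapperStream(line):
    user, item = line.split(" ")
    context.write(key=(user,item), value=now())
reducer(key, list):
    context.write(key=key, value=max(list))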
Next, I also do a map-reduce to calculate the popularity by counting the number of times each item appears in that list. My mapper reads from the updated time-window list and writes the item and 1. The reducer calculates the sum of the list for each item. Since I am storing all the timestamps, I check whether each consumption is in the past hour or not. Another map-reduce pseudo-code:
mapperPopularity(line):
    user, item, timestamp = line.split(" ")
    if now()-60 <= timestamp: # only consumptions from the past hour
        context.write(key=item, value=1) # no repetition
reducerPopularity(key, list):
    context.write(key=key, value=sum(list))
Later we can do another map-reduce to read from the result of the second job and calculate the top100 largest items. Something done like this.
My question: is this solution acceptable for the interview I had? It uses three map-reduce jobs to solve the problem. However, that seems like quite a lot to execute every minute. Since the result needs to be updated every minute, the whole run cannot take longer than that. I put quite a lot of effort into it, but the interviewer didn't give me any feedback on whether it is right or not. I would like to know: is it possible to make it faster? Or is it possible to deal with this in another way (maybe not map-reduce)?
Whether your solution is acceptable or not is ultimately a matter of opinion. The interviewer could appreciate your algorithm, or perhaps your problem-solving process and your thinking. Only your interviewer can ultimately tell. Your solution certainly follows a logic and does the job, provided the algorithm you wrote is implemented completely and correctly.
My solution:
As you explained, the main concern is performance, since we have big data, so we shall reduce space complexity, time complexity and number of executions by keeping it to the least amount necessary.
Space complexity
I would keep one list of [user, timestamp] per item (or a more performant collection, depending on the libraries you use, but I will keep it basic here; see the dict note at the end). Every new item has its own list. This is better than one overall [user, timestamp, item] list, which uses more memory because of the extra field and requires an additional map or filter operation, since you have to go through all existing associations to extract those for a given item. Instead, you can get the list for an item "by hash" or by reference in the code. This model is the minimalistic one.
Time complexity
That said, there are two operations: the purge and the popularity extraction. Because timestamps must be checked every time you calculate the current popularity, you have to scan the list, which costs O(n).
Therefore: filter by (current time - timestamp) < 60, the way you did. This purges expired associations. Then simply take len(list_of_that_item), which is O(1). Done.
Since the linear search cost is paid by the filtering, a reduce operation would pay a similar cost if you want to count the non expired entries without purging. If and only if deleting from the list has a bigger overhead, you may want to benchmark a non-deleting algorithm that keeps associations "forever" and you manually schedule purging operations. Although the previous solution should perform better, it is worth mentioning for completeness.
If you use dicts it's trivial (and more performant too). Updating the timestamp or inserting if not present are the same code: strawberry["Mike"]=timestamp. Moreover the overall associations set is a dict with key=item and value=per_item_dict and per_item_dict has key=user value=timestamp. Therefore data[strawberry]["Mike"]=timestamp
Edit: adding some more code
data[strawberry] = {k: v for k, v in data[strawberry].items() if your_time_condition_expression}
Popularity check
After purge: len(data[strawberry])
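Putting the dict-of-dicts model, the purge and the popularity check together, a minimal single-process sketch could look like this. The class name PopularityWindow, its method names, and the use of minutes as the timestamp unit are assumptions for illustration; it also ignores the distribution concerns raised in the question.

```python
import heapq
from collections import defaultdict

class PopularityWindow:
    """Track, per item, the latest consumption timestamp of each user."""

    def __init__(self, window=60):
        self.window = window              # window length, assumed in minutes
        self.data = defaultdict(dict)     # data[item][user] = timestamp

    def record(self, user, item, timestamp):
        # Insert and update are the same operation, so repeats count once.
        self.data[item][user] = timestamp

    def purge(self, now):
        # Drop associations older than the window; drop items left empty.
        for item in list(self.data):
            kept = {u: t for u, t in self.data[item].items()
                    if now - t < self.window}
            if kept:
                self.data[item] = kept
            else:
                del self.data[item]

    def top(self, now, n=100):
        # After the purge, an item's popularity is just len(data[item]).
        self.purge(now)
        return heapq.nlargest(n, self.data, key=lambda i: len(self.data[i]))
```

heapq.nlargest avoids fully sorting all items when only the top 100 are needed.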

Coding: Keep track of last N days of records for each user.

I'm solving an interesting problem in which, for each user, I would like to keep his last N days of activity. This can be applied to many use cases; one simple example is:
For each user - user can come to gym some random day - I want to get the total number of times he hit the gym over the last 90 days.
This is a tricky one for me.
My thoughts: I thought of storing a vector where each entry would determine a day and then a boolean value might represent his visit. To count, just linear processing of that section in the array would suffice.
What is the best way?
Depending on how complex you need it to be, a simple array that stores each of a client's visits should suffice.
Upon each visit, add a new entry containing the date/time. Each day, run a check to see if any clients contain visit records that are older than 90 days. The first record that is not old enough means there are no more records to check, so you can safely move to the next client.
Hope this helps you!
For every client, create a queue whose elements are visit dates.
When a client visits the gym, just add the current date.
When you need to get the count for him:
while (not Q[ClientIdx].Empty) and (Today - Q[ClientIdx].Peek > 90)
    Q[ClientIdx].Remove //dequeue too old records
VisitCount = Q[ClientIdx].Count
You can use standard Queue implementations in many languages or simple own implementation based on array/list if standard one is not available.
Note that every record is added and removed once, so amortized complexity is O(1) per add/count operation
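The queue approach above might be written in Python with collections.deque. This is a sketch under assumptions: the class name VisitLog and its methods are made up for illustration, and visits are assumed to be added in chronological order, so the oldest visit is always at the front.

```python
from collections import defaultdict, deque
from datetime import date, timedelta

class VisitLog:
    """One queue of visit dates per client, oldest visit first."""

    def __init__(self, days=90):
        self.window = timedelta(days=days)
        self.queues = defaultdict(deque)

    def add_visit(self, client, day):
        # Assumes visits arrive in chronological order.
        self.queues[client].append(day)

    def visit_count(self, client, today):
        queue = self.queues[client]
        # Dequeue records that are older than the window.
        while queue and today - queue[0] > self.window:
            queue.popleft()
        return len(queue)
```

For example, after visits on 2024-01-01, 2024-03-01 and 2024-04-01, calling visit_count for 2024-04-15 drops the January visit and returns 2.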
Your idea will work, but is it really space-wise efficient?
Your data-structure would be something like this: A boolean 2D vector (you can imagine it as a matrix), where every row is a user and every column is a day (sorted), so that would consist of a:
matrix of size U x N
where U is the number of users.
To answer the question I initially asked, you need to think about how dense this matrix is going to be. If it is going to be dense, then you made the right choice; if not, then you wasted (much) space. You can see the trade-off here.
Of course, you have to think about your use case. In the gym example, I do not think this would be space efficient, since most people do not go to the gym every day (I think), which will result in a sparse matrix, meaning that we wasted space.
Another idea would be to have a single vector of size N, where the days are sorted. Every entry would be a singly linked list, where every node would be a user.
If a user is found in the list of a day, then it means that he went to gym at that day.
With this approach we allocate exactly as much space as needed, so it's space optimal, regardless of the density I mentioned in the matrix's case.
However, is this it? No, of course not! I discussed space, but what about time efficiency? For example, search is usually a frequent operation that we want our data structure to support, and we would like it to be fast!
In the matrix's case, search would be an O(1) operation, which is sweet, since accessing the matrix is a constant operation.
In the vector+list's case however, the search would take O(L), where L is the average size of the lists our vector has in total.
So which one? It depends on your application!
I would try a hashtable as well, which would not require sorting and is space efficient (What is the space complexity of a hash table?).
How about having a queue with a fixed size (90) of visit details for every user? You can generalise it to multiple users, and the key advantage is that you don't have to worry about maintaining the last 90 days of data.
You can dump a queue to a list or array and persist it if needed in O(n). And, as you have mentioned, counting the number of visits will be O(n) as well.

Job scheduling algorithm with deadline and execution time

Given an array of jobs where every job has a deadline(d_i > 0) and associated execution time (e_i > 0), i.e.
we have been given an array of (d_i, e_i). Can we find an arrangement of jobs such that all of them can be scheduled? There may be more than one possible answer; any one will suffice.
e.g. {(3,1),(3,2),(7,3)} {J1,J2,J3}
Answer could be one of them {J1,J2,J3} or {J2,J1,J3}
We can solve this problem using backtracking, but the running time will be very high. Can we solve this problem using a greedy or any other approach? Please also justify its correctness.
At most one job can be run at a time.
Hint: After you have scheduled k initial jobs successfully, it is possible to find a satisfying full schedule only if there is a next job whose execution time added to the current time after the k previous jobs is less than or equal to the deadline time for the next job. Can you see why always choosing the next job with the earliest deadline at each step of choosing a job will determine whether there is or is not a solution, and if there is, will give a precise solution? Let me know if you'd like more details about how to prove that, but hopefully you can see it on your own now that I've pointed out what the correct greedy solution is.
UPDATE: Further hint: Assume that you have a satisfying assignment where two consecutive jobs are out of order according to their deadlines (if the overall ordering is out of order with respect to deadlines, such a pair must exist). Since the job with the earlier deadline runs second and still meets its deadline, both jobs finish before the earlier of the two deadlines. Now swap them: the earlier-deadline job finishes sooner than it did before, and the later-deadline job finishes at the time the pair used to finish, which is no later than the earlier deadline and hence no later than its own. No other job is affected, so the swapped schedule is still a satisfying assignment.
Thus, if a satisfying assignment exists, then there is another one that exists where jobs are ordered according to their deadlines. I.e., the greedy strategy will always find a satisfying assignment if one exists -- otherwise, there is no solution.
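The earliest-deadline-first check from the hints can be sketched in a few lines of Python. The function name schedule_by_deadline and the dict input format (job name mapped to a (deadline, execution time) pair) are assumptions for illustration; all jobs are assumed to be available from time 0.

```python
def schedule_by_deadline(jobs):
    """Return a feasible job order, or None if no feasible order exists.

    jobs maps a job name to a (deadline, execution_time) pair.
    """
    # Greedy choice: always run the job with the earliest deadline next.
    order = sorted(jobs, key=lambda name: jobs[name][0])
    clock = 0
    for name in order:
        deadline, execution_time = jobs[name]
        clock += execution_time          # time at which this job finishes
        if clock > deadline:
            return None                  # even the deadline order fails
    return order
```

For the example in the question, {"J1": (3, 1), "J2": (3, 2), "J3": (7, 3)}, the jobs finish at times 1, 3 and 6, so the order J1, J2, J3 is returned.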
An O(n log n) greedy approach based on a heap data structure.
The input is an array of jobs:
struct Job {
    char id;
    int deadLine;
    int profit;
};
Algorithm pseudo code:
1. Sort the input jobArray in non-decreasing order of deadLine.
2. Create a maxHeap (it will consist of jobs). The basis of comparison is profit.
3. Let n = length of jobArray
   initialize index = n-1
   initialize time = jobArray[n-1].deadLine
4. while index >= 0 && jobArray[index].deadLine >= time
   4a) insert(maxHeap, jobArray[index])
   4b) index = index-1
5. if maxHeap is not empty
   j = removeMax(maxHeap)
   print j.id
6. time = time-1
   if time > 0
   goto step 4.
   return;
This will print the jobs in reverse order.
It can be modified to print them in the right order.
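For reference, here is a hedged Python sketch of the heap-based pseudo code. Note that this answer works with a profit per job instead of an execution time, so the sketch assumes every job takes exactly one time unit; that assumption, the function name schedule_by_profit and the (id, deadline, profit) tuple format are not from the original answer. Python's heapq is a min-heap, so profits are negated to get max-heap behaviour.

```python
import heapq

def schedule_by_profit(jobs):
    """jobs is a list of (id, deadline, profit); each job takes one time unit.

    Returns the ids of the scheduled jobs in execution order.
    """
    jobs = sorted(jobs, key=lambda job: job[1])   # non-decreasing deadline
    index = len(jobs) - 1
    time = jobs[-1][1] if jobs else 0             # start at the latest deadline
    heap = []
    reversed_order = []
    while time > 0:
        # Every job whose deadline is at or after this slot is a candidate.
        while index >= 0 and jobs[index][1] >= time:
            job_id, _, profit = jobs[index]
            heapq.heappush(heap, (-profit, job_id))
            index -= 1
        if heap:
            _, job_id = heapq.heappop(heap)       # most profitable candidate
            reversed_order.append(job_id)
        time -= 1
    return reversed_order[::-1]                   # slots were filled backwards
```

Reversing the collected list at the end is one way to print the jobs in the right order.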

Assigning jobs to people using Maximum Flow, harder version

I am self studying max flow and there was this problem:
the original problem is
Suppose we have a list of jobs
{J1, J2,..., Jm}
and a list of people that have applied for them
{P1, P2, P3,...,Pn}
each person has different interests and some of them have applied for multiple jobs (each person has a list of jobs they can do)
nobody is allowed to do more than 3 jobs.
so, this problem can be solved by finding a maximum flow in the graph below
I understand this solution, but
the harder version of the problem
what if these conditions are added?
the first 3 conditions of the easy version (lists of jobs and persons and each person has a list of interests or abilities) are still the same
the company is employing only Vi persons for job Ji
the company wants to employ as many people as possible
there is no limitations for the number of jobs a person is allowed to do.
What should I change in the graph so that my solution satisfies those conditions as well? Or, if I need a different approach, please tell me.
Before anyone says anything, this is not homework. It's just self-study, but I am studying maximum flow and the problem was in that area, so the solution should use maximum flow.
For multiple persons for a single job:
The edge from Ji to t will have the capacity equal to the number of people for that job. E.g. If job #1 can have three people, this translates to a capacity of three for the edge from J1 to t.
For the requirement of hiring as many people as possible:
I don't think this is possible with a single flow-graph. Here is an algorithm for how it could be done:
1. Run the flow-algorithm once.
2. For each person:
   2.1. Try to decrease the incoming capacity to one below the current flow-rate.
   2.2. Run the flow-algorithm again.
   2.3. While this does not decrease the total flow, repeat from (2.1.).
   2.4. Increase the capacity by one, to restore the maximum flow.
3. Until no further persons get added, repeat from (2.).
For no limitation on number of jobs:
The edges from s to Pi will have a capacity equal to the number of applicable jobs for that person.
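The capacity changes above can be illustrated with a small Python sketch. Several things here are assumptions rather than part of the answer: the function names build_graph and max_flow, the node labels "s", "t", ("P", name) and ("J", name), a capacity of 1 on each person-to-job edge, and the choice of Edmonds-Karp (BFS augmenting paths) as the flow algorithm. The iterative procedure for hiring as many people as possible is not included.

```python
from collections import defaultdict, deque

def build_graph(abilities, vacancies):
    """abilities: person -> list of jobs; vacancies: job -> Vi."""
    capacity = defaultdict(dict)
    for person, jobs in abilities.items():
        capacity["s"][("P", person)] = len(jobs)      # no limit beyond own list
        for job in jobs:
            capacity[("P", person)][("J", job)] = 1   # assumed unit capacity
    for job, count in vacancies.items():
        capacity[("J", job)]["t"] = count             # Vi people for job Ji
    return capacity

def max_flow(capacity, source="s", sink="t"):
    """Edmonds-Karp on a dict-of-dicts capacity map."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual[u][v] += cap
    total = 0
    while True:
        parent = {source: None}                       # BFS tree
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in list(residual[u].items()):
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total                              # no augmenting path left
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:                  # find path bottleneck
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:                  # push flow along the path
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck
```

With abilities {"P1": ["J1", "J2"], "P2": ["J1"], "P3": ["J1"]} and vacancies {"J1": 2, "J2": 1}, the maximum flow is 3: all three positions can be filled.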

load balancing algorithms - special example

Let's pretend I have two buildings where I can build different units.
A building can only build one unit at the same time but has a fifo-queue of max 5 units, which will be built in sequence.
Every unit has a build-time.
I need to know the best way to get my units as fast as possible, considering the units already in the build queues of my buildings.
"Famous" algorithms like round-robin don't work here, I think.
Are there any algorithms, which can solve this problem?
This reminds me a bit of starcraft :D
I would just add an integer to the building queue which represents the time it is busy.
Of course you have to update this variable once per timeunit. (Timeunits are "s" here, for seconds)
So let's say we have a building and we are submitting 3 units, each take 5s to complete. Which will sum up to 15s total. We are in time = 0.
Then we have another building where we are submitting 2 units that need 6 timeunits to complete each.
So we can have a table like this:
Time 0
Building 1, 3 units, 15s to complete.
Building 2, 2 units, 12s to complete.
Time 1
Building 1, 3 units, 14s to complete.
Building 2, 2 units, 12s to complete.
And we want to add another unit that takes 2s, we can simply loop through the selected buildings and pick the one with the lowest time to complete.
In this case this would be building 2. This would lead to Time2...
Time 2
Building 1, 3 units, 13s to complete
Building 2, 3 units, 11s+2s=13s to complete
Time 5
Building 1, 2 units, 10s to complete (5s are over, the first unit pops out)
Building 2, 3 units, 10s to complete
And so on.
Of course you have to take care of the upper boundaries in your production facilities. Like if a building has 5 elements, don't assign something and pick the next building that has the lowest time to complete.
I don't know if you can implement this easily with your engine, or if it even supports some kind of time units.
This will just result in updating all production facilities once per time unit, which is O(n) where n is the number of buildings that can produce something. Submitting a unit takes O(1), assuming that you keep the selected buildings in sorted order, lowest first, so it is just a first-element lookup. In this case you have to re-sort the list after manipulating the units, for example after cancelling or adding.
Otherwise amit's answer seems to be possible, too.
This is an NP-complete problem (proof at the end of the answer), so your best hope of finding the ideal solution is to try all possibilities (there are 2^n of them, where n is the number of tasks).
A possible heuristic was suggested in a comment (and improved in comments by AShelly): sort the tasks from biggest to smallest and put them in one queue; every building can then take an element from the queue when it is done.
this is of course not always optimal, but I think will get good results for most cases.
Proof that the problem is NP-complete:
Let S={u|u is a unit that needs to be produced}. (S is the set containing all 'tasks'.)
Claim: if there is a perfect split (both queues finish at the same time), it is optimal. Let this time be HalfTime.
This is true because any different solution would have at least one of the queues finishing at t>HalfTime, and thus it would not be better.
Assume we had an algorithm A that produces the best solution in polynomial time. Then we could solve the partition problem in polynomial time with the following algorithm:
1. run A on input
2. if the 2 queues finish exactly at HalfTime - return True.
3. else: return False
This solves the partition problem because of the claim: if a partition exists, it will be returned by A, since it is optimal. All of steps 1, 2, 3 run in polynomial time (1 by the assumption; 2 and 3 are trivial). So the algorithm we suggested solves the partition problem in polynomial time. Thus, our problem is NP-complete.
Here's a simple scheme:
Let U be the list of units you want to build, and F be the set of factories that can build them. For each factory, track total time-til-complete; i.e. How long until the queue is completely empty.
Sort U by decreasing time-to-build. Maintain sort order when inserting new items
At the start, or at the end of any time tick after a factory completes a unit or runs out of work:
Make a ready list of all the factories with space in the queue
Sort the ready list by increasing time-til-complete
Get the factory that will be done soonest
Take the first item from U, add it to that factory
Repeat until U is empty or all queues are full.
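The scheme above could be sketched in Python roughly as follows. The function name assign_longest_first, the dict-of-lists representation of the factory queues and the max_queue parameter (5, from the question) are assumptions for illustration. For brevity the sketch recomputes each queue's time-til-complete with sum() instead of tracking it incrementally.

```python
def assign_longest_first(units, factories, max_queue=5):
    """Assign build times in `units` to factory queues, longest first.

    factories maps a factory name to the list of build times already queued
    and is modified in place. Returns the units that could not be queued.
    """
    pending = sorted(units, reverse=True)        # decreasing time-to-build
    while pending:
        # Ready list: factories that still have space in their queue.
        ready = [name for name, queue in factories.items()
                 if len(queue) < max_queue]
        if not ready:
            break                                # all queues are full
        # The factory that will be done soonest gets the longest unit.
        soonest = min(ready, key=lambda name: sum(factories[name]))
        factories[soonest].append(pending.pop(0))
    return pending
```

Starting from queues {"A": [5, 5, 5], "B": [6, 6]} and new units [2, 4, 3], the 4 goes to B, the 3 to A and the 2 to B, leaving both factories with 18 time units of work.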
Googling "minimum makespan" may give you some leads into other solutions. This CMU lecture has a nice overview.
It turns out that if you know the set of work ahead of time, this problem is exactly multiprocessor scheduling, which is NP-complete. Apparently the algorithm I suggested is called "Longest Processing Time", and it will always give a result no longer than 4/3 of the optimal time.
If you don't know the jobs ahead of time, it is a case of online Job-Shop Scheduling
The paper "The Power of Reordering for Online Minimum Makespan Scheduling" says:
"for many problems, including minimum makespan scheduling, it is reasonable to not only provide a lookahead to a certain number of future jobs, but additionally to allow the algorithm to choose one of these jobs for processing next and, therefore, to reorder the input sequence."
Because you have a FIFO on each of your factories, you essentially do have the ability to buffer the incoming jobs, because you can hold them until a factory is completely idle, instead of trying to keep all the FIFOs full at all times.
If I understand the paper correctly, the upshot of the scheme is to
Keep a fixed size buffer of incoming jobs. In general, the bigger the buffer, the closer to ideal scheduling you get.
Assign a weight w to each factory according to a given formula, which depends on buffer size. In the case where buffer size = number of factories + 1, use weights of (2/3, 1/3) for 2 factories; (5/11, 4/11, 2/11) for 3.
Once the buffer is full, whenever a new job arrives, you remove the job with the least time to build and assign it to a factory with a time-to-complete < w*T where T is total time-to-complete of all factories.
If there are no more incoming jobs, schedule the remainder of jobs in U using the first algorithm I gave.
The main problem in applying this to your situation is that you don't know when (if ever) there will be no more incoming jobs. But perhaps just replacing that condition with "if any factory is completely idle", and then restarting, will give decent results.
