Critical path with given start and end times instead of cost - algorithm

I'm having trouble computing the critical path of a network of activities. The data I have to work with is a little different from the simple examples I have seen on the web: I have the start and end times of each activity, from which I deduce its duration. I have been using the following algorithm to compute the earliest and latest start and end times for each activity:
"To find the earliest start times, start at activities with no predecessors, and say they start at time zero. Then repeatedly find an activity whose predecessors' start times have all been filled, and set the start time to the maximum predecessor finish time.
To find the latest start times, run the preceding algorithm backwards. Start at activities with no successors. Set their finish time to the maximum finish time from the previous phase. Repeatedly find a predecessor whose successors have all been evaluated. Set its finish time to the earliest successor's start time.
Now it is trivial to evaluate slack = latest start – earliest start. Some chain of events will have slack time equal to zero; this is the critical path."
source : https://stackoverflow.com/questions/6368404/find-the-critical-path-and-slack-time
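To make the quoted procedure concrete, here is a minimal Python sketch of the forward/backward pass, assuming the network is given as dictionaries of durations and predecessors (illustrative names, not from the original question); activities with zero slack form the critical path:

from collections import defaultdict, deque

def critical_path(durations, predecessors):
    # durations:    activity -> duration (deduced from the given start/end times)
    # predecessors: activity -> set of predecessor activities
    successors = defaultdict(set)
    indegree = {a: len(predecessors.get(a, ())) for a in durations}
    for a, preds in predecessors.items():
        for p in preds:
            successors[p].add(a)

    # Forward pass: earliest start = max of the predecessors' earliest finish.
    earliest_start, earliest_finish, order = {}, {}, []
    queue = deque(a for a, deg in indegree.items() if deg == 0)
    while queue:
        a = queue.popleft()
        order.append(a)
        earliest_start[a] = max((earliest_finish[p] for p in predecessors.get(a, ())), default=0)
        earliest_finish[a] = earliest_start[a] + durations[a]
        for s in successors[a]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)

    # Backward pass: latest finish = min of the successors' latest start.
    project_end = max(earliest_finish.values())
    latest_start, latest_finish = {}, {}
    for a in reversed(order):
        latest_finish[a] = min((latest_start[s] for s in successors[a]), default=project_end)
        latest_start[a] = latest_finish[a] - durations[a]

    slack = {a: latest_start[a] - earliest_start[a] for a in durations}
    return [a for a in order if slack[a] == 0]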
My code sometimes identifies the critical activities that make up the critical path correctly, but because of the data I have, it sometimes fails. I have found out when this happens: it is when the given times of an activity (from which the cost is deduced) do not respect the computed earliest and latest times. Right now I only take the cost of each activity into account, but obviously that is not enough, because in a case like the one in the picture below the computed critical path is not accurate:
One failing case for the algorithm above: http://imageshack.us/a/img688/2420/casemp.png
Obviously activity B is critical (if its end time is shifted, the end of the project is shifted as well), but the algorithm computes a slack of 1...
I don't know how to change the algorithm to make it work for case above.

I have found an easy way to identify the critical activities from the data I have. For each activity, I simulate a one-second delay (add one second to its end time), propagate that delay to all of its successors, and test whether it affects the end time of the last activity. If so, that activity is critical.
This works great, and I now have a list of all the critical activities, but in some cases it can take several seconds (23 seconds for 450 activities with a lot of dependencies!), so I am still trying to find a better way.
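For concreteness, a minimal sketch of that delay-propagation test in Python, assuming the data is held as start/end dictionaries and a successor map (illustrative names, not from the question); it is O(n) per activity, hence roughly O(n²) overall, which matches the timing reported above:

def is_critical(activity, start, end, successors):
    # start, end: activity -> scheduled start/end time in seconds
    # successors: activity -> iterable of successor activities
    new_start, new_end = dict(start), dict(end)
    new_end[activity] += 1  # simulate a one-second delay
    frontier = [activity]
    while frontier:
        current = frontier.pop()
        for nxt in successors.get(current, ()):
            # If the delayed predecessor now finishes after the successor starts,
            # the successor (and its whole chain) is pushed back by the gap.
            push = new_end[current] - new_start[nxt]
            if push > 0:
                new_start[nxt] += push
                new_end[nxt] += push
                frontier.append(nxt)
    return max(new_end.values()) > max(end.values())

Calling this for every activity and keeping those that return True reproduces the list of critical activities.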

Map Reduce Job to find the popular items in a time window

I was asked this question in an interview, and I'm not sure if I gave the proper answer, so I would like some insights.
The problem: there is a stream of users and items. Each minute, I receive a list of tuples (user, item), each representing that user u consumed item i. I need to find the top 100 popular items in the past hour, i.e., count how many users consumed each item and sort them. The trick is that if an item was consumed by the same user more than once in the past hour, only one consumption is counted; repeated consumptions by the same user do not add up. The interviewer said that I should think big: there would be millions of consumptions per hour. So he suggested that I do a map-reduce job, or something else that can deal with this amount of data per minute.
The solution I came up with: I said I could maintain a list (or a matrix, if you prefer) of the consumed (user, item, timestamp) tuples, as if there were a shifting time window. Something like:
u1,i1,t1
u1,i2,t1
u2,i2,t2... and so on.
Each minute, when I receive the stream of user-item consumptions for that minute, I first run a map-reduce job to update the time-window matrix with the current timestamps. This job could be done with two mappers (one for the stream and the other for the time-window list), and the reducer would simply take the maximum timestamp for each pair. Pseudo-code for what I did:
mapTimeWindow(line):
    user, item, timestamp = line.split(" ")
    context.write(key=(user, item), value=timestamp)

mapStream(line):
    user, item = line.split(" ")
    context.write(key=(user, item), value=now())

reducer(key, list):
    user, item = key
    context.write(key=(user, item), value=max(list))
Next, I run another map-reduce to calculate the popularity by counting how many users appear for each item in that list. My map reads the updated time-window list and writes the item and a 1; the reducer calculates the sum of the list for each item. Since I am storing all the timestamps, I can check whether each consumption is within the past hour or not. Pseudo-code for the second map-reduce:
mapPopularity(line):
    user, item, timestamp = line.split(" ")
    if now() - 60 > timestamp:
        return  # older than the time window, skip it
    context.write(key=item, value=1)  # one write per (user, item) pair, so no repetition

reducerPopularity(key, list):
    context.write(key=key, value=sum(list))
Finally, we can run one more map-reduce that reads the result of the second job and extracts the 100 items with the largest counts.
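A minimal sketch of that final step, assuming the second job's output is available as (item, count) pairs (using a plain heap selection rather than a full map-reduce job):

import heapq

def top_items(item_counts, n=100):
    # item_counts: iterable of (item, count) pairs produced by the popularity job
    return heapq.nlargest(n, item_counts, key=lambda pair: pair[1])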
My question: is this solution acceptable for the interview I had? It uses three map-reduce jobs to solve the problem, which seems like quite a lot to execute every minute; since the result has to be refreshed every minute, the whole thing cannot take longer than that. I put quite a lot of effort into it, but the interviewer never gave me feedback on whether it was right or not. I would like to know: is it possible to make it faster? Or is it possible to deal with this in another way (maybe not map-reduce)?
Whether your solution is acceptable or not is ultimately a matter of opinion. The interviewer could appreciate your algorithm, or perhaps your problem-solving process and your thinking; only your interviewer can ultimately tell. Your solution certainly follows a logic and does the job, provided the algorithm you wrote is implemented completely and correctly.
My solution:
As you explained, the main concern is performance, since we have big data, so we should keep space complexity, time complexity, and the number of executions to the minimum necessary.
Space complexity
I would keep one list of [user, timestamp] pairs per item (or a more performant collection, depending on the libraries you use, but I will keep it simple here; see the note about dicts at the end). Every new item gets its own list. This is better than one overall list of [user, timestamp, item] triples: the extra field costs memory, and extracting the entries "per item" requires an additional map (or at least filter) operation over all existing associations, whereas with a per-item collection you can get the list for an item directly by hash or by reference in the code. This model is the minimalistic one.
Time complexity
That said, there are two operations: purging and popularity extraction. We want to limit work, but since you must check the timestamps every time you calculate the current popularity, you have to scan the per-item list, which is O(n).
Therefore: filter out the entries older than 60 minutes, the way you did; this purges the expired associations. Then the popularity is simply len(list_of_that_item), which is O(1). Done.
The linear cost is paid by the filtering; a reduce operation that merely counts the non-expired entries without purging would pay a similar cost. Only if deleting from the list turns out to have a bigger overhead might you want to benchmark a non-deleting variant that keeps associations "forever" and purges them on a manually scheduled basis. The previous solution should perform better, but this one is worth mentioning for completeness.
Insertion
If you use dicts it's trivial (and more performant, too). Updating the timestamp and inserting if not present are the same code: strawberry["Mike"] = timestamp. Moreover, the overall set of associations is a dict with key=item and value=per_item_dict, where per_item_dict has key=user and value=timestamp. Therefore: data["strawberry"]["Mike"] = timestamp
Edit: adding some more code
Purge
data["strawberry"] = {k: v for k, v in data["strawberry"].items() if your_time_condition_expression}
Popularity check
After the purge: len(data["strawberry"])
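Putting the pieces together, a minimal single-process sketch of the dict-of-dicts approach (record, purge and top_items are illustrative names, not from the answer above):

import heapq
import time

WINDOW = 60 * 60          # one hour, in seconds
data = {}                 # item -> {user: timestamp of last consumption}

def record(user, item, timestamp=None):
    # Insert or refresh one consumption; a repeat by the same user only updates the timestamp.
    data.setdefault(item, {})[user] = time.time() if timestamp is None else timestamp

def purge(item, now=None):
    # Drop associations for this item that fell out of the one-hour window.
    now = time.time() if now is None else now
    data[item] = {u: ts for u, ts in data[item].items() if now - ts < WINDOW}

def top_items(n=100, now=None):
    # Purge every item, then return the n items with the most distinct users.
    for item in list(data):
        purge(item, now)
    return heapq.nlargest(n, data, key=lambda item: len(data[item]))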

Algorithm for animating elements running across a scene

I'm not sure if the title is right but...
I want to animate (with HTML + canvas + JavaScript) a section of a road with a given density/flow/speed configuration. For that, I need a "source" of vehicles at one end and a "sink" at the other. A parameter would then determine how many vehicles per time unit are created, and their (constant) speed. Then, I guess, I should have a "clock" loop that increments the position of the vehicles at a given frame rate. Preferably, a user could change some values in a form and the running animation would update accordingly.
The end result should be a (much more sophisticated, hopefully) variation of this (sorry for the blinking):
Actually this is a very common problem; there are thousands of screen savers that use this effect, most notably the "star field", which has parameters for star generation and star movement. So I believe there must be some "design pattern", or at least a more widespread form (maybe even a name), for this algorithm. What would solve my problem is some example or tutorial on how to achieve this with common control flow (loops, counters, ifs).
Any idea is much appreciated!
I'm not sure about your question; this doesn't seem like an algorithm question, more like a request for programming advice. I have a game which needs exactly this (for monsters, not cars), and this is what I did. It is in a sort of .NET pseudocode, but similar stuff exists in other environments.
If you are running an animation by hand, you essentially need a "game-loop".
while (noinput):
    timenow = getsystemtime()
    timedelta = timenow - timeprevious
    update_object_positions(timedelta)
    draw_stuff_to_screen()
    timeprevious = timenow
    noinput = not check_for_input()
The update_object_positions(timedelta) call moves everything along by timedelta, which is how long since the loop last executed. As written it runs flat out, redrawing every timedelta. If you want it to run at a constant rate, say once every 20 ms, you can stick in a thread.sleep(20 - timedelta) to pad each iteration out to 20 ms.
Returning to your question: I had a car class that included its speed, lane, type etc., as well as the time at which it appears. I had a finite number of "cars", so these were pre-generated. I held them in a list sorted by the time they appeared. Then, in the update_object_position(time) routine, I checked whether the next car had a start time before the current time, and if so I popped cars off the list until the first (next) car had a start time in the future.
You want (I guess) an infinite number of cars. This requires only a slight variation. Generate the first car for each lane and record its start time. Whenever update_object_position() actually starts a car, generate the next car for that lane, with its own start time, and make that the pending car. If you have patterns that you want to repeat, generate the whole pattern in one go into a list, and then generate a new pattern when that list is emptied. This would also work well for letting users specify variable traffic patterns.
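A minimal Python sketch of that per-lane generation inside the update step (the Car class, lane dictionary and spawn-gap parameter are illustrative; in an HTML canvas version the same logic would sit inside the requestAnimationFrame callback):

import random

class Car:
    def __init__(self, lane, start_time, speed):
        self.lane, self.start_time, self.speed = lane, start_time, speed
        self.position = 0.0

def make_next_car(lane, current_time, mean_gap=2.0, speed=30.0):
    # Pre-generate the next car for this lane; it enters the road at start_time.
    return Car(lane, current_time + random.expovariate(1.0 / mean_gap), speed)

def update_object_positions(timedelta, current_time, active_cars, next_car_per_lane, road_length=500.0):
    # Spawn cars whose start time has arrived and immediately queue up their successors.
    for lane, car in list(next_car_per_lane.items()):
        if car.start_time <= current_time:
            active_cars.append(car)
            next_car_per_lane[lane] = make_next_car(lane, current_time)
    # Move every active car along, and drop the ones that reached the sink.
    for car in active_cars:
        car.position += car.speed * timedelta
    active_cars[:] = [c for c in active_cars if c.position < road_length]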
Finally, have you looked at what happens in real traffic flows as the volume mounts? Random small braking actions cause the cars behind to slightly over-react, and as those slight over-reactions accumulate, cars end up completely stopping a kilometre back up the road. It's quite strange, and so it might be a great effect in your wallpaper/screensaver as well as being a proper simulation.

Data model to use for a DVR's recording schedule

A DVR needs to store a list of programs to record. Each program has a starting time and duration. This data needs to be stored in a way that allows the system to quickly determine if a new recording request conflicts with existing scheduled recordings.
The issue is that merely looking for a show with a conflicting start time is inadequate, because the end of a longer program can overlap with a shorter one. I suppose one could create a data structure that tracked the availability of each time slice, perhaps at half-hour granularity, but this fails if we cannot assume all shows start and end on half-hour boundaries, and tracking at minute granularity seems inefficient, both in storage and in lookup.
Is there a data structure that allows one to query by range, where you supply the lower and upper bound and it returns a collection of all elements that fall within or overlap that range?
An interval tree (maybe using the augmented tree data structure?) does exactly what you're looking for. You'd enter all scheduled recordings into the tree and when a new request comes in, check whether it overlaps any of the existing intervals. Both this lookup and adding a new request take O(log(n)) time, where n is the number of intervals currently stored.
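If the DVR rejects conflicting requests, the stored recordings never overlap, and a sorted list with binary search can stand in for the interval tree; a minimal sketch under that assumption (a simplification, not the tree itself):

import bisect

class Schedule:
    # Non-overlapping recordings, kept sorted by start time.
    def __init__(self):
        self.starts = []   # starts[i] and ends[i] describe one recording
        self.ends = []

    def conflicts(self, start, end):
        # True if [start, end) overlaps any scheduled recording; O(log n).
        i = bisect.bisect_right(self.starts, start)
        if i > 0 and self.ends[i - 1] > start:                # previous recording still running
            return True
        return i < len(self.starts) and self.starts[i] < end  # next one begins too soon

    def add(self, start, end):
        if self.conflicts(start, end):
            return False
        i = bisect.bisect_right(self.starts, start)
        self.starts.insert(i, start)
        self.ends.insert(i, end)
        return True

The interval tree remains the more general answer, since it also handles the case where overlapping entries are allowed to coexist and a range query should return all of them.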

A Greedy algorithm for k-limited resources

I am studying greedy algorithms and I am wondering about the solution to a different case.
For the interval selection problem, we want to pick the maximum number of activities that do not clash with each other, so selecting the job with the earliest finishing time works.
Another example: we are given n jobs and we want to buy as few resources as possible. Here, we can sort all the job endpoints from left to right; when we encounter a start point we increment a counter, and when we encounter an end point we decrement it. The largest value this counter reaches is the number of resources we need to buy.
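A minimal sketch of that sweep, assuming the jobs are given as (start, end) pairs (an illustrative representation):

def min_resources(jobs):
    events = [(s, 1) for s, _ in jobs] + [(e, -1) for _, e in jobs]
    events.sort()   # at equal times the -1 sorts first, so back-to-back jobs can share a resource
    needed = current = 0
    for _, delta in events:
        current += delta
        needed = max(needed, current)
    return needed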
But what if we have n tasks and only k resources, i.e., we cannot afford more than k resources? What would a greedy solution look like that removes as few tasks as possible to satisfy this?
Also, if there is a specific name for this last problem, I would be happy to hear it.
This looks like a general case of the version where we have only one resource.
Intuitively, it makes sense to still sort the jobs by end time and take them one by one in that order. Now, instead of the ending time of the last job, we keep track of the ending times of the last job accepted on each of the k resources. For each job, we check whether its start time is greater than the end time of the last job on any of our resources. If no such resource is found, we skip that job and move on. If one resource is found, we assign the job to that resource and update its end time. If more than one resource can take the job, it makes sense to assign it to the resource with the latest end time.
I don't really have a proof of this greedy strategy, so it may well be wrong. But I cannot think of a case where changing the choice might enable us to fit more jobs.
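A minimal sketch of this strategy (jobs as (start, end) pairs, k the number of resources; illustrative names). Keeping the per-resource end times in a sorted list makes "the feasible resource with the latest end time" a binary search:

import bisect

def max_jobs_with_k_resources(jobs, k):
    # Greedy: process jobs by finish time; place each on the feasible resource
    # whose current end time is latest, or skip the job if none is free.
    resource_ends = [float("-inf")] * k       # kept sorted ascending
    accepted = []
    for start, end in sorted(jobs, key=lambda j: j[1]):
        i = bisect.bisect_right(resource_ends, start) - 1   # rightmost end time <= start
        if i < 0:
            continue                          # every resource is still busy: drop this job
        resource_ends.pop(i)                  # reuse that resource...
        bisect.insort(resource_ends, end)     # ...and record its new end time
        accepted.append((start, end))
    return accepted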

load balancing algorithms - special example

Let's pretend I have two buildings in which I can build different units.
A building can only build one unit at a time, but it has a FIFO queue of at most 5 units, which are built in sequence.
Every unit has a build time.
I need to know the fastest way to get my units finished as soon as possible, considering the units already in the build queues of my buildings.
"Famous" algorithms like round-robin don't work here, I think.
Are there any algorithms that can solve this problem?
This reminds me a bit of starcraft :D
I would just add an integer to each building's queue that represents how long the building will be busy.
Of course you have to update this variable once per time unit. (Time units are "s" here, for seconds.)
So let's say we have a building and we are submitting 3 units that each take 5s to complete, which sums up to 15s total. We are at time = 0.
Then we have another building to which we are submitting 2 units that each need 6s to complete.
So we can have a table like this:
Time 0
Building 1, 3 units, 15s to complete.
Building 2, 2 units, 12s to complete.
Time 1
Building 1, 3 units, 14s to complete.
Building 2, 2 units, 11s to complete.
Now we want to add another unit that takes 2s; we can simply loop through the selected buildings and pick the one with the lowest time to complete.
In this case that is building 2, which leads us to Time 2...
Time 2
Building 1, 3 units, 13s to complete
Building 2, 3 units, 10s+2s=12s to complete
...
Time 5
Building 1, 2 units, 10s to complete (5s are over, the first unit pops out)
Building 2, 3 units, 9s to complete
And so on.
Of course you have to take care of the capacity limits of your production facilities: if a building already has 5 elements queued, don't assign anything to it and pick the next building with the lowest time to complete.
I don't know if you can implement this easily with your engine, or if it even supports some kind of time units.
This just means updating all production facilities once per time unit, which is O(n), where n is the number of buildings that can produce something. Submitting a unit takes O(1), assuming you keep the buildings sorted by time to complete, lowest first, so it is just a first-element lookup. In that case you have to re-sort the list after manipulating the queues, e.g. cancelling or adding units.
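A minimal sketch of that bookkeeping (the Building class, the queue limit, and the tick granularity are illustrative details):

from collections import deque

MAX_QUEUE = 5

class Building:
    def __init__(self, name):
        self.name = name
        self.queue = deque()            # remaining build time of each queued unit
    def busy_for(self):
        return sum(self.queue)          # seconds until this queue is empty

def submit(unit_build_time, buildings):
    # Assign a unit to the building that will be free soonest and still has queue space.
    candidates = [b for b in buildings if len(b.queue) < MAX_QUEUE]
    if not candidates:
        return None                     # every queue is full
    best = min(candidates, key=Building.busy_for)
    best.queue.append(unit_build_time)
    return best

def tick(buildings, dt=1):
    # Advance time by dt; the front unit of each queue is the one being built.
    for b in buildings:
        remaining = dt
        while remaining > 0 and b.queue:
            if b.queue[0] > remaining:
                b.queue[0] -= remaining
                remaining = 0
            else:
                remaining -= b.queue.popleft()   # unit finished, start the next one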
Otherwise, amit's answer seems possible, too.
This is an NP-complete problem (proof at the end of this answer), so your best hope of finding the ideal solution is to try all possibilities (that is 2^n possibilities, where n is the number of tasks).
A possible heuristic was suggested in a comment (and improved in the comments by AShelly): sort the tasks from biggest to smallest and put them in one queue; whenever a building finishes a task, it takes the next element from that queue.
This is of course not always optimal, but I think it will give good results in most cases.
Proof that the problem is NP-complete:
Let S = {u | u is a unit that needs to be produced} (S is the set containing all 'tasks').
Claim: if a perfect split is possible (both queues finish at the same time), it is optimal. Let this finishing time be HalfTime.
This is true because the total amount of work is fixed, so no schedule can finish before HalfTime = (total work)/2; in any other split at least one of the queues has to finish at some t > HalfTime, and thus that split is not better.
Proof:
Assume we had an algorithm A that produces the best solution in polynomial time. Then we could solve the partition problem in polynomial time with the following algorithm:
1. Run A on the input.
2. If the two queues finish exactly at HalfTime, return True.
3. Otherwise, return False.
This solves the partition problem because of the claim: if a perfect partition exists, A will return it, since it is optimal. All of steps 1, 2 and 3 run in polynomial time (step 1 by assumption, steps 2 and 3 trivially). So the suggested algorithm solves the partition problem in polynomial time; since partition is NP-complete, so is our problem (in its decision form).
Q.E.D.
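For small inputs, the exhaustive 2^n search mentioned at the top of this answer is easy to write; a minimal sketch for the two-queue case, ignoring the 5-unit queue limit (build times given as a plain list, an illustrative representation):

from itertools import product

def best_split(build_times):
    # Try every assignment of units to the two queues and keep the one
    # whose slower queue finishes earliest (the minimum makespan).
    best_makespan, best_assignment = float("inf"), None
    for assignment in product((0, 1), repeat=len(build_times)):
        loads = [0, 0]
        for duration, queue in zip(build_times, assignment):
            loads[queue] += duration
        if max(loads) < best_makespan:
            best_makespan, best_assignment = max(loads), assignment
    return best_makespan, best_assignment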
Here's a simple scheme:
Let U be the list of units you want to build, and F be the set of factories that can build them. For each factory, track the total time-til-complete, i.e., how long until its queue is completely empty.
Sort U by decreasing time-to-build. Maintain sort order when inserting new items.
At the start, or at the end of any time tick in which a factory completes a unit or runs out of work:
Make a ready list of all the factories with space in their queue
Sort the ready list by increasing time-til-complete
Get the factory that will be done soonest
Take the first item from U and add it to that factory
Repeat until U is empty or all queues are full.
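A minimal sketch of one assignment pass of this scheme (the queue limit and the list-of-lists factory representation are illustrative details, not from the answer); it would be called at the start and again after any tick in which a factory completes a unit or runs out of work:

QUEUE_LIMIT = 5

def assign_ready_units(pending, factories):
    # pending:   build times sorted longest-first (the list U)
    # factories: one [time_til_complete, queued_count] pair per factory
    made = []
    while pending:
        ready = [(t, i) for i, (t, n) in enumerate(factories) if n < QUEUE_LIMIT]
        if not ready:
            break                               # every queue is full; wait for the next tick
        _, i = min(ready)                       # the factory that will be done soonest
        build_time = pending.pop(0)             # the longest remaining unit
        factories[i][0] += build_time
        factories[i][1] += 1
        made.append((i, build_time))
    return made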
Googling "minimum makespan" may give you some leads into other solutions. This CMU lecture has a nice overview.
It turns out that if you know the set of work ahead of time, this problem is exactly Multiprocessor_scheduling, which is NP-Complete. Apparently the algorithm I suggested is called "Longest Processing Time", and it will always give a result no longer than 4/3 of the optimal time.
If you don't know the jobs ahead of time, it is a case of online Job-Shop Scheduling
The paper "The Power of Reordering for Online Minimum Makespan Scheduling" says
for many problems, including minimum
makespan scheduling, it is reasonable
to not only provide a lookahead to a
certain number of future jobs, but
additionally to allow the algorithm to
choose one of these jobs for
processing next and, therefore, to
reorder the input sequence.
Because you have a FIFO on each of your factories, you essentially do have the ability to buffer the incoming jobs, because you can hold them until a factory is completely idle instead of trying to keep all the FIFOs full at all times.
If I understand the paper correctly, the upshot of the scheme is to:
Keep a fixed-size buffer of incoming jobs. In general, the bigger the buffer, the closer to ideal scheduling you get.
Assign a weight w to each factory according to a given formula, which depends on the buffer size. In the case where buffer size = number of factories + 1, use weights of (2/3, 1/3) for 2 factories; (5/11, 4/11, 2/11) for 3.
Once the buffer is full, whenever a new job arrives, remove the job with the least time to build and assign it to a factory with a time-to-complete < w*T, where T is the total time-to-complete of all factories.
If there are no more incoming jobs, schedule the remainder of the jobs in U using the first algorithm I gave.
The main problem in applying this to your situation is that you don't know when (if ever) there will be no more incoming jobs. But perhaps just replacing that condition with "if any factory is completely idle" and then restarting will give decent results.
