How to find the maximum number of tasks running at time x? - algorithm

The problem description is as follows:
There are n events on a particular day d, each having a start time and a duration. Example:
e1 10:15:06 11 ms (ms = milliseconds)
e2 10:16:07 12 ms
......
I need to find a time x and a count n, where x is the time at which the maximum number n of events were executing.
The first solution I am considering is:
Scan every millisecond of day d. But that requires 86,400,000 * n calculations in total. Example:
Check at 00:00:00.001 how many events are running
Check at 00:00:00.002 how many events are running
Then take the max over the whole range (00:00:00.000, 23:59:59.999).
The second solution I am considering is:
for event_i in all events:
    running_events = 1
    for event_j in all events where event_j != event_i:
        if event_j.start_time in Range(event_i.start_time, event_i.end_time):
            running_events++
and then take the max of running_events over all events.
Is there any better solution for this?

This can be solved in O(n log n) time:
Make an array of all event boundaries (each event contributes a start and an end). This array is already partially sorted: O(n)
Sort the array: O(n log n); your library should be able to exploit the partial sortedness (Timsort does that very well); look into distribution-based sorting algorithms for better expected running time.
Sort event boundaries ascending w.r.t. the boundary time.
At equal times, sort event ends before event starts if touching intervals are considered non-overlapping
(sort event ends after event starts if touching intervals are considered overlapping).
Initialise running = 0, running_best = 0, best_at = 0
For each event boundary:
If it's a start of an event, increment running
If running > running_best, set running_best = running and best_at = the current boundary time
If it's an end of an event, decrement running
output best_at
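Here is a short Python sketch of this sweep; representing events as (start_ms, duration_ms) pairs is my assumption:

def busiest_time(events):
    START, END = 0, 1
    boundaries = []
    for start, duration in events:
        boundaries.append((start, START))
        boundaries.append((start + duration, END))
    # At equal times, ends sort before starts, so touching intervals count
    # as non-overlapping; flip the tie-break for the opposite convention.
    boundaries.sort(key=lambda b: (b[0], -b[1]))
    running = best = best_at = 0
    for time, kind in boundaries:
        if kind == START:
            running += 1
            if running > best:
                best, best_at = running, time
        else:
            running -= 1
    return best_at, best

# e1 at 10:15:06.000 for 11 ms, e2 at 10:16:07.000 for 12 ms (from the question)
print(busiest_time([(36906000, 11), (36967000, 12)]))  # (36906000, 1)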

You could reduce the number of points you check by checking only the endpoints of all intervals: for each interval (task) I that lasts from t1 to t2, you only need to check how many tasks are running at t1 and at t2 (assuming the task runs from t1 to t2 inclusive; if it is exclusive, check t1-EPSILON, t1+EPSILON, t2-EPSILON and t2+EPSILON).
It is easy to see (convince yourself why) that you cannot get a better count at any time these candidates do not cover.
Example:
tasks run in `[0.5,1.5],[0,1.2],[1,3]`
candidates: 0,0.5,1,1.2,1.5,3
0 -> 1 task
0.5 -> 2 tasks
1 -> 3 tasks
1.2 -> 3 tasks (assuming inclusive, end of interval)
1.5 -> 2 tasks (assuming inclusive, end of interval)
3 -> 1 task (assuming inclusive, end of interval)
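A brute-force check over just these candidate points is easy to write; a small Python sketch (inclusive intervals assumed):

def max_concurrent(intervals):
    candidates = sorted({t for interval in intervals for t in interval})
    best_time, best_count = None, 0
    for t in candidates:
        count = sum(1 for a, b in intervals if a <= t <= b)
        if count > best_count:
            best_time, best_count = t, count
    return best_time, best_count

print(max_concurrent([(0.5, 1.5), (0, 1.2), (1, 3)]))  # (1, 3)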

Related

Algorithm to select a best combination from two list

I have search results for a two-way flight, so there are two lists containing the departure flights and the arrival flights, such as:
The departure flights list has 20 flights.
The arrival flights list has 30 flights.
So I will have 600 (20*30) combinations of departure flight and arrival flight. I will call this combination list the result list.
However, I only want to select a limited number of the 600 combinations; for instance, the best 100 flight combinations. The criterion for combining the flights is the cheapest total price for the departure and arrival flights.
To do that, I will sort the result list by the total price of departure and arrival flight, and then pick the first 100 elements from the result list to get what I want.
But if the departure flights list has 200 flights and the arrival flights list has 300 flights, I will have a result list with 60,000 elements. For that reason, I would have to sort a list with 60,000 elements to find the best 100.
So, is there an algorithm to select the best combinations in my case?
Thank you so much.
Not 100% clear from your question, but I understand that you are looking for a faster algorithm to find a certain number of best / cheapest combinations of departure and arrival flights.
You can do this much faster by sorting the lists of departure and arrival flights individually by cost and then using a heap for expanding the next-best combinations one-by-one until you have enough.
Here's the full algorithm -- in Python, but without using any special libraries, just standard data structures, so this should be easily transferable to any other language:
NUM_FLIGHTS, NUM_BEST = 1000, 100

# create test data: each entry corresponds to just the cost of one flight
from random import randint
dep = sorted([randint(1, 100) for i in range(NUM_FLIGHTS)])
arr = sorted([randint(1, 100) for i in range(NUM_FLIGHTS)])

def is_compatible(i, j):  # for checking constraints, e.g. timing of flights
    return True           # but for now, assume no constraints

# get best combination using sorted lists and heap
from heapq import heappush, heappop
heap = [(dep[0] + arr[0], 0, 0)]  # initial: best combination from dep and arr
result = []                       # the result list
visited = set()                   # make sure not to add combinations twice
while heap and len(result) < NUM_BEST:
    cost, i, j = heappop(heap)    # get next-best combination
    if (i, j) in visited:
        continue                  # did we see those before? skip
    visited.add((i, j))
    if is_compatible(i, j):       # if 'compatible', add to results
        result.append((cost, dep[i], arr[j]))
    # add 'adjacent' combinations to the heap
    if i < len(dep) - 1:          # next-best departure + same arrival
        heappush(heap, (dep[i+1] + arr[j], i+1, j))
    if j < len(arr) - 1:          # same departure + next-best arrival
        heappush(heap, (dep[i] + arr[j+1], i, j+1))
print(result)

# just for testing: compare to brute force (get best from all combinations)
comb = [(d, a) for d in dep for a in arr]
best = sorted((d + a, d, a) for (d, a) in comb)[:NUM_BEST]
print(best)
print(result == best)  # True -> same results as brute force (just faster)
This works roughly like this:
sort both the departure flights dep and the arrival flights arr by their cost
create a heap and put the best combination (best departure and best arrival) as well as the corresponding indices in their lists into the heap: (dep[0] + arr[0], 0, 0)
repeat until you have enough combinations or there are no more elements in the heap:
pop the best element from the heap (sorted by total cost)
if it satisfies the constraints, add it to the result set
make sure you do not add combinations twice to the result set, using the visited set
add the two 'adjacent' combinations to the heap, i.e. taking the same flight from dep and the next from arr, and the next from dep and the same from arr, i.e. (dep[i+1] + arr[j], i+1, j) and (dep[i] + arr[j+1], i, j+1)
Here's a very small worked example. The axes are (the costs of) the dep and arr flights, and the entries in the table are in the form n(c)m, where n is the iteration that entry was added to the heap (if it is at all), c is the cost, and m is the iteration it was added to the 'top 10' result list (if any).
dep\arr     1        3        4        6       7
   2      0(3)1    1(5)4    4(6)8    8(8)-     -
   2      1(3)2    2(5)6    6(6)9    9(8)-     -
   3      2(4)3    3(6)7    7(7)-      -       -
   4      3(5)5    5(7)-      -        -       -
   6      5(7)10     -        -        -       -
Result: (1,2), (1,2), (1,3), (3,2), (1,4), (3,2), (3,3), (2,4), (2,4), (1,6)
Note how the sums in both the columns and the rows of the matrix are always increasing, so the best results can always be found in a somewhat triangular area in the top-left. Now the idea is: if your currently best combination (the one that's first in the heap) is dep[i], arr[j], then there's no use in checking, e.g., combination dep[i+2], arr[j] before checking dep[i+1], arr[j], which cannot have a higher total cost; so you add dep[i+1], arr[j] (and likewise dep[i], arr[j+1]) to the heap, and repeat with popping the next element from the heap.
I compared the results of this algorithm to the results of your brute-force approach, and the resulting flights are the same, i.e. the algorithm works, and always yields the optimal result. Complexity should be O(n log(n)) for sorting the departure and arrival lists (n being the number of flights in those original lists), plus O(m log(m)) for the heap-loop (m iterations with log(m) work per iteration, m being the number of elements in the result list).
This finds the best 1,000 combinations of 100,000 departure and 100,000 arrival flights (for a total of 1,000,000,000,000 possible combinations) in less than one second.
Note that those numbers are for the case where you have no additional constraints, i.e. each departure flight can be combined with each arrival flight. If there are constraints, you can use the is_compatible function sketched in the above code to check them and skip incompatible pairings. For each incompatible pair with low total cost, the loop needs one additional iteration, so in the worst case (for example, if there are no compatible pairs at all, or when the only compatible pairs are those with the highest total cost) the algorithm could in fact expand all the combinations.
On average, though, this should not be the case, and the algorithm should perform rather quickly.
I think the best solution would be using some SQL statements to do the Cartesian product. You can apply any kind of filters, based on the data itself, ordering, range selection, etc. Something like this:
SELECT d.time as dep_time, a.time as arr_time, d.price+a.price as total_price
FROM departures d, arrivals a
WHERE a.time > d.time + X
ORDER BY d.price+a.price
LIMIT 0,100
Actually, X can even be 0, but the arrival should happen AFTER the departure anyway.
Why I would choose SQL:
It's closest to the data itself; you don't have to fetch the data into your application first
It's highly optimized, if you use indexes, I'm sure you can't beat its performance with your own code
It's simple and declarative :)

Interview q: Data structure and algorithm for O(1) retrieval of avg. response time in client server architecture

Interview question:
In a client-server architecture, there are multiple requests from multiple clients to the server. The server should maintain the response times of all the requests in the previous hour. What data structure and algo will be used for this? Also, the average response time needs to be maintained and has to be retrieved in O(1).
My take:
algo: maintain a running mean
mean = (mean_prev * n + current_response_time) / (n + 1)
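For example (numbers of my own choosing): with mean_prev = 10 ms over n = 4 requests and a new response time of 20 ms, mean = (10 * 4 + 20) / 5 = 12 ms.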
DS: a set (using order statistic tree).
My question is whether there is a better answer. I felt that my answer was very trivial, while the answers to the questions (in the interview) before and after this one were non-trivial.
EDIT:
Based on what amit suggested:
cleanup()
    while (queue is not empty and curr_time - queue.front().timestamp > 1hr)
        (timestamp, val) = queue.pop();
        sum = sum - val;
        n = n - 1;

insert(timestamp, val)
    queue.push(timestamp, val);
    sum = sum + val;
    n = n + 1;
    cleanup();

query_average()
    cleanup();
    return sum/n;
And if we can ensure that cleanup() is triggered once every hour or half hour, then query_average() will not take very long. But if someone were to implement a timer trigger for a function call, how would they do it?
The problem with your solution is that it only maintains the total average since the beginning of time, not the average over the last hour, as you are supposed to.
To do so, you need to maintain 2 variables and a queue of entries (timestamp,value).
The 2 variables will be n (the number of elements that are relevant to the last hour) and sum (the sum of the elements from the last hour).
When a new element arrives:
    queue.add(timestamp, value)
    sum = sum + value
    n = n + 1
When you have a query for average:
    while (queue is not empty and queue.front().timestamp < currentTimestamp() - 1 hour):
        (timestamp, value) = queue.pop()
        sum = sum - value
        n = n - 1
    return sum/n
Note that the above is still O(1) amortized, because every element is pushed onto and popped from the queue exactly once. You might add the above cleanup loop to the insertion procedure as well.
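A compact Python version of this scheme (a sketch; the class and method names are mine):

from collections import deque
import time

class SlidingAverage:
    def __init__(self, window=3600.0):     # one hour, in seconds
        self.window = window
        self.queue = deque()               # (timestamp, value) pairs
        self.total = 0.0
        self.n = 0

    def _cleanup(self, now):
        # drop entries that fell out of the window
        while self.queue and self.queue[0][0] < now - self.window:
            _, value = self.queue.popleft()
            self.total -= value
            self.n -= 1

    def insert(self, value, now=None):
        now = time.time() if now is None else now
        self.queue.append((now, value))
        self.total += value
        self.n += 1
        self._cleanup(now)

    def average(self, now=None):
        now = time.time() if now is None else now
        self._cleanup(now)
        return self.total / self.n if self.n else 0.0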

Transforming set of arbitrary intervals into set of continuous intervals, where possible

I have a practical situation where I need to minimize the amount of data.
Let's say I'm given a set of intervals of real numbers.
e.g. N1 = {(0,1],(1,2],(3,4]};
I would like to minimize this set to:
N2 = {(0,2],(3,4]};
So basically what I need is to combine multiple small intervals into continuous intervals, where it is possible.
Are there any clever/efficient algorithms for doing this? I would like to avoid inefficient for-each-ing.
*If this problem has some widely known name, please mention it in the comments.
This is a sweep-line algorithm.
Split the intervals into start and end points.
Sort the points.
Let count = 0.
Iterate through the points:
Whenever you encounter an end point:
Decrement the count.
If the count = 0, record this point.
Whenever you encounter a start point:
If the count = 0, record this point.
Increment the count.
As a technical note, when sorting, if both a start point and an end point have the same value, put the start point first, otherwise you may record that as a gap, as opposed to a continuous interval.
Example:
(0,1],(1,2],(3,4]
Point:   0 start   1 start   1 end   2 end   3 start   4 end
Count:      1         2        1       0        1        0
Record:    (0        N/A      N/A     2]       (3       4]
Getting the recorded values gives us {(0,2], (3,4]}.
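A Python sketch of this sweep; representing the intervals as (start, end) tuples is my assumption:

def merge_intervals(intervals):
    points = []
    for a, b in intervals:
        points.append((a, 0))  # start; 0 sorts before 1, so at equal values
        points.append((b, 1))  # starts come before ends, as noted above
    points.sort()
    merged, count, start = [], 0, None
    for value, kind in points:
        if kind == 0:
            if count == 0:
                start = value  # record: a new merged interval opens here
            count += 1
        else:
            count -= 1
            if count == 0:
                merged.append((start, value))  # record: the interval closes
    return merged

print(merge_intervals([(0, 1), (1, 2), (3, 4)]))  # [(0, 2), (3, 4)]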

Optimized algorithm to schedule tasks with dependency?

There are tasks that read from a file, do some processing and write to a file. These tasks are to be scheduled based on their dependencies. Also, tasks can run in parallel, so the algorithm needs to be optimized to run dependent tasks serially and, as much as possible, in parallel.
eg:
1. A -> B
2. A -> C
3. B -> D
4. E -> F
So one way to run this would be to run dependencies 1, 2 & 4 in parallel, followed by 3.
Another way could be to run 1 and then run 2, 3 & 4 in parallel.
Another could be to run 1 and 3 serially, with 2 and 4 in parallel.
Any ideas?
Let each task (e.g. A, B, ...) be a node in a directed acyclic graph, and define the arcs between the nodes based on your numbered dependencies 1, 2, ....
You can then topologically order your graph (or use a search-based method like BFS). In your example, C <- A -> B -> D and E -> F, so A & E have depth 0 and need to be run first. Then you can run F, B and C in parallel, followed by D.
Also, take a look at PERT.
Update:
How do you know whether B has a higher priority than F?
This is the intuition behind the topological sort used to find the ordering.
It first finds the root nodes (those with no incoming edges), since at least one must exist in a DAG. In your case, those are A & E. This settles the first round of jobs which need to be completed. Next, the children of the root nodes (B, C and F) need to be finished; these are easily obtained by querying your graph. The process is then repeated until there are no more nodes (jobs) to be found (finished).
Given a mapping between items, and items they depend on, a topological sort orders items so that no item precedes an item it depends upon.
This Rosetta code task has a solution in Python which can tell you which items are available to be processed in parallel.
Given your input the code becomes:
try:
    from functools import reduce
except:
    pass

data = { # From: http://stackoverflow.com/questions/18314250/optimized-algorithm-to-schedule-tasks-with-dependency
    # This <- This (Reverse of how shown in question)
    'B': set(['A']),
    'C': set(['A']),
    'D': set(['B']),
    'F': set(['E']),
    }

def toposort2(data):
    for k, v in data.items():
        v.discard(k)  # Ignore self dependencies
    extra_items_in_deps = reduce(set.union, data.values()) - set(data.keys())
    data.update({item: set() for item in extra_items_in_deps})
    while True:
        ordered = set(item for item, dep in data.items() if not dep)
        if not ordered:
            break
        yield ' '.join(sorted(ordered))
        data = {item: (dep - ordered) for item, dep in data.items()
                if item not in ordered}
    assert not data, "A cyclic dependency exists amongst %r" % data

print('\n'.join(toposort2(data)))
Which then generates this output:
A E
B C F
D
Items on one line of the output could be processed in any sub-order or, indeed, in parallel; just so long as all items of an earlier line are processed before items of following lines, to preserve the dependencies.
Your tasks form an oriented graph with (hopefully) no cycles.
It contains sources and wells (sources being tasks that depend on nothing, i.e. have no inbound edge; wells being tasks that unlock no other task, i.e. have no outbound edge).
A simple solution would be to give each task a priority based on its usefulness (let's call it U).
Typically, starting with the wells: they have a usefulness U = 1, because we want them to finish.
Put all the wells' predecessors in a list L of nodes currently being assessed.
Then, taking each node in L, its U value is the sum of the U values of the nodes that depend on it, plus 1. Put all parents of the current node into the list L.
Loop until all nodes have been treated.
Then start the task that can be started and has the biggest U value, because it is the one that will unlock the largest number of tasks.
In your example,
U(C) = U(D) = U(F) = 1
U(B) = U(E) = 2
U(A) = 4
Meaning you'll start A first (with E, if possible), then B and C (if possible), then D and F.
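A small Python sketch of this computation; the deps mapping from each task to its dependents is my own encoding:

def usefulness(deps, tasks):
    # deps: task -> set of tasks that depend on it (its children)
    U = {}
    def u(task):
        if task not in U:
            U[task] = 1 + sum(u(child) for child in deps.get(task, ()))
        return U[task]
    for task in tasks:
        u(task)
    return U

deps = {'A': {'B', 'C'}, 'B': {'D'}, 'E': {'F'}}
print(sorted(usefulness(deps, 'ABCDEF').items()))
# [('A', 4), ('B', 2), ('C', 1), ('D', 1), ('E', 2), ('F', 1)]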
First generate a topological ordering of your tasks, checking for cycles at this stage. Thereafter you can exploit parallelism by looking at maximal antichains: roughly speaking, these are task sets with no dependencies between their elements.
For a theoretical perspective, this paper covers the topic.
Without considering the serial/parallel aspect of the problem, this code can at least determine the overall serial solution:
def order_tasks(num_tasks, task_pair_list):
    # task_deps maps each task to the set of tasks it still depends on
    task_deps = {i: set() for i in range(num_tasks)}
    # store the dependencies
    for task, dep in task_pair_list:  # (task, dependency) pairs
        task_deps[task].add(dep)
    # loop through the tasks to determine the order
    while task_deps:
        delete_task = None
        # find a task with no outstanding dependencies
        for task in task_deps:
            if not task_deps[task]:
                delete_task = task
                print(task)
                break
        if delete_task is None:
            return -1  # no dependency-free task left: there must be a cycle
        task_deps.pop(delete_task)
        # remove delete_task from every remaining task's dependency set
        for task in task_deps:
            task_deps[task].discard(delete_task)
    return 0
If you update the loop that checks for tasks whose dependencies have been fully satisfied so that it scans the entire list and executes/removes every such task in the same pass, that should also allow you to take advantage of completing the tasks in parallel.
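For example, with the question's tasks encoded as numbers (a hypothetical mapping A=0, B=1, C=2, D=3, E=4, F=5), the fixed version above prints one valid serial order:

pairs = [(1, 0), (2, 0), (3, 1), (5, 4)]  # (task, dependency): B<-A, C<-A, D<-B, F<-E
order_tasks(6, pairs)                     # prints 0 1 2 3 4 5, one task per line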

Algorithm interview from Google

I am a long time lurker, and just had an interview with Google where they asked me this question:
Various artists want to perform at the Royal Albert Hall and you are responsible for scheduling
their concerts. Requests for performing at the Hall are accommodated on a first come first served
policy. Only one performance is possible per day and, moreover, there cannot be any concerts
taking place within 5 days of each other.
Given a requested time d which is impossible (i.e. within 5 days of an already scheduled performance), give an O(log n)-time algorithm to find the next available day d2 (d2 > d).
I had no clue how to solve it, and now that the interview is over, I am dying to figure out how to solve it. Knowing how smart most of you folks are, I was wondering if you can give me a hand here. This is NOT for homework, or anything of that sort. I just want to learn how to solve it for future interviews. I tried asking follow-up questions, but he said "that is all I can tell you."
You need a normal binary search tree of intervals of available dates. Just search for the interval containing d. If it does not exist, take the interval next (in in-order terms) after the point where the search stopped.
Note: contiguous intervals must be fused together in a single node. For example: the available-dates intervals {2 - 15} and {16 - 23} should become {2 - 23}. This might happen if a concert reservation was cancelled.
Alternatively, a tree of non-available dates can be used instead, provided that contiguous non-available intervals are fused together.
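To make the fused-intervals idea concrete, here is a small Python sketch using bisect over parallel sorted arrays instead of an explicit tree (the array layout and the infinity sentinel are my assumptions):

import bisect

# free_starts/free_ends: parallel sorted arrays of fused, disjoint
# available-date intervals [start, end]; the last end is float('inf'),
# so an available day always exists.
def next_available(free_starts, free_ends, d):
    i = bisect.bisect_right(free_starts, d) - 1  # interval that could contain d
    if i >= 0 and d <= free_ends[i]:
        return d                                 # d itself is available
    return free_starts[i + 1]                    # start of the next free interval

# free intervals {2-15} and {30-...}: day 20 is blocked, next free day is 30
print(next_available([2, 30], [15, float('inf')], 20))  # 30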
Store the scheduled concerts in a binary search tree and find a feasible solution by doing a binary search.
Something like this:
FindDateAfter(tree, x):
    n = tree.root
    if n.date < x:
        n = FindDateAfter(n.right, x)
    else if n.date > x and n.left.date < x:
        return n
    return FindDateAfter(n.left, x)

FindGoodDay(tree, x):
    n = FindDateAfter(tree, x)
    while (n.date + 10 < n.right.date):
        n = FindDateAfter(n, n.date + 5)
    return n.date + 5
I've used a binary search tree (BST) that holds the ranges for valid free days that can be scheduled for performances.
One of the ranges must end with int.MaxValue, because we have an unbounded number of days, so the last range can't be bounded.
The following code searches for the closest day to the requested day, and returns it.
The time complexity is O(H), where H is the tree height (usually H = log N, but it can degrade to H = N in some cases).
The space complexity is the same as the time complexity.
public static int FindConcertTime(TreeNode<Tuple<int, int>> node, int reqDay)
{
    // Not found!
    if (node == null)
    {
        return -1;
    }
    Tuple<int, int> currRange = node.Value;
    // Found range.
    if (currRange.Item1 <= reqDay &&
        currRange.Item2 >= reqDay)
    {
        // Return requested day.
        return reqDay;
    }
    // Go left.
    else if (currRange.Item1 > reqDay)
    {
        int suggestedDay = FindConcertTime(node.Left, reqDay);
        // Didn't find appropriate range in left nodes, or found day
        // is further than current option.
        if (suggestedDay == -1 || suggestedDay > currRange.Item1)
        {
            // Return current option.
            return currRange.Item1;
        }
        else
        {
            // Return suggested day.
            return suggestedDay;
        }
    }
    // Go right.
    // Will always find because the right-most node has "int.MaxValue" as Item2.
    else //if (currRange.Item2 < reqDay)
    {
        return FindConcertTime(node.Right, reqDay);
    }
}
Store the number of used nights per year, quarter, and month. To find a free night, find the first year that is not fully booked, then the quarter within that year, then the month. Then check each of the nights in that month.
Irregularities in the calendar system make this a little tricky, so instead of using years and months you can apply the idea with units of 4 nights as a "month", 16 nights as a "quarter", and so on.
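A tiny sketch of that idea in Python, using blocks of 16 nights at each level (the 4-level depth and all names are my assumptions):

from collections import Counter

LEVELS = 4                                   # supports 16**4 = 65536 nights
counts = [Counter() for _ in range(LEVELS)]  # counts[l][b] = booked nights in
                                             # block b of size 16**l at level l

def book(night):
    for l in range(LEVELS):
        counts[l][night // 16**l] += 1

def first_free(start=0):
    night = start
    while counts[0][night]:                  # night is booked: skip ahead
        l = 0                                # find the largest fully booked
        while l + 1 < LEVELS and counts[l+1][night // 16**(l+1)] == 16**(l+1):
            l += 1                           # block containing this night
        night = (night // 16**l + 1) * 16**l # jump past that whole block
    return night

for n in range(16):
    book(n)                                  # book the first 16 nights solid
print(first_free(0))  # 16: skips the fully booked block in one jump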
Assume that at level 1 all schedule details are available.
Group the schedule into blocks of 16 days at level 2.
Group 16 level-2 statuses at level 3.
Group 16 level-3 statuses at level 4.
Depending on the number of days you need to cover, add more levels.
Now search from the highest level down and do a binary search at the end.
Asymptotic complexity:
It describes how the runtime changes as the input grows.
Suppose we have an input string "abcd". Here we traverse each character to find its length, so the time taken is proportional to the number of characters in the string, n. Thus O(n).
But if we store the length of the string "abcd" in a variable, then no matter how long the string is, we can still find the length by looking at the variable len (len = 4).
Ex: return 23. No matter what the input is, the output is still 23.
Thus the complexity is O(1): the program runs in constant time with respect to the input size.
For O(log n), the operations happen in logarithmic steps.
https://drive.google.com/file/d/0B7eUOnXKVyeERzdPUE8wYWFQZlk/view?usp=sharing
Observe the image at the link above; the bent line is the logarithmic one. For small inputs the O(log n) and O(n) curves are close together, but as the input grows, the logarithmic curve rises far more slowly than the linear one.
There are also best- and worst-case scenarios to consider, as in the example above.
You can also refer to this cheat sheet for the algorithms: http://bigocheatsheet.com/
It was already mentioned above, but basically keep it simple with a binary tree. You know a balanced binary tree has O(log N) complexity, so you already know which algorithm you need to use.
All you have to do is come up with a tree node structure and use the binary tree insertion algorithm to find the next available date:
A possible one:
The tree node has two attributes: d (date of the concert) and d+5 (end date of the 5-day blocking period). Again, to keep it simple, use a timestamp for the two date attributes.
Now it is trivial to find the next available date by using the binary tree in-order insertion algorithm, starting from root = null.
Why not try to use Union-Find? You can group each concert day plus the next 5 days into one set, and then perform a FIND on the given day, which would return the next set's representative: your next concert date.
If implemented using a tree, this gives an O(log n) time complexity.
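A sketch of that idea in Python with path compression (dict-based; the names are mine, and "blocked" is taken as the concert day plus the following 5 days). Note that with path compression the amortized cost is near-constant rather than a strict O(log n) per operation:

parent = {}  # day -> forwarding pointer; days absent from the dict are free

def find(day):
    path = []
    while day in parent:           # follow pointers past booked windows
        path.append(day)
        day = parent[day]
    for p in path:                 # path compression
        parent[p] = day
    return day

def book(requested):
    day = find(requested)          # next free day >= requested
    for d in range(day, day + 6):  # the concert day + next 5 days form one set
        parent[d] = day + 6        # whose find() resolves past the block
    return day

print(book(10))  # 10
print(book(12))  # 16 (days 10-15 are blocked)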
