I have a massive (~10^9) set of events, each characterized by a start and end time. Given a time, I want to find how many of those events were ongoing at that time.
What sort of data structure would be helpful in this situation? The operations I need to be fast are:
Inserting new events, e.g., {start: 100000 milliseconds, end: 100010 milliseconds}.
Querying the number of concurrent events at a given time.
Update: Someone added a computational-geometry tag to this, so I figure I should rephrase it in terms of computational geometry. I have a set of 1-dimensional intervals and I want to calculate how many of those intervals intersect a given point. Insertion of new intervals must be fast.
You're looking for an interval tree.
Construction: O(n log n), where n is the number of intervals
Query: O(m+log n), where m is the number of query results and n is the number of intervals
Space: O(n)
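If you don't want to roll your own, here's a minimal sketch using the third-party Python intervaltree package (the library choice is my assumption; note that it treats intervals as half-open):

    from intervaltree import IntervalTree   # pip install intervaltree

    tree = IntervalTree()
    tree[100000:100011] = "event-1"   # {start: 100000 ms, end: 100010 ms}, half-open
    tree[100005:100020] = "event-2"

    def concurrent(t):
        return len(tree[t])           # intervals containing time t

    print(concurrent(100007))         # -> 2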
Just to add to the other answers, depending on the length of time and the granularity desired, you could simply have an array of counters. For example, if the length of time is 24 hours and the desired granularity is 1 millisecond, there will be 86,400,000 cells in the array. With one 4 byte int per cell (which is enough to hold 10^9), that will be less than 700 MB of RAM, versus tree-based solutions which would take at least (8+8+4+4)*10^9 = 24 GB of RAM for two pointers plus two ints per tree node (since 32 bits of addressable memory is insufficient, you'd need 64 bits per pointer). You can use swap, but this will slow down some queries considerably.
You can also use this solution if you only care about the last 24 hours of data, for example, by using the array as a circular buffer. Besides the limitation on time and granularity, the other downside is that insertion time of an interval is proportional to the length of the interval, so if interval length is unbounded, you could be in trouble. Queries, on the other hand, are a single array lookup.
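A minimal sketch of the counter-array idea (my own illustration, assuming millisecond granularity, inclusive endpoints, and a fixed 24-hour circular window with no expiry logic):

    WINDOW_MS = 24 * 60 * 60 * 1000            # 24 hours at 1 ms granularity
    counters = [0] * WINDOW_MS

    def insert(start_ms, end_ms):
        # O(end - start): bump every cell the interval covers
        for t in range(start_ms, end_ms + 1):
            counters[t % WINDOW_MS] += 1       # circular-buffer indexing

    def query(t_ms):
        return counters[t_ms % WINDOW_MS]      # O(1) array lookup

    insert(100000, 100010)
    print(query(100005))                       # -> 1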
(Extending the answers by tskuzzy and Snowball)
A balanced binary search tree makes sense, except that the memory requirements would be excessive for your data set. A B-tree would be much more memory efficient, albeit more complicated unless you can use a library.
Keep two trees, one of start times and one of end times. To insert an event, add the start time to the tree of start times and the end time to the tree of end times. To query the number of active events at time T, search the start-time tree to find out how many start times are less than T, and search the end-time tree to find out how many end times are less than T. Subtract the number of end times from the number of start times, and that's the number of active events.
Insertions and queries should both take O(log N) time.
A few comments:
The way you have phrased the question, you only care about the number of active events, not which events were active. This means you do not need to keep track of which start time goes with which end time! This also makes it easier to avoid the "+M" term in the queries cited by previous answers.
Be careful about the exact semantics of your query. In particular, does an event count as active at time T if it starts at time T? If it ends at time T? The answers to these questions affect whether you use < or <= in certain places.
Do not use a "set" data structure, because you almost certainly want to allow and count duplicates. That is, more than one event might start and/or end at the same time. A set would typically ignore duplicates. What you are looking for instead is a "multiset" (sometimes called a "bag").
Many binary search trees do not support "number of elements < T" queries out of the box. But it is easy to add this functionality by storing a size at each node.
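Putting the pieces together, here's a minimal sketch using the third-party sortedcontainers library's SortedList (an order-statistic multiset; the library choice is my assumption, and any multiset with rank queries works):

    from sortedcontainers import SortedList   # multiset with O(log N) rank queries

    starts, ends = SortedList(), SortedList()

    def insert(start, end):
        starts.add(start)
        ends.add(end)

    def active_at(t):
        # events started by time t, minus events already ended before t;
        # swap bisect_left/bisect_right to change the boundary semantics
        started = starts.bisect_right(t)   # count of start times <= t
        ended = ends.bisect_left(t)        # count of end times < t
        return started - ended

    insert(100000, 100010)
    print(active_at(100005))               # -> 1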
Suppose we have a sorted set (e.g., a balanced binary search tree or a skip list) data structure with N elements. Furthermore, suppose that the sorted set has O(log N) search time, O(log N) insert time, and O(N) space usage (these are reasonable assumptions, see red-black tree, for example).
One possibility is to have two sorted sets, bystart and byend, respectively sorted by the start and end times of the events.
To find the number of events that are ongoing at time t, ask byend for the first interval whose end time is greater than t: an O(log N) search operation. Call the start time of this interval left. Now, ask bystart for the number of intervals whose start time is greater than or equal to left and less than t. This is O(log N + M), where M is the number of such intervals. So, the total time for a search is O(log N + M).
Insertion was O(log N) for sorted sets, which we have to do once for each sorted set. This makes the total time for the insertion operation O(log N).
Construction of this structure from scratch just consists of N insertion operations, so the total time for construction is O(N log N).
Space usage is O(N) for each sorted set, so the total space usage is O(N).
Summary:
Insert: O(log N), where N is the number of intervals
Construct: O(N log N)
Query: O(log N + M), where M is the number of results
Space: O(N)
Related
Is there a data structure with elements that can be indexed whose insertion runtime is O(1)? For example, I could index the data structure like so: a[4], and yet when inserting an element at an arbitrary place in the data structure, the runtime is still O(1)? Note that the data structure does not maintain sorted order, just the ability for each sequential element to have an index.
I don't think it's possible, since inserting anywhere other than at the end or beginning of the ordered data structure would mean that all the indices after the insertion point must be updated to reflect that their index has increased by 1, which would take worst-case O(n) time. If the answer is no, could someone prove it mathematically?
EDIT:
To clarify, I want to maintain the order of insertion of elements, so upon inserting, the item inserted remains sequentially between the two elements it was placed between.
The problem that you are looking to solve is called the list labeling problem.
There are lower bounds on the cost that depend on the relationship between the maximum number of labels you need (n) and the number of possible labels (m).
If n is in O(log m), i.e., if the number of possible labels is exponential in the number of labels you need at any one time, then O(1) cost per operation is achievable... but this is not the usual case.
If n is in O(m), i.e., if they are proportional, then O(log^2 n) per operation is the best you can do, and the algorithm is complicated.
If n <= m^2, then you can do O(log n). Amortized O(log n) is simple, and O(log n) worst case is hard. Both algorithms are described in this paper by Dietz and Sleator. The hard way makes use of the O(log^2 n) algorithm mentioned above.
HOWEVER, maybe you don't really need labels. If you just need to be able to compare the order of two items in the collection, then you are solving a slightly different problem called "list order maintenance". This problem can actually be solved in constant time -- O(1) cost per operation and O(1) cost to compare the order of two items -- although again O(1) amortized cost is a lot easier to achieve.
When inserting into slot i, append the element which was first at slot i to the end of the sequence.
If the sequence capacity must be grown, then this growing may not necessarily be O(1).
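A tiny sketch of that trick (my own illustration); note that the displaced element loses its relative position, so this fits the original question but not the clarified one:

    def insert_at(seq, i, value):
        # O(1) amortized: displace the old occupant of slot i to the end
        if i == len(seq):
            seq.append(value)
        else:
            seq.append(seq[i])
            seq[i] = value

    seq = [10, 20, 30]
    insert_at(seq, 1, 99)
    print(seq)    # -> [10, 99, 30, 20]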
Suppose the list of intervals is [[1,3],[2,4],[6,12]] and the query time is T = 3. The number of intervals in the above list which contain 3 is 2, namely [[1,3],[2,4]]. Is it possible to do this in O(log n) time?
This cannot be done in O(log n) time in the general case.
You can binary search on the start time to find the last interval that could possibly contain the query time, but because there's no implied ordering on the end times, you have to sequentially search from the start of the list to the item you identified as the last, to determine if the query time is in any of those intervals.
Consider, for example, [(1,7),(2,11),(3,8),(4,5),(6,10),(7,9)], with a query time of 7.
Binary search on the start time will tell you that all of the intervals could contain the query time. But because the ending times are not in any particular order, you can't do a binary search on them. You have to look at each individual interval to determine if the ending time is greater than or equal to the query time. Here, you see that (4,5) does not contain the query time.
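Here's a short sketch of that search on the same example (my own illustration, treating endpoints as inclusive):

    from bisect import bisect_right

    def count_containing(intervals, t):
        # intervals are sorted by start time; end times are in no particular order
        starts = [s for s, _ in intervals]
        k = bisect_right(starts, t)                        # O(log n) cutoff
        return sum(1 for _, e in intervals[:k] if e >= t)  # O(n) sequential scan

    intervals = [(1, 7), (2, 11), (3, 8), (4, 5), (6, 10), (7, 9)]
    print(count_containing(intervals, 7))                  # -> 5; (4, 5) is excluded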
Well, one thing to note is that for an interval to contain T, its start time must be less than or equal to T. Since these are sorted by start time, you can use a basic binary search to eliminate all the ones which start too late in O(log n) time.
If we can assume that these are also sorted by end time -- that is, no interval completely encompasses a previous interval -- then you can use another binary search to eliminate all the ones whose end times are before T. That will keep the running time in O(log n).
If we can't make that assumption, things get more complex, and I can't think of any way to do better than O(n log n) [by sorting the remaining list by end time and performing another binary search on that]. Perhaps there's a way?
EDIT: As Qbyte says below, the final sort is superfluous; you can get it down to O(n) with a simple linear search on the remaining set. Then again, if you're going with an O(n) solution anyway, you may as well skip the entire algorithm and just do a linear search on the original set.
Let's take your assumption that the intervals are sorted by start time. A binary search, O(log n), will eliminate the intervals that can't contain T. The remaining ones might.
Assuming End Time is not also Sorted (OP)
You have to scan the remaining ones, O(n), counting them. The total complexity is O(n). Given this, you might as well have skipped the binary search and just scanned the whole list.
Assuming End Time is also Sorted
If the remaining ones are sorted by end time as well, you can do another binary search, keeping the complexity at O(log n).
But you're not done: you need the count. You know the total count to start with (if you didn't, you couldn't have binary searched), and you know the indexes where each binary search landed, so from there the count is an O(1) calculation.
Thus the total complexity for this case is O(log n).
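For the doubly-sorted case, a minimal sketch (my own illustration, treating endpoints as inclusive and using the question's example):

    from bisect import bisect_left, bisect_right

    def count_containing(starts, ends, t):
        # both lists sorted ascending; endpoints treated as inclusive
        started = bisect_right(starts, t)   # intervals with start <= t
        ended = bisect_left(ends, t)        # intervals with end < t
        return started - ended

    starts, ends = [1, 2, 6], [3, 4, 12]      # intervals [1,3], [2,4], [6,12]
    print(count_containing(starts, ends, 3))  # -> 2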
I'm looking to implement an algorithm which, given an array of integers and a list of ranges (intervals) in that array, returns the number of distinct elements in each interval. That is, given the array A and a range [i,j], return the size of the set {A[i], A[i+1], ..., A[j]}.
Obviously, the naive approach (iterate from i to j and count, ignoring duplicates) is too slow. Range-sum tricks seem inapplicable, since (A ∪ B) − B isn't always equal to A.
I've looked up Range Queries in Wikipedia, and it hints that Yao (in '82) showed an algorithm that does this for semigroup operators (which union seems to be) with linear preprocessing time and space and almost constant query time. The article, unfortunately, is not available freely.
Edit: it appears this exact problem is available at http://www.spoj.com/problems/DQUERY/
There's a rather simple algorithm which uses O(N log N) time and space for preprocessing and O(log N) time per query. First, create a persistent segment tree for answering range-sum queries (initially, it should contain zeroes at all positions). Then iterate through the elements of the given array, keeping track of the latest position of each number. At each iteration, create a new version of the persistent segment tree by putting 1 at the latest position of the current element (since only one element's position is updated per iteration, only one position's value in the segment tree changes, so the update can be done in O(log N)). To answer a query (l, r), you just need to find the sum over the (l, r) segment in the version of the tree that was created when iterating through the r-th element of the initial array.
Hope this algorithm is fast enough.
Upd. There's a little mistake in my explanation: at each step, at most two positions' values in the segment tree might change (because it's necessary to put 0 at the previous latest position of a number when it's updated). However, this doesn't change the complexity.
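For reference, a compact sketch of that approach (my own illustration; recursive and unoptimized, so it may be too slow for SPOJ's limits as written):

    class Node:
        __slots__ = ("left", "right", "total")
        def __init__(self, left=None, right=None, total=0):
            self.left, self.right, self.total = left, right, total

    def update(node, lo, hi, pos, delta):
        # returns a new root; untouched subtrees are shared between versions
        if lo == hi:
            return Node(total=(node.total if node else 0) + delta)
        mid = (lo + hi) // 2
        left = node.left if node else None
        right = node.right if node else None
        if pos <= mid:
            left = update(left, lo, mid, pos, delta)
        else:
            right = update(right, mid + 1, hi, pos, delta)
        return Node(left, right,
                    (left.total if left else 0) + (right.total if right else 0))

    def query(node, lo, hi, l, r):
        if node is None or r < lo or hi < l:
            return 0
        if l <= lo and hi <= r:
            return node.total
        mid = (lo + hi) // 2
        return query(node.left, lo, mid, l, r) + query(node.right, mid + 1, hi, l, r)

    def build_versions(a):
        n = len(a)
        roots = [Node()]          # roots[i] = tree after the first i elements
        last = {}                 # value -> its latest 1-based position
        for i, x in enumerate(a, start=1):
            cur = roots[i - 1]
            if x in last:         # clear the mark at the previous latest position
                cur = update(cur, 1, n, last[x], -1)
            roots.append(update(cur, 1, n, i, +1))
            last[x] = i
        return roots

    a = [1, 1, 2, 1, 3]
    roots = build_versions(a)
    # distinct values in a[2..4] (1-based): query version r on segment [l, r]
    print(query(roots[4], 1, len(a), 2, 4))   # -> 2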
You can answer any of your queries in constant time by performing a quadratic-time precomputation:
    # Precompute the answer for every interval i..j (A has n elements).
    answers = {}
    for i in range(n):
        seen = set()                        # hash-backed set
        count = 0
        for j in range(i, n):
            if A[j] not in seen:            # new distinct element
                count += 1
                seen.add(A[j])
            answers[(i, j)] = count         # store the answer for interval i..j
This algorithm takes quadratic time, since for each interval we perform a bounded number of operations, each taking constant time (note that the set is backed by a hash table), and there is a quadratic number of intervals.
If you don't have additional information about the queries (total number of queries, distribution of intervals), you cannot do essentially better, since the total number of intervals is already quadratic.
You can trade off the quadratic precomputation for n linear on-the-fly computations: after receiving a query of the form A[i..j], precompute (in O(n) time) the answers for all intervals A[i..k], k >= i. This guarantees that the amortized complexity remains quadratic, and you are not forced to perform the complete quadratic precomputation at the beginning.
Note that the obvious algorithm (the one you call obvious in the statement) is cubic, since you scan every interval completely.
Here is another approach which might be quite closely related to the segment tree. Think of the elements of the array as leaves of a full binary tree. If there are 2^n elements in the array there are n levels of that full tree. At each internal node of the tree store the union of the points that lie in the leaves beneath it. Each number in the array needs to appear once in each level (less if there are duplicates). So the cost in space is a factor of log n.
Consider a range A..B of length K. You can work out the union of points in this range by forming the union of sets associated with leaves and nodes, picking nodes as high up the tree as possible, as long as the subtree beneath those nodes is entirely contained in the range. If you step along the range picking subtrees that are as big as possible you will find that the size of the subtrees first increases and then decreases, and the number of subtrees required grows only with the logarithm of the size of the range - at the beginning if you could only take a subtree of size 2^k it will end on a boundary divisible by 2^(k+1) and you will have the chance of a subtree of size at least 2^(k+1) as the next step if your range is big enough.
So the number of semigroup operations required to answer a query is O(log n) - but note that the semigroup operations may be expensive as you may be forming the union of two large sets.
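Here's a sketch of that structure for the union semigroup (my own illustration, assuming the array length is a power of two):

    def build(a):
        # iterative segment tree; tree[i] holds the set of values under node i
        n = len(a)                       # assumed to be a power of two
        tree = [set() for _ in range(2 * n)]
        for i, x in enumerate(a):
            tree[n + i] = {x}
        for i in range(n - 1, 0, -1):
            tree[i] = tree[2 * i] | tree[2 * i + 1]   # union of the children
        return tree

    def range_union(tree, n, l, r):
        # union of the O(log n) node sets that exactly cover a[l..r] inclusive
        result = set()
        l += n; r += n + 1
        while l < r:
            if l & 1:
                result |= tree[l]; l += 1
            if r & 1:
                r -= 1; result |= tree[r]
            l >>= 1; r >>= 1
        return result

    a = [1, 1, 2, 1, 3, 2, 4, 4]
    tree = build(a)
    print(len(range_union(tree, len(a), 1, 4)))   # distinct in a[1..4] -> 3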
Is anybody able to give a 'plain English' intuitive, yet formal, explanation of what makes QuickSort n log n? From my understanding it has to make a pass over n items, and it does this log n times... I'm not sure how to put into words why it does this log n times.
Complexity
A Quicksort starts by partitioning the input into two chunks: it chooses a "pivot" value, and partitions the input into those less than the pivot value and those larger than the pivot value (any items equal to the pivot have to go into one chunk or the other, but for a basic description it doesn't matter much which).
Since the input (by definition) isn't sorted, to partition it like that, it has to look at every item in the input, so that's an O(N) operation. After it's partitioned the input the first time, it recursively sorts each of those "chunks". Each of those recursive calls looks at every one of its inputs, so between the two calls it ends up visiting every input value (again). So, at the first "level" of partitioning, we have one call that looks at every input item. At the second level, we have two partitioning steps, but between the two, they (again) look at every input item. Each successive level has more individual partitioning steps, but in total the calls at each level look at all the input items.
It continues partitioning the input into smaller and smaller pieces until it reaches some lower limit on the size of a partition. The smallest that could possibly be would be a single item in each partition.
Ideal Case
In the ideal case we hope each partitioning step breaks the input in half. The "halves" probably won't be precisely equal, but if we choose the pivot well, they should be pretty close. To keep the math simple, let's assume perfect partitioning, so we get exact halves every time.
In this case, the number of times we can break it in half will be the base-2 logarithm of the number of inputs. For example, given 128 inputs, we get partition sizes of 64, 32, 16, 8, 4, 2, and 1. That's 7 levels of partitioning (and yes log2(128) = 7).
So, we have log(N) partitioning "levels", and each level has to visit all N inputs. So, log(N) levels times N operations per level gives us O(N log N) overall complexity.
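To make the levels concrete, here's a minimal (non-in-place) quicksort sketch with a random pivot; each call does O(len(a)) partitioning work, and with good pivots the recursion is about log N levels deep:

    import random

    def quicksort(a):
        if len(a) <= 1:
            return a
        pivot = random.choice(a)                  # pivot choice drives the depth
        smaller = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        larger = [x for x in a if x > pivot]
        return quicksort(smaller) + equal + quicksort(larger)

    print(quicksort([3, 6, 2, 8, 5, 1, 7, 4]))    # -> [1, 2, 3, 4, 5, 6, 7, 8]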
Worst Case
Now let's revisit that assumption that each partitioning level will "break" the input precisely in half. Depending on how good a choice of partitioning element we make, we might not get precisely equal halves. So what's the worst that could happen? The worst case is a pivot that's actually the smallest or largest element in the input. In this case, we do an O(N) partitioning level, but instead of getting two halves of equal size, we've ended up with one partition of one element and one partition of N-1 elements. If that happens at every level of partitioning, we obviously end up doing O(N) partitioning levels before every partition is down to one element.
This gives the technically correct big-O complexity for Quicksort (big-O officially refers to the upper bound on complexity). Since we have O(N) levels of partitioning, and each level requires O(N) steps, we end up with O(N * N) (i.e., O(N^2)) complexity.
Practical implementations
As a practical matter, a real implementation will typically stop partitioning before it actually reaches partitions of a single element. In a typical case, when a partition contains, say, 10 elements or fewer, you'll stop partitioning and use something like an insertion sort (since it's typically faster for a small number of elements).
Modified Algorithms
More recently, other modifications to Quicksort have been invented (e.g., Introsort, PDQ Sort) which prevent that O(N^2) worst case. Introsort does so by keeping track of the current partitioning "level", and when/if it goes too deep, it'll switch to a heap sort, which is slower than Quicksort for typical inputs, but guarantees O(N log N) complexity for any inputs.
PDQ Sort adds another twist: since heap sort is slower, it tries to avoid switching to heap sort if possible. To do that, if it looks like it's getting poor pivot values, it'll randomly shuffle some of the inputs before choosing a pivot. Then, if (and only if) that fails to produce sufficiently better pivot values, it'll switch to using heap sort instead.
Each partitioning operation takes O(n) operations (one pass on the array).
On average, each partitioning divides the array into two roughly equal parts, so there are about log n levels of partitioning. In total we have O(n * log n) operations.
That is, on average there are log n partitioning levels, and each level takes O(n) operations.
There's a key intuition behind logarithms:
The number of times you can divide a number n by a constant before reaching 1 is O(log n).
In other words, if you see a runtime that has an O(log n) term, there's a good chance that you'll find something that repeatedly shrinks by a constant factor.
In quicksort, what's shrinking by a constant factor is the size of the largest recursive call at each level. Quicksort works by picking a pivot, splitting the array into two subarrays of elements smaller than the pivot and elements bigger than the pivot, then recursively sorting each subarray.
If you pick the pivot randomly, then there's a 50% chance that the chosen pivot will be in the middle 50% of the elements, which means that there's a 50% chance that the larger of the two subarrays will be at most 75% the size of the original. (Do you see why?)
Therefore, a good intuition for why quicksort runs in time O(n log n) is the following: each layer in the recursion tree does O(n) work, and since each recursive call has a good chance of reducing the size of the array by at least 25%, we'd expect there to be O(log n) layers before you run out of elements to throw away.
This assumes, of course, that you're choosing pivots randomly. Many implementations of quicksort use heuristics to try to get a nice pivot without too much work, and those implementations can, unfortunately, lead to poor overall runtimes in the worst case. @Jerry Coffin's excellent answer to this question talks about some variations on quicksort that guarantee O(n log n) worst-case behavior by switching which sorting algorithm is used, and that's a great place to look for more information about this.
Well, it's not always O(n log n). That is the running time when the chosen pivot is approximately in the middle. In the worst case, if you choose the smallest or the largest element as the pivot, the time will be O(n^2).
To visualize 'n log n', you can assume the pivot to be the element closest to the average of all the elements in the array to be sorted.
This would partition the array into 2 parts of roughly the same length.
On both of these you apply the quicksort procedure.
As each step halves the length of the array, you will do this log n (base 2) times until you reach length = 1, i.e., a sorted array of 1 element.
Break the sorting algorithm into two parts: the first is the partitioning and the second is the recursive call. The complexity of partitioning is O(N), and in the ideal case the recursion is O(log N) levels deep. For example, if you have 4 inputs, there will be 2 (= log 4) levels of recursive calls. Multiplying the two gives O(N log N). It is a very basic explanation.
In fact, you need to find the final position of all N elements (each is used as a pivot once), but the maximum number of comparisons is log N for each element (the first pivot is compared against N elements, each second-level pivot against N/2, each third-level pivot against N/4, and so on, assuming the pivot is always the median element).
In the ideal scenario, the first-level call places 1 element in its proper position. There are 2 calls at the second level taking O(n) time combined, but they put 2 elements in their proper positions. In the same way, there will be 4 calls at the 3rd level taking O(n) combined time, placing 4 elements into their proper positions. So the depth of the recursion tree is log(n), and at each depth O(n) time is needed across all the recursive calls. So the time complexity is O(n log n).
I am currently having trouble identifying and understanding the complexity time of the following algorithm.
Background: There is a list of files, each containing a list of candidate Ids. Neither the number of files nor the number of candidates within them is fixed.
How would you calculate the time complexity for an algorithm which is responsible for:
Reading each file and adding all the unique candidate Ids into a Hashset?
Thanks.
I'm just repeating what amit said, so please give him the upvote if this is clear to you; I find that explanation a bit confusing.
Your average complexity is O(n), where n is the total number of candidates (from all files). So if you have a files, each with b candidates, then the time taken is proportional to a * b.
This is because the simplest way to solve your problem is simply to loop through all the data, adding each candidate to the set; the set discards duplicates as necessary.
Looping over all values takes time proportional to the number of values (that is the O(n) part), and adding a value to a hash set takes constant time on average (the O(1) part). Since that is constant time per entry, your overall time remains O(n).
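For concreteness, a sketch of that loop (my own illustration, assuming candidate_files is a list of paths with one candidate id per line):

    unique_ids = set()
    for path in candidate_files:               # 'a' files
        with open(path) as f:
            for line in f:                     # 'b' candidates per file
                unique_ids.add(line.strip())   # O(1) average per insert
    # total: O(n) on average, where n = a * b total candidates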
However, hash sets have a strange worst-case behaviour: in some (unusual) cases they take time proportional to the size of their contents. So in the very worst case, each time you add a value it requires O(m) work, where m is the number of entries in the set.
Now m is (approximately; it starts at zero and grows up to) the number of distinct values, so we have two common cases:
If the number of distinct candidates keeps increasing as we read more (so, for example, 90% of the candidates in the files are always new), then m is proportional to n. That means the work of adding each candidate grows proportionally to n, so the total work is proportional to n^2 (since for each candidate we do work proportional to n, and there are n candidates). So the worst case is O(n^2).
If the number of distinct candidates is actually fixed, then as you read more and more files they tend to be full of already-known candidates. In that case the extra work for inserting into the set is constant (you only hit the strange behaviour a fixed number of times, for the unique candidates; it doesn't depend on n). In that case the performance of the set does not keep getting worse as n gets larger and larger, so the worst-case complexity remains O(n).