I'm working on something where I have to store a fairly large number of items (on the order of a few thousand).
Items will be inserted, deleted, and accessed with high frequency; however, the larger an item's value, the more likely it is to be inserted, deleted, or accessed.
Furthermore, I want to support very quick sorting of the first "k" items in the structure, where "k" is fairly small.
So, ideally, the cost of doing these operations will be directly correlated to their value.
A plain old linked list where items are always kept sorted would be the naive approach here, but performance in all cases is still important; I would ideally like operations to be better than O(n) in the general case.
I've brainstormed quite a bit on this issue, and I'm stumped.
I thought at first that a Beap might be the ideal data structure, but searches aren't biased towards the top of the beap; they instead start at a bottom corner and work their way over and upwards. Not what I need.
A binary search tree of some flavor doesn't seem to be the right solution, because although these operations are O(log n) I'd like to do better for larger values.
What I need is almost a binary search tree flipped on its side, so that I can start traversals from the lower-right. But I'm trying to wrap my head around that to see if it would even work for me. I think it would provide O(2 log n) worst-case performance, and better than O(log n) for larger values... but I'm not entirely sure.
Is there such a data structure? Or would I have to invent one?
If probabilistic data structures are acceptable, use (slightly modified) Skip list.
Elements should be stored in decreasing order. Also search operation should be modified to start search at the head element in the bottom list (not in the top list as for normal skip list).
Expected time for insert/delete/search operations is O(log K), where K is the number of elements larger than the inserted/deleted/searched one. Worst case time complexity is O(K).
Why can't you use heaps? They store largest items closer to the top and support all operations in sublinear time. Also, heapsort, which is obviously based on heaps, allows to partially sort your array (take top N elements) very quickly.
Related
I am currently studying algorithms and data structures with the help of the famous Stanford course by Tim Roughgarden. In video 13-1 when explaining Balanced Binary Search Trees he compared them to sorted arrays and mentioned that we do not do deletion on sorted array because it is too slow (I believe he meant "slow in comparison with other operations, that we can run in constant [Select, Min/Max, Pred/Succ], O(log n) [Search, Rank] and O(n) [Output/print] time").
I cannot stop thinking about this statement. Namely I cannot wrap my mind around the following:
Let's say we are given an order statistic or a value of the item we
want to delete from a sorted (ascending) array.
We can most certainly find its position in array using Select or
Search in constant or O(n) time respectively.
We can then remove this item and iterate over the items to the right
of the deleted one, incrementing their indices by one, which will take
O(n) time. [this is me (possibly unsuccessfully) trying to describe
the 'move each of them 1 position to the left' operation]
The whole operation will take linear time - O(n) - in the worst case
scenario.
Key question - Am I thinking in a wrong way? If not, why is it considered slow and undesirable?
You are correct: deleting from an array is slow because you have to move all elements after it one position to the left, so that you can cover the hole you created.
Whether O(n) is considered slow depends on the situation. Deleting from an array is most likely part of a larger, more complex algorithm, e.g. inside a loop. This then would add a factor of n to your final complexity, which is usually bad. Using a tree would only add a factor of log n, and O(n log n) is much better than O(n^2) (asymptotically).
The statement is relative to the specific data structure which is being used to hold the sorted values: A sorted array. This specific data structure would be selected for simplicity, for efficient storage, and for quick searches, but is slow for adding and removing elements from the data structure.
Other data structures which hold sorted values may be selected. For example, a binary tree, or a balanced binary tree, or a trie. Each has different characteristics in terms of operation performance and storage efficiency, and would be selected based on the intended usage.
A sorted array is slow for additions and removals because, on average, these operations require shifting half of the array to make room for a new element (or, respectively, to fill in an emptied cell).
However, on many architectures, the simplicity of the data structure and the speed of shifting means that the data structure is fine for "small" data sets.
I have to store the sorted data in a data structure.
The data structure I want to use is heap or binary search tree.
But I am confused which one would better serve the requirement i.e. fast and efficient searching.
----MORE DETAILS---
I am designing an application that receive data from a source(say a data grid) and then store it into a data structure. The data that comes from data GRID station is in the form of sorted digits. The sorted data can be in ascending or descending order.
now I have to search the data. and the process should be efficient and fast.
A heap will only let you search quickly for the minimum element (find it in O(1) time, remove it in O(log n) time). If you design it the other way, it will let you find the maximum, but you don't get both. To search for arbitrary elements quickly (in O(log n) time), you'll want the binary search tree.
For efficient searching, one would definitely prefer a binary search tree.
To search for a value in a heap may require that you search the entire tree - you can't guarantee that some value may not appear on either the left or right subtree (unless one of the children is already greater than the target value, but this isn't guaranteed to happen).
So searching in a heap takes O(n), where-as it takes O(log n) in a (self-balancing) binary search tree.
A heap is only really preferred if you're primarily interested in finding and/or removing the minimum / maximum, along with insertions.
Either can be constructed in O(n) if you're given already-sorted data.
You mentioned a sorted data structure, but in the "more details" in your question I don't really see that a sorted data structure is required (it doesn't matter too much that that's the form in which your data is already in), but it really depends on exactly what type of queries you will do.
If you're only going to search for exact values, you don't really need a sorted data structure, and can use a hash table instead, which supports expected O(1) lookups.
Let me make a list of potential data structures and we'll elaborate:
Binary search tree - it contains sorted data so adding new elements is costly (O(log n) I think). When you search through it you can use the binary search which is O(log n). IT is memory efficient and it doesn't need much additional memory.
Hash table (http://en.wikipedia.org/wiki/Hash_table) - every element is stored with a Hash. You can get element by providing the hash. Your elements don't need to be sortable, they only need to provide hashing method. Accessing elements is O(1) which I suppose is pretty decent one :)
I myself usually use hashtables but it depends on what exactly you need to store and how often you add or delete elements.
Check this also: Advantages of Binary Search Trees over Hash Tables
So in my opinion out of heap and binary search list, use Hash table.
I would go with hash table with separate chaining with an AVLTree (I assume collision occurs). It will work better than O(logn) where n is number of items. After getting the index with hash function, m items will be in this index where m is less than or equal to n. (It is usually much smaller, but never more).
O(1) for hashing and O(logm) for searching in AVLTree. This is faster than binary search for sorted data.
Backstory (skip to second-to-last paragraph for data structure part): I'm working on a compression algorithm (of the LZ77 variety). The algorithm boils down to finding the longest match between a given string and all strings that have already been seen.
To do this quickly, I've used a hash table (with separate chaining) as recommended in the DEFLATE spec: I insert every string seen so far one at a time (one per input byte) with m slots in the chain for each hash code. Insertions are fast (constant-time with no conditional logic), but searches are slow because I have to look at O(m) strings to find the longest match. Because I do hundreds of thousands of insertions and tens of thousands of lookups in a typical example, I need a highly efficient data structure if I want my algorithm to run quickly (currently it's too slow for m > 4; I'd like an m closer to 128).
I've implemented a special case where m is 1, which runs very fast buts offers only so-so compression. Now I'm working on an algorithm for those who'd prefer improved compression ratio over speed, where the larger m is, the better the compression gets (to a point, obviously). Unfortunately, my attempts so far are too slow for the modest gains in compression ratio as m increases.
So, I'm looking for a data structure that allows very fast insertion (since I do more insertions than searches), but still fairly fast searches (better than O(m)). Does an O(1) insertion and O(log m) search data structure exist? Failing that, what would be the best data structure to use? I'm willing to sacrifice memory for speed. I should add that on my target platform, jumps (ifs, loops, and function calls) are very slow, as are heap allocations (I have to implement everything myself using a raw byte array in order to get acceptable performance).
So far, I've thought of storing the m strings in sorted order, which would allow O(log m) searches using a binary search, but then the insertions also become O(log m).
Thanks!
You might be interested in this match-finding structure :
http://encode.ru/threads/1393-A-proposed-new-fast-match-searching-structure
It's O(1) insertion time and O(m) lookup. But (m) is many times lower than a standard Hash Table for an equivalent match finding result. As en example, with m=4, this structure gets equivalent results than an 80-probes hash table.
You might want to consider using a trie (aka prefix tree) instead of a hash table.
For your particular application, you might be able to additionally optimize insertion. If you know that after inserting ABC you're likely to insert ABCD, then keep a reference to the entry created for ABC and just extend it with D---no need to repeat the lookup of the prefix.
One common optimization in hash tables is to move the item you just found to the head of the list (with the idea that it's likely to be used again soon for that bucket). Perhaps you can use a variation of this idea.
If you do all of your insertions before you do your lookups, you can add a bit to each bucket that says whether the chain for that bucket is sorted. On each lookup, you can check the bit to see if the bucket is sorted. If not, you would sort the bucket and set the bit. Once the bucket is sorted, each lookup is O(lg m).
If you interleave insertions and lookups, you could have 2 lists for each bucket: one that's sorted and on that isn't. Inserts would always go to the non-sorted list. A lookup would first check the sorted list, and only if it's not there would it look in the non-sorted list. When it's found in the non-sorted list you would remove it and put it in the sorted list. This way you only pay to sort items that you lookup.
I need a data structure that can insert elements and sort itself as quickly as possible. I will be inserting a lot more than sorting. Deleting is not much of a concern and nethier is space. My specific implementation will additionally store nodes in an array, so lookup will be O(1), i.e. you don't have to worry about it.
If you're inserting a lot more than sorting, then it may be best to use an unsorted list/vector, and quicksort it when you need it sorted. This keeps inserts very fast. The one1 drawback is that sorting is a comparatively lengthy operation, since it's not amortized over the many inserts. If you depend on relatively constant time, this can be bad.
1 Come to think of it, there's a second drawback. If you underestimate your sort frequency, this could quickly end up being overall slower than a tree or a sorted list. If you sort after every insert, for instance, then the insert+quicksort cycle would be a bad idea.
Just use one of the self-balanced binary search trees, such as red-black tree.
Use any of the Balanced binary trees like AVL trees. It should give O(lg N) time complexity for both of the operations you are looking for.
If you don't need random access into the array, you could use a Heap.
Worst and average time complexity:
O(log N) insertion
O(1) read largest value
O(log N) to remove the largest value
Can be reconfigured to give smallest value instead of largest. By repeatedly removing the largest/smallest value you get a sorted list in O(N log N).
If you can do a lot of inserts before each sort then obviously you should just append the items and sort no sooner than you need to. My favorite is merge sort. That is O(N*Log(N)), is well behaved, and has a minimum of storage manipulation (new, malloc, tree balancing, etc.)
HOWEVER, if the values in the collection are integers and reasonably dense, you can use an O(N) sort, where you just use each value as an index into a big-enough array, and set a boolean TRUE at that index. Then you just scan the whole array and collect the indices that are TRUE.
You say you're storing items in an array where lookup is O(1). Unless you're using a hash table, that suggests your items may be dense integers, so I'm not sure if you even have a problem.
Regardless, memory allocating/deleting is expensive, and you should avoid it by pre-allocating or pooling if you can.
I had some good experience for that kind of task using a Skip List
At least in my case it was about 5 times faster compared to adding everything to a list first and then running a sort over it at the end.
If I have a sorted list (say quicksort to sort), if I have a lot of values to add, is it better to suspend sorting, and add them to the end, then sort, or use binary chop to place the items correctly while adding them. Does it make a difference if the items are random, or already more or less in order?
If you add enough items that you're effectively building the list from scratch, you should be able to get better performance by sorting the list afterwards.
If items are mostly in order, you can tweak both incremental update and regular sorting to take advantage of that, but frankly, it usually isn't worth the trouble. (You also need to be careful of things like making sure some unexpected ordering can't make your algorithm take much longer, q.v. naive quicksort)
Both incremental update and regular list sort are O(N log N) but you can get a better constant factor sorting everything afterward (I'm assuming here that you've got some auxiliary datastructure so your incremental update can access list items faster than O(N)...). Generally speaking, sorting all at once has a lot more design freedom than maintaining the ordering incrementally, since incremental update has to maintain a complete order at all times, but an all-at-once bulk sort does not.
If nothing else, remember that there are lots of highly-optimized bulk sorts available.
Usually it's far better to use a heap. in short, it splits the cost of maintaining order between the pusher and the picker. Both operations are O(log n), instead of O(n log n), like most other solutions.
If you're adding in bunches, you can use a merge sort. Sort the list of items to be added, then copy from both lists, comparing items to determine which one gets copied next. You could even copy in-place if resize your destination array and work from the end backwards.
The efficiency of this solution is O(n+m) + O(m log m) where n is the size of the original list, and m is the number of items being inserted.
Edit: Since this answer isn't getting any love, I thought I'd flesh it out with some C++ sample code. I assume that the sorted list is kept in a linked list rather than an array. This changes the algorithm to look more like an insertion than a merge, but the principle is the same.
// Note that itemstoadd is modified as a side effect of this function
template<typename T>
void AddToSortedList(std::list<T> & sortedlist, std::vector<T> & itemstoadd)
{
std::sort(itemstoadd.begin(), itemstoadd.end());
std::list<T>::iterator listposition = sortedlist.begin();
std::vector<T>::iterator nextnewitem = itemstoadd.begin();
while ((listposition != sortedlist.end()) || (nextnewitem != itemstoadd.end()))
{
if ((listposition == sortedlist.end()) || (*nextnewitem < *listposition))
sortedlist.insert(listposition, *nextnewitem++);
else
++listposition;
}
}
I'd say, let's test it! :)
I tried with quicksort, but sorting an almost sorting array with quicksort is... well, not really a good idea. I tried a modified one, cutting off at 7 elements and using insertion sort for that. Still, horrible performance. I switched to merge sort. It might need quite a lot of memory for sorting (it's not in-place), but the performance is much better on sorted arrays and almost identical on random ones (the initial sort took almost the same time for both, quicksort was only slightly faster).
This already shows one thing: The answer to your questions depends strongly on the sorting algorithm you use. If it will have poor performance on almost sorted lists, inserting at the right position will be much faster than adding at the end and then re-sorting it; and merge sort might be no option for you, as it might need way too much external memory if the list is huge. BTW I used a custom merge sort implementation, that only uses 1/2 of external storage to the naive implementation (which needs as much external storage as the array size itself).
If merge sort is no option and quicksort is no option for sure, the best alternative is probably heap sort.
My results are: Adding the new elements simply at the end and then re-sorting the array was several magnitudes faster than inserting them in the right position. However, my initial array had 10 mio elements (sorted) and I was adding another mio (unsorted). So if you add 10 elements to an array of 10 mio, inserting them correctly is much faster than re-sorting everything. So the answer to your question also depends on how big the initial (sorted) array is and how many new elements you want to add to it.
In principle, it's faster to create a tree than to sort a list. The tree inserts are O(log(n)) for each insert, leading to overall O(nlog(n)). Sorting in O(nlog(n)).
That's why Java has TreeMap, (in addition to TreeSet, TreeList, ArrayList and LinkedList implementations of a List.)
A TreeSet keeps things in object comparison order. The key is defined by the Comparable interface.
A LinkedList keeps things in the insertion order.
An ArrayList uses more memory, is faster for some operations.
A TreeMap, similarly, removes the need to sort by a key. The map is built in key order during the inserts and maintained in sorted order at all times.
However, for some reason, the Java implementation of TreeSet is quite a bit slower than using an ArrayList and a sort.
[It's hard to speculate as to why it would be dramatically slower, but it is. It should be slightly faster by one pass through the data. This kind of thing is often the cost of memory management trumping the algorithmic analysis.]
It's about the same. Inserting an item into a sorted list is O(log N), and doing this for every element in the list, N, (thus building the list) would be O(N log N) which is the speed of quicksort (or merge sort which is closer to this approach).
If you instead inserted them onto the front it would be O(1), but doing a quicksort after, it would still be O(N log N).
I would go with the first approach, because it has the potential to be slightly faster. If the initial size of your list, N, is much greater than the number of elements to insert, X, then the insert approach is O(X log N). Sorting after inserting to the head of the list is O(N log N). If N=0 (IE: your list is initially empty), the speed of inserting in sorted order, or sorting afterwards are the same.
Inserting an item into a sorted list takes O(n) time, not O(log n) time. You have to find the place to put it, taking O(log n) time. But then you have to shift over all the elements - taking O(n) time. So inserting while maintaining sorted-ness is O(n ^ 2), where as inserting them all and then sorting is O(n log n).
Depending on your sort implementation, you can get even better than O(n log n) if the number of inserts is much smaller than the list size. But if that is the case, it doesn't matter either way.
So do the insert all and sort solution if the number of inserts is large, otherwise it probably won't matter.
If the list is a) already sorted, and b) dynamic in nature, then inserting into a sorted list should always be faster (find the right place (O(n)) and insert (O(1))).
However, if the list is static, then a shuffle of the remainder of the list has to occur (O(n) to find the right place and O(n) to slide things down).
Either way, inserting into a sorted list (or something like a Binary Search Tree) should be faster.
O(n) + O(n) should always be faster than O(N log n).
At a high level, it's a pretty simple problem, because you can think of sorting as just iterated searching. When you want to insert an element into an ordered array, list, or tree, you have to search for the point at which to insert it. Then you put it in, at hopefully low cost. So you could think of a sort algorithm as just taking a bunch of things and, one by one, searching for the proper position and inserting them. Thus, an insertion sort (O(n* n)) is an iterated linear search (O(n)). Tree, heap, merge, radix, and quick sort (O(n*log(n))) can be thought of as iterated binary search (O(log(n))). It is possible to have an O(n) sort, if the underlying search is O(1) as in an ordered hash table. (An example of this is sorting 52 cards by flinging them into 52 bins.)
So the answer to your question is, inserting things one at a time, versus saving them up and then sorting them should not make much difference, in a big-O sense. You could of course have constant factors to deal with, and those might be significant.
Of course, if n is small, like 10, the whole discussion is silly.
You should add them before and then use a radix sort this should be optimal
http://en.wikipedia.org/wiki/Radix_sort#Efficiency
(If the list you're talking about is like C# List<T>.) Adding some values to right positions into a sorted list with many values is going to require less operations. But if the number of values being added becomes large, it will require more.
I would suggest using not a list but some more suitable data structure in your case. Like a binary tree, for example. A sorted data structure with minimal insertion time.
If this is .NET and the items are integers, it's quicker to add them to a Dictionary (or if you're on .Net 3.0 or above use the HashSet if you don't mind losing duplicates)This gives you automagic sorting.
I think that strings would work the same way as well. The beauty is you get O(1) insertion and sorting this way.
Inserting an item into a sorted list is O(log n), while sorting a list is O(n log N)
Which would suggest that it's always better to sort first and then insert
But remeber big 'O' only concerns the scaling of the speed with number of items, it might be that for your application an insert in the middle is expensive (eg if it was a vector) and so appending and sorting afterward might be better.