Find the index of an element inside a collection: which collection to use?

I have a problem choosing the right data structure(s); these are the requirements:
I must be able to insert and delete elements.
I must also be able to get the index of an element in the collection (its order in the collection).
Elements have a unique identifier number.
I can sort (if necessary) the elements using any criterion.
Ordering is not really a must; the important thing is getting the index of an element, no matter how it is implemented internally, but I think the best approach is ordering anyway.
The index of an element is its order inside the collection, so some kind of order has to be maintained. When I delete an element, all elements from that point to the end change their order/index.
My first approach was a linked list, but I don't want O(n) lookups.
I have also thought about using an ordered dictionary; that would give O(log n) for lookup/insert/delete, wouldn't it?
Is there a better approach? I know a trie would give O(1) for the common operations (independent of the number of elements), but I don't see how to get the index of an element; I would have to iterate over the trie, which would be O(n). Am I wrong?

Sounds like you want an ordered data structure, i.e. a (balanced) BST. Insertion and deletion would indeed be O(lg n), which suffices for many applications. If you also want elements to have an index in the structure, then you'd want an order statistic tree (see e.g., CLR, Introduction to Algorithms, chapter 14) which provides this operation in O(lg n). Dynamically re-sorting the entire collection would be O(n lg n).
If by "order in the collection" you mean any random order is good enough, then just use a dynamic array (vector): amortized O(1) append, O(1) delete (by swapping with the last element), O(n lg n) in-place sort, but O(n) lookup until you do the sort, after which lookup becomes O(lg n) with binary search. Deletion would be O(n) if the data is to remain sorted, though.
If your data is string-like, you might be able to extend a trie in the same way that a BST is extended to become an order statistic tree.
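To illustrate the order statistic idea, here is a minimal Python sketch of a size-augmented BST supporting rank (index-of-element) queries. It is unbalanced for brevity and assumes distinct keys; a real order statistic tree layers the same augmentation onto a red-black or AVL tree:

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None
            self.size = 1  # number of nodes in this subtree

    def size(node):
        return node.size if node else 0

    def insert(root, key):
        # Plain BST insert, updating subtree sizes on the way back up.
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        else:
            root.right = insert(root.right, key)
        root.size = 1 + size(root.left) + size(root.right)
        return root

    def rank(root, key):
        # 0-based index of `key` in sorted order, or -1 if absent.
        # O(height), i.e. O(lg n) when the tree is balanced.
        r = 0
        while root:
            if key < root.key:
                root = root.left
            elif key > root.key:
                r += size(root.left) + 1
                root = root.right
            else:
                return r + size(root.left)
        return -1

Deletion would symmetrically decrement the sizes along the search path, which is how every element "after" the removed one implicitly shifts its index without any O(n) work.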

You don't mention an array/vector here, but it meets most of these criteria.
(Note that "Elements have a unique identifier number" is really irrespective of the data structure; does this mean the same thing as the index? Or is it an immutable key, which is more a function of the data you're putting into the structure...)
There are going to be timing tradeoffs in any scenario: you say a linked list is O(n), but O(n) for what? You don't really get into your performance requirements for additions vs. deletes vs. searches; which is more important?

Well, if your collection is sorted, you don't need O(n) to find elements. You can use binary search, for example, to determine the index of an element. It's also possible to write a simple wrapper around each entry in your array that remembers its index inside the collection.
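For example, a small sketch using Python's standard bisect module: if the collection is kept sorted (here by the unique id), the index of an element falls out of a binary search in O(log n):

    import bisect

    items = [3, 8, 15, 23, 42]  # kept sorted by unique id

    def index_of(items, value):
        i = bisect.bisect_left(items, value)
        if i < len(items) and items[i] == value:
            return i
        raise ValueError(f"{value} not found")

    print(index_of(items, 23))  # -> 3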

Related

Data structure for fast and efficient search

I have to store sorted data in a data structure.
The data structures I am considering are a heap and a binary search tree.
But I am confused about which one would better serve the requirement, i.e., fast and efficient searching.
---- MORE DETAILS ----
I am designing an application that receives data from a source (say, a data grid) and then stores it in a data structure. The data that comes from the grid station is a stream of sorted numbers, in either ascending or descending order.
Now I have to search the data, and the process should be efficient and fast.
A heap will only let you search quickly for the minimum element (find it in O(1) time, remove it in O(log n) time). If you build it the other way around (a max-heap), it will let you find the maximum, but you don't get both. To search for arbitrary elements quickly (in O(log n) time), you'll want the binary search tree.
For efficient searching, one would definitely prefer a binary search tree.
To search for a value in a heap, you may have to examine the entire tree: you cannot rule out a value appearing in either the left or the right subtree (unless one of the children is already greater than the target value, but that isn't guaranteed to happen).
So searching in a heap takes O(n), whereas it takes O(log n) in a (self-balancing) binary search tree.
A heap is only really preferred if you're primarily interested in finding and/or removing the minimum / maximum, along with insertions.
Either can be constructed in O(n) if you're given already-sorted data.
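As a Python sketch of that last claim on the BST side (assuming the input list is already sorted): recursively picking the middle element as the root builds a balanced BST in O(n). On the heap side, an ascending sorted array is already a valid min-heap, so there is nothing to do.

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def build_balanced(vals, lo=0, hi=None):
        # Each element is visited exactly once, hence O(n) overall.
        if hi is None:
            hi = len(vals)
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        return Node(vals[mid],
                    build_balanced(vals, lo, mid),
                    build_balanced(vals, mid + 1, hi))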
You mentioned a sorted data structure, but from the "more details" in your question I don't see that one is actually required (it matters little that that's the form your data already arrives in); it really depends on exactly what type of queries you will do.
If you're only going to search for exact values, you don't really need a sorted data structure, and can use a hash table instead, which supports expected O(1) lookups.
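A minimal illustration in Python, assuming exact-value queries only:

    values = {17, 42, 99, 3}   # hash-based set; order is irrelevant
    print(42 in values)        # True, expected O(1)
    print(50 in values)        # False, expected O(1)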
Let me make a list of potential data structures and we'll elaborate:
Binary search tree - it keeps its data sorted, so adding new elements is costly (O(log n), I think). When you search through it you can use binary search, which is O(log n). It is memory efficient and doesn't need much additional memory.
Hash table (http://en.wikipedia.org/wiki/Hash_table) - every element is stored by its hash. You retrieve an element by providing its key, which is hashed to locate it. Your elements don't need to be sortable; they only need to provide a hashing method. Accessing elements is expected O(1), which I suppose is a pretty decent one :)
I myself usually use hash tables, but it depends on what exactly you need to store and how often you add or delete elements.
Check this also: Advantages of Binary Search Trees over Hash Tables
So in my opinion, out of the heap and the binary search tree, use the hash table.
I would go with a hash table using separate chaining with an AVL tree per bucket (assuming collisions occur). It will work better than O(log n), where n is the number of items: after the hash function picks a bucket, only m items land there, where m is less than or equal to n (usually much smaller, never more).
That gives O(1) for hashing plus O(log m) for searching the AVL tree, which is faster than binary search over all the sorted data.
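A hedged Python sketch of the chaining idea: the answer proposes an AVL tree per bucket; for brevity this sketch keeps each bucket as a sorted list searched with bisect, which preserves the O(log m) lookup within a bucket (though a list bucket pays O(m) on insert, unlike an AVL tree):

    import bisect

    class ChainedHash:
        def __init__(self, nbuckets=16):
            self.buckets = [[] for _ in range(nbuckets)]

        def _bucket(self, key):
            return self.buckets[hash(key) % len(self.buckets)]

        def insert(self, key):
            b = self._bucket(key)            # O(1) hashing
            i = bisect.bisect_left(b, key)
            if i == len(b) or b[i] != key:
                b.insert(i, key)             # keeps the bucket sorted

        def contains(self, key):
            b = self._bucket(key)            # O(1) hashing
            i = bisect.bisect_left(b, key)   # O(log m) within the bucket
            return i < len(b) and b[i] == key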

What is faster: sort n elements, or insert n elements one by one in correct place?

Generally, what is better: insert N elements in some collection and then sort it, or find out the correct place for the element before insert and insert it exactly in that place (repeat N times)?
It is very dependent on the data structure and application in use.
Note that inserting an element in an array requires shifting all the following elements to the right, which results in O(n) insertion.
A binary search tree, however, allows insertion in O(log n), but is less cache-efficient than an array, and thus slower in practice.
On the other hand, inserting everything and then sorting results in high latency after the last element is inserted [the O(n log n) sort].
Also, if you are going to query very often but add elements seldom, you want to avoid sorting too often, and keeping elements in order is an easy way to achieve this.
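A rough way to compare the two strategies in Python (the outcome depends heavily on n and on the container, so treat this as a sketch rather than a benchmark):

    import bisect
    import random
    import timeit

    data = [random.random() for _ in range(5_000)]

    def sort_after():
        xs = []
        for v in data:
            xs.append(v)          # amortized O(1) per element
        xs.sort()                 # one O(n log n) sort at the end
        return xs

    def insert_in_place():
        xs = []
        for v in data:
            bisect.insort(xs, v)  # O(log n) search + O(n) shift each time
        return xs

    print(timeit.timeit(sort_after, number=3))
    print(timeit.timeit(insert_in_place, number=3))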
Without making any assumptions about the input, sorting takes O(n log n) time, while inserting a single element into its correct place takes O(n) time (e.g., in a linked list), so n such insertions take O(n^2). Sorting after all insertions is therefore faster.
This is better:
"insert N elements in some collection and then sort it"
because one can use an O(n log n) algorithm to sort, whereas this:
"find out the correct place for the element before insert and insert it exactly in that place (repeat N times)"
is insertion sort, known for a worst case of O(n^2).
That said, insertion sort is an online algorithm, i.e., you don't need to have all the data in advance to start sorting. Consider data being generated by some other program running in parallel: there, insertion sort makes sense, whereas the other approach requires the whole data set to be in place.
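For reference, a classic insertion sort in Python, which consumes its input online but does O(n^2) comparisons and shifts in the worst case:

    def insertion_sort(xs):
        for i in range(1, len(xs)):
            v = xs[i]
            j = i - 1
            while j >= 0 and xs[j] > v:
                xs[j + 1] = xs[j]  # shift larger elements one step right
                j -= 1
            xs[j + 1] = v
        return xs

    print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]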
Depends on your data structure. Note though that to find out the right place before insertion you have to sort your elements somehow, which requires them to be already stored in a data structure. Can you provide additional information on what you are trying to achieve?

sorting algorithm suitable for a sorted list

I have a sorted list at hand. Now I add a new element to the end of the list. Which sorting algorithm is suitable for such a scenario?
Quicksort has a worst-case time complexity of O(n^2) when the list is already sorted. Does this mean the time complexity, if quicksort is used in the above case, will be close to O(n^2)?
If you are adding just one element, find the position where it should be inserted and put it there. For an array, you can do a binary search in O(log N) time and insert in O(N). For a linked list, you'll have to do a linear search, which takes O(N) time, but then insertion is O(1).
As for your question on quicksort: if you choose the first value as your pivot, then yes, it will be O(N^2) in your case. Choose a random pivot and your case will still be O(N log N) on average. However, the method I suggest above is both easier to implement and faster in your specific case.
It depends on the implementation of the underlying list.
It seems to me that insertion sort will fit your needs, except in the case where the list is implemented as an array list; in that case too many moves would be required.
Rather than appending to the end of the list, you should do an insert operation.
That is, when adding 5 to [1,2,3,4,7,8,9], you'd want to "insert" it where it belongs in the sorted list, instead of appending it at the end and then re-sorting the whole list.
You can quickly find the position to insert the item by using a binary search.
This is basically how insertion sort works, except it operates on the entire list. This method will have better performance than even the best sorting algorithm, for a single item. It may also be faster than appending at the end of the list, depending on your implementation.
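In Python, the standard bisect module does exactly this (binary search for the slot, then an O(n) shift on an array-backed list):

    import bisect

    xs = [1, 2, 3, 4, 7, 8, 9]
    bisect.insort(xs, 5)   # O(log n) search + O(n) shift
    print(xs)              # [1, 2, 3, 4, 5, 7, 8, 9]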
I'm assuming you're using an array, since you talk about quicksort, so just adding an element would involve finding the place to insert it (O(log n)) and then actually inserting it (O(n)) for a total cost of O(n). Just appending it to the end and then resorting the entire list is definitely the wrong way to go.
However, if this is to be a frequent operation (i.e. if you have to keep adding elements while maintaining the sorted property) you'll incur an O(n^2) cost of adding another n elements to the list. If you change your representation to a balanced binary tree, that drops to O(n log n) for another n inserts, but finding an element by index will become O(n). If you never need to do this, but just iterate over the elements in order, the tree is definitely the way to go.
Of possible interest is the indexable skiplist which, for a slight storage cost, has O(log n) inserts, deletes, searches and lookups-by-index. Give it a look, it might be just what you're looking for here.
What exactly do you mean by "list" ? Do you mean specifically a linked list, or just some linear (sequential) data structure like an array?
If it's linked list, you'll need a linear search for the correct position. The insertion itself can be done in constant time.
If it's something like an array, you can add to the end and sort, as you mentioned. A sorted collection is only bad for quicksort if the quicksort is really badly implemented. If you select your pivot with the typical median-of-3 algorithm, a sorted list will give optimal performance.

Fastest data structure for inserting/sorting

I need a data structure that can insert elements and sort itself as quickly as possible. I will be inserting a lot more than sorting. Deleting is not much of a concern, and neither is space. My specific implementation will additionally store nodes in an array, so lookup will be O(1), i.e., you don't have to worry about it.
If you're inserting a lot more than sorting, then it may be best to use an unsorted list/vector and quicksort it when you need it sorted. This keeps inserts very fast. The one[1] drawback is that sorting is a comparatively lengthy operation, since it's not amortized over the many inserts. If you depend on relatively constant time, this can be bad.
[1] Come to think of it, there's a second drawback. If you underestimate your sort frequency, this could quickly end up being slower overall than a tree or a sorted list. If you sort after every insert, for instance, then the insert+quicksort cycle would be a bad idea.
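A minimal Python sketch of that strategy (list.sort() standing in for the quicksort mentioned in the answer): appends stay amortized O(1), and the O(n log n) sort runs only when a sorted view is requested and the data has changed since the last sort:

    class LazySortedList:
        def __init__(self):
            self._items = []
            self._dirty = False

        def insert(self, x):
            self._items.append(x)   # amortized O(1)
            self._dirty = True

        def sorted_items(self):
            if self._dirty:
                self._items.sort()  # O(n log n), paid only when needed
                self._dirty = False
            return self._items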
Just use one of the self-balanced binary search trees, such as red-black tree.
Use any of the balanced binary trees, like AVL trees. They give O(lg N) time complexity for both of the operations you are looking for.
If you don't need random access into the array, you could use a Heap.
Worst and average time complexity:
O(log N) insertion
O(1) read largest value
O(log N) to remove the largest value
Can be reconfigured to give smallest value instead of largest. By repeatedly removing the largest/smallest value you get a sorted list in O(N log N).
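Using Python's heapq module as this answer describes (heapq is a min-heap, the mirror image of the max-heap listed above):

    import heapq

    h = []
    for v in [5, 1, 4, 2, 3]:
        heapq.heappush(h, v)   # O(log N) per insert

    print(h[0])                # 1: read the smallest value in O(1)
    print([heapq.heappop(h) for _ in range(len(h))])  # [1, 2, 3, 4, 5]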
If you can do a lot of inserts before each sort then obviously you should just append the items and sort no sooner than you need to. My favorite is merge sort: it is O(N log N), well behaved, and involves a minimum of storage manipulation (new, malloc, tree balancing, etc.).
HOWEVER, if the values in the collection are integers and reasonably dense, you can use an O(N) sort, where you just use each value as an index into a big-enough array, and set a boolean TRUE at that index. Then you just scan the whole array and collect the indices that are TRUE.
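A Python sketch of that dense-integer trick (sometimes called a bitmap sort); it assumes non-negative integers in a known range, and note that with plain booleans duplicates collapse to a single occurrence:

    def bitmap_sort(values, max_value):
        seen = [False] * (max_value + 1)
        for v in values:
            seen[v] = True   # O(1) per value
        # One O(max_value) scan collects the indices marked True.
        return [i for i, flag in enumerate(seen) if flag]

    print(bitmap_sort([9, 3, 7, 1], 10))  # [1, 3, 7, 9]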
You say you're storing items in an array where lookup is O(1). Unless you're using a hash table, that suggests your items may be dense integers, so I'm not sure if you even have a problem.
Regardless, memory allocating/deleting is expensive, and you should avoid it by pre-allocating or pooling if you can.
I have had good experience with a skip list for that kind of task.
At least in my case, it was about five times faster compared to adding everything to a list first and then running a sort over it at the end.

Time complexity for search and insert operations in sorted and unsorted arrays that include duplicate values

1) For a sorted array I have used binary search.
We know that the worst-case complexity for the SEARCH operation in a sorted array is O(lg N) if we use binary search, where N is the number of items in the array.
What is the worst-case complexity for the search operation in an array that includes duplicate values, using binary search?
Will it be the same O(lg N)? Please correct me if I am wrong!
Also, what is the worst case for the INSERT operation in a sorted array using binary search?
My guess is O(N)... is that right?
2) For an unsorted array I have used linear search.
Now we have an unsorted array that also accepts duplicate elements/values.
What are the worst-case complexities for the SEARCH and INSERT operations?
I think we can use linear search, which gives O(N) worst-case time for both the search and delete operations.
Can we do better than this for an unsorted array, and does the complexity change if we accept duplicates in the array?
Yes.
The best case is uninteresting. (Think about why that might be.) The worst case is O(N), except for inserts. Inserts into an unsorted array are fast, one operation. (Again, think about it if you don't see it.)
In general, duplicates make little difference, except for extremely pathological distributions.
Some help on the way - but not the entire solution.
A best case for a binary search is if the item searched for is the first pivot element. The worst case is when having to drill down all the way to two adjacent elements and still not finding what you are looking for. Does this change if there are duplicates in the data? Inserting data into a sorted array includes shuffling away all data with a higher sort order "one step to the right". The worst case is that you insert an item that has lower sort order than any existing item.
When searching an unsorted array there is no choice but linear search, as you suggest yourself. If you don't care about the sort order, there is a much quicker, simpler way to perform the insert. Delete can be thought of as first searching and then removing.
We can do better at deleting from an unordered array! Since order doesn't matter in this case, we can swap the element to be deleted with the last element, which avoids the unnecessary shifting of elements in the array. That makes deletion O(1).
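A tiny Python sketch of that swap-with-last trick (the index of the element to remove is assumed to be known already; finding it is still the O(N) linear search discussed above):

    def swap_remove(xs, i):
        xs[i] = xs[-1]   # overwrite with the last element
        xs.pop()         # drop the now-duplicated tail, O(1)

    xs = [4, 9, 2, 7]
    swap_remove(xs, 1)   # remove the 9
    print(xs)            # [4, 7, 2]: order is not preserved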
