Big O Notation Arrays vs. Linked List insertions - data-structures

Big O Notation Arrays vs. Linked List insertions:
According to academic literature, for arrays it is constant, O(1), and for linked lists it is linear, O(n).
Indexing into an array takes only one multiplication and one addition.
A linked list, which is not laid out in contiguous memory, requires traversal.
The question is: do O(1) and O(n) actually describe the indexing/search costs for arrays and linked lists, respectively, rather than the insertion costs?

O(1) accurately describes inserting at the end of the array. However, if you're inserting into the middle of an array, you have to shift all the elements after that position, so the complexity for insertion in that case is O(n) for arrays. Appending at the end also ignores the case where you'd have to resize the array if it's full.
For a linked list, you have to traverse the list to do middle insertions, so that's O(n). You don't have to shift elements down, though.
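To make the shifting cost concrete, here is a minimal C++ sketch (my own illustration, not part of the original answer) of inserting into the middle of a dynamic array: every element after the insertion point has to move one slot.

#include <vector>
#include <cstddef>

// Illustrative middle insertion: place v at index pos of a dynamic array.
// Everything from pos onward shifts one slot to the right, hence O(n).
void insert_at(std::vector<int>& a, std::size_t pos, int v) {
    a.push_back(0);                      // grow by one (may trigger a resize)
    for (std::size_t i = a.size() - 1; i > pos; --i)
        a[i] = a[i - 1];                 // shift the tail to the right
    a[pos] = v;
}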
There's a nice chart on wikipedia with this: http://en.wikipedia.org/wiki/Linked_list#Linked_lists_vs._dynamic_arrays
                               Linked list          Array   Dynamic array     Balanced tree
Indexing                       Θ(n)                 Θ(1)    Θ(1)              Θ(log n)
Insert/delete at beginning     Θ(1)                 N/A     Θ(n)              Θ(log n)
Insert/delete at end           Θ(1)                 N/A     Θ(1) amortized    Θ(log n)
Insert/delete in middle        search time + Θ(1)   N/A     Θ(n)              Θ(log n)
Wasted space (average)         Θ(n)                 0       Θ(n)              Θ(n)

Assuming you are talking about an insertion where you already know the insertion point, i.e. this does not take into account the traversal of the list to find the correct position:
Insertions in an array depend on where you are inserting, as you will need to shift the existing values. Worst case (inserting at array[0]) is O(n).
Insertion in a list is O(1) because you only need to modify next/previous pointers of adjacent items.
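To illustrate that, here is a minimal singly linked list sketch (my own code, not from the answer above): given a pointer to the node after which the new element goes, the insertion itself is a constant number of pointer updates.

// Minimal singly linked list node for illustration.
struct Node {
    int value;
    Node* next;
};

// O(1): only the neighbouring pointer is touched; nothing is shifted.
void insert_after(Node* prev, int v) {
    Node* n = new Node{v, prev->next};
    prev->next = n;
}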

Insertion into arrays, I'd imagine, is slower. Sure, you have to traverse a linked list, but you have to allocate memory, save the existing elements, and deallocate the old memory to insert into an array.

What literature are you referencing? The size of an array is determined when the array is created and never changes afterwards. Inserting really can only take place in free slots at the end of the array. Any other type of insertion may require resizing, and this is certainly not O(1). The size of a linked list is implementation dependent, but it must always be at least big enough to store all of its elements. Elements can be inserted anywhere in the list, and finding the appropriate index requires traversing.

tl;dr: An unsorted array is analogous to a set. Like a set, elements can be added, removed, iterated over, and read. But, as with a set, it makes no sense to talk about inserting an element at a specific position, because doing so would be an attempt to impose a sort order on what is, by definition, unsorted.
According to academic literature for arrays it is constant O(1) and for Linked Lists it is linear O(n).
It is worth understanding why the academic literature quotes array insert as O(1). There are several concepts to understand:
An array is defined as being unsorted (unless explicitly stated otherwise).
The length of an array, defined as the number of elements that the array contains, can be increased or decreased arbitrarily in O(1) time and there is no limit on the maximum size of an array.
(In a real computer this is not the case, due to various factors such as memory size, virtual memory, swap space, etc. But for the purpose of asymptotic analysis these factors are not important - we care about how the running time of the algorithm changes as the input size increases towards infinity, not how it performs on a particular computer with a particular memory size and operating system.)
Insert and delete are O(1) because the array is an unsorted data structure.
Insert is not assignment
Consider what it actually means to add an element to an unsorted data structure. Since there is no defined sorting order, whatever order actually occurs is arbitrary and does not matter. If you think in terms of an object oriented API, the method signature would be something like:
Array.insert(Element e)
Note that this is the same as the insert methods for other data structures, like a linked list or sorted array:
LinkedList.insert(Element e)
SortedArray.insert(Element e)
In all of these cases, the caller of the insert method does not specify where the value being inserted ends up being stored - it is an internal detail of the data structure. Furthermore, it makes no sense for the caller to try and insert an element at a specific location in the data structure - either for a sorted or unsorted data structure. For an (unsorted) linked list, the list is by definition unsorted and therefore the sort order is irrelevant. For a sorted array, the insert operation will, by definition, insert an element at a specific point of the array.
Thus it makes no sense to define an array insert operation as:
Array.insert(Element e, Index p)
With such a definition, the caller would override an internal property of the data structure and impose an ordering constraint on an unsorted array - a constraint that does not exist in the definition of the array, because an array is unsorted.
Why does this misconception occur with arrays and not other data structures? Probably because programmers are used to dealing with arrays using the assignment operator:
array[0] = 10
array[1] = 20
The assignment operator gives the values of an array an explicit order. The important thing to note here is that assignment is not the same as insert:
insert : store the given value in the data structure without modifying existing elements.
insert in unsorted : store the given value in the data structure without modifying existing elements and the retrieval order is not important.
insert in sorted : store the given value in the data structure without modifying existing elements and the retrieval order is important.
assign a[x] = v : overwrite the existing data in location x with the given value v.
An unsorted array has no sort order, and hence insert does not need to allow overriding of the position. insert is not the same thing as assignment. Array insert is simply defined as:
Array.insert(v):
    array.length = array.length + 1
    // in standard algorithmic notation, arrays are defined from 1..n not 0..n-1
    array[array.length] = v
Which is O(1).
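Under those academic assumptions, a minimal C++ sketch of such an unsorted array (my own illustration; a fixed preallocated capacity stands in for the model's unbounded array) makes the O(1) insert visible: there is no position parameter, and insert just writes to the next free slot.

#include <cstddef>
#include <cassert>

// Illustrative unsorted array; the fixed capacity is an assumption of the sketch.
struct UnsortedArray {
    static const std::size_t kCapacity = 1024;
    int data[kCapacity];
    std::size_t length = 0;

    void insert(int v) {          // no index parameter: order is arbitrary
        assert(length < kCapacity);
        data[length++] = v;       // O(1): bump the length, write the slot
    }
};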

Long ago, on a system that had more RAM than disk space, I implemented an indexed linked list that was indexed as data was entered by hand or loaded from disk. Each record was appended to the next index in memory, and the disk file was opened, the record appended to the end, and the file closed.
The program cashiered auction sales on a Radio Shack Model I computer, and the writes to disk were only insurance against power failure and an archived record: to meet time constraints the data had to be fetched from RAM and printed in reverse order, so the buyer could be asked if the first item that came up was the last one he purchased. Each buyer and seller was linked to the last item of theirs that sold, and that item was linked to the item before it. It was a singly linked list that was traversed from the bottom up.
Corrections were made with reversing entries. I used the same method for several things, and I never found a faster system when the method suited the job at hand and the index was saved to disk, so it didn't have to be rebuilt as the file was reloaded into memory, as it might be after a power failure.
Later I wrote a program to edit more conventionally. It could also reorganize the data so it was grouped together.

Related

Time Complexity of Hash Map Traversal

What is the best, average, and worst case time complexity for traversing a hash map, under the assumption that the hash map uses chaining with linked lists?
I've read multiple times that the time complexity is O(m+n) for traversal for all three cases (m = number of buckets, n = number of elements). However, this differs from my own time complexity analysis: in the worst case all elements are linearly chained in the last bucket, which leads to a time complexity of O(m+n). In the best case no hash collisions happen, and therefore the time complexity should be O(m). In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements, which leads to a time complexity of O(m * n/m) = O(n). Is my analysis wrong?
In practice, a good implementation can always achieve O(n). GCC's C++ Standard Library implementation for the hash table containers unordered_map and unordered_set, for example, maintains a forward/singly linked list between the elements inserted into the hash table, wherein elements that currently hash to the same bucket are grouped together in the list. Hash table buckets contain iterators into the singly-linked list for the point where the element before that bucket's colliding elements start (so if erasing an element, the previous link can be rewired to skip over it).
During traversal, only the singly linked list need be consulted - the hash table buckets are not visited. This becomes especially important when the load factor is very low (many elements were inserted, then many were erased, but in C++ the table never reduces in size, so you can end up with a very low load factor).
If instead you have a hash table implementation where each bucket literally maintains a head pointer for its own linked list, then the kind of analysis you attempted comes into play.
You're right about worst case complexity.
In the best case no hash collisions happen and therefore time complexity should be O(m).
It depends. In C++ for example, values/elements are never stored in the hash table buckets (which would waste a huge amount of memory if the values were large in size and many buckets were empty). If instead the buckets contain the "head" pointer/iterator for the list of colliding elements, then even if there's no collision at a bucket, you still have to follow the pointer to a distinct memory area - that's just as bothersome as following a pointer between nodes on the same linked list, and is therefore normally included in the complexity calculation, so it's still O(m + n).
In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements.
No... elements being uniformly distributed across buckets is the best case for a hash table: see above. An "average" or typical case is one where there's more variation in the number of elements hashing to any given bucket. For example, if you have 1 million buckets and 1 million values and a cryptographic-strength hash function, you can statistically expect 1/e (~36.8%) of buckets to be empty, 1/1!e (also ~36.8%) of buckets to have 1 element, 1/2!e (~18.4%) of buckets to have 2 colliding elements, 1/3!e (~6.1%) of buckets to have 3 colliding elements, and so on (the "!" is for factorial).
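For what it's worth, here is a tiny computation (my own, not from the answer) of those fractions under the Poisson model being used: when n keys are hashed uniformly into n buckets, the expected fraction of buckets holding exactly k elements is e^-1 / k!.

#include <cstdio>
#include <cmath>

int main() {
    double factorial = 1.0;
    for (int k = 0; k <= 3; ++k) {
        if (k > 0) factorial *= k;                 // k! built up incrementally
        std::printf("buckets with %d element(s): %.1f%%\n",
                    k, 100.0 * std::exp(-1.0) / factorial);
    }
}
// prints ~36.8%, 36.8%, 18.4%, 6.1% - matching the figures above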
Anyway, the key point is that a naive bucket-visiting hash table traversal (as distinct from actually being able to traverse a list of elements without bucket-visiting), always has to visit all the buckets, then if you imagine each element being tacked onto a bucket somewhere, there's always one extra link to traverse to reach it. Hence O(m+n).
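To make the O(m+n) count concrete, here is a minimal sketch (my own, not GCC's actual implementation) of the naive bucket-visiting traversal: all m buckets are inspected and all n nodes are followed.

#include <vector>
#include <cstddef>

// Naive chained hash table: each bucket owns the head of its own list.
struct HashNode {
    int key;
    HashNode* next;
};

// Traversal touches all m buckets plus all n nodes: Theta(m + n) work.
std::size_t count_elements(const std::vector<HashNode*>& buckets) {
    std::size_t n = 0;
    for (HashNode* head : buckets)                      // m bucket visits
        for (const HashNode* p = head; p; p = p->next)  // n node visits in total
            ++n;
    return n;
}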

Why "delete" operation is considered to be "slow" on a sorted array?

I am currently studying algorithms and data structures with the help of the famous Stanford course by Tim Roughgarden. In video 13-1, when explaining Balanced Binary Search Trees, he compared them to sorted arrays and mentioned that we do not do deletion on a sorted array because it is too slow (I believe he meant "slow in comparison with the other operations, which we can run in constant [Select, Min/Max, Pred/Succ], O(log n) [Search, Rank], and O(n) [Output/print] time").
I cannot stop thinking about this statement. Namely I cannot wrap my mind around the following:
Let's say we are given an order statistic or the value of the item we want to delete from a sorted (ascending) array.
We can most certainly find its position in the array using Select or Search, in constant or O(log n) time respectively.
We can then remove this item and shift each of the items to its right one position to the left, which will take O(n) time.
The whole operation will take linear time, O(n), in the worst case.
Key question: am I thinking about this the wrong way? If not, why is deletion considered slow and undesirable?
You are correct: deleting from an array is slow because you have to move all elements after it one position to the left, so that you can cover the hole you created.
Whether O(n) is considered slow depends on the situation. Deleting from an array is most likely part of a larger, more complex algorithm, e.g. inside a loop. This then would add a factor of n to your final complexity, which is usually bad. Using a tree would only add a factor of log n, and O(n log n) is much better than O(n^2) (asymptotically).
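As a small illustration of that shifting cost (my own sketch, not from either answer), deleting index i from a sorted array means moving every later element one slot to the left:

#include <vector>
#include <cstddef>

// Remove the element at index i (assumed valid) by shifting the suffix left:
// O(n - i) moves, O(n) in the worst case.
void erase_at(std::vector<int>& a, std::size_t i) {
    for (std::size_t j = i + 1; j < a.size(); ++j)
        a[j - 1] = a[j];      // close the hole left by the deleted element
    a.pop_back();             // drop the duplicated last slot
}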
The statement is relative to the specific data structure which is being used to hold the sorted values: A sorted array. This specific data structure would be selected for simplicity, for efficient storage, and for quick searches, but is slow for adding and removing elements from the data structure.
Other data structures which hold sorted values may be selected. For example, a binary tree, or a balanced binary tree, or a trie. Each has different characteristics in terms of operation performance and storage efficiency, and would be selected based on the intended usage.
A sorted array is slow for additions and removals because, on average, these operations require shifting half of the array to make room for a new element (or, respectively, to fill in an emptied cell).
However, on many architectures, the simplicity of the data structure and the speed of shifting means that the data structure is fine for "small" data sets.

Data structure with fast sorted insertion, sorted deletion and lookup

I'm looking for a very specific data structure. Suppose the maximum number of elements is known. All elements are integers. Duplicates are allowed.
The operations are:
Lookup. If I have inserted n elements, a[0] is the smallest element, a[a.length - 1] is the largest, and a[k] is the k-th smallest element. Required runtime: O(1)
Insertion. Performs a sorted insertion: insert(b), where b is an integer. Required runtime: O(log n)
Deletion. delete(i) deletes the i-th element. Required runtime: O(log n)
What kind of data structure is this? My question is language independent, but I'm coding in C++.
I believe such a data structure does not exist. Constant lookup for any element (e.g. indexing) requires contiguous memory, which makes insertion impossible to do in less than O(n) if you want to keep the range sorted.
There are arguments for hash tables and wrappers around hash tables, but there are two things to keep in mind when mentioning them:
Hash tables have average access (insertion, deletion, find) in O(1), but that assumes very few hash collisions. If you wish to meet the requirements with regard to pessimistic (worst-case) complexities, hash tables are out of the question since their pessimistic access time is O(n).
Hash tables are, by their nature, unordered. They, most of the time, do have internal arrays for storing (for example) buckets of data, but the elements themselves are neither all in contiguous memory nor ordered by some property (other than maybe modulo of their hash, which itself should produce very different values for similar objects).
To not leave you empty-handed: if you wish to get as close to your requirements as possible, you need to specify which complexities you would sacrifice to achieve the others. I'd like to propose either std::priority_queue or std::multiset.
std::priority_queue provides access only to the top element - it is guaranteed that it will be either the smallest or the greatest (depending on the comparator you use to specify the relation) element in the collection. Insertion and deletion are both achieved in O(log n) time.
std::multiset* provides access to every element inside it, but at a higher cost - O(log n). It also achieves O(log n) in insertion and deletion.
*Careful - do not confuse with std::set, which does not allow for duplicate elements.
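As a quick usage sketch of the std::multiset suggestion (my own code): insertion and deletion are O(log n), duplicates are allowed, and reaching the k-th smallest element costs O(k) via iterator advance rather than the O(1) indexing the question asks for.

#include <set>
#include <iterator>
#include <iostream>

int main() {
    std::multiset<int> s;
    for (int v : {5, 1, 7, 1, 3}) s.insert(v);   // sorted insertion, O(log n) each

    auto it = s.begin();
    std::advance(it, 2);                         // k-th smallest: O(k), not O(1)
    std::cout << "element at index 2: " << *it << '\n';   // prints 3

    s.erase(s.find(1));                          // delete one occurrence, O(log n)
    for (int v : s) std::cout << v << ' ';       // prints 1 3 5 7
    std::cout << '\n';
}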

How does merge sort have space complexity O(n) for worst case?

O(n) space complexity means that merge sort, in the worst case, uses memory proportional to the number of elements in the initial array. But hasn't it created new arrays while making the recursive calls? Why is that space not counted?
A worst case implementation of top down merge sort could take more space than the original array, if it allocates both halves of the array in mergesort() before making the recursive calls to itself.
A more efficient top down merge sort uses an entry function that does a one time allocation of a temp buffer, passing the temp buffer's address as a parameter to one of a pair of mutually recursive functions that generate indices and merge data between the two arrays.
In the case of a bottom up merge sort, a temp array 1/2 the size of the original array could be used, merging both halves of the array, ending up with the first half of data in the temp array, and the second half in the original array, then doing a final merge back into the original array.
However the space complexity is O(n) in either case, since constants like 2 or 1/2 are ignored for big O.
Merge sort can get by with a single buffer of the same size as the original array.
In the usual version, you perform a merge from the array to the extra buffer and copy back to the array.
In an advanced version, you perform the merges from the array to the extra buffer and conversely, alternately.
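Here is a compact C++ sketch of that "usual version" (my own code, not from the answer): a single temp buffer of size n is allocated once, each merge writes into it and copies back, and that one buffer is the O(n) auxiliary space.

#include <vector>
#include <cstddef>

// Merge the sorted ranges a[lo, mid) and a[mid, hi) through tmp, then copy back.
static void merge_ranges(std::vector<int>& a, std::vector<int>& tmp,
                         std::size_t lo, std::size_t mid, std::size_t hi) {
    std::size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    for (k = lo; k < hi; ++k) a[k] = tmp[k];          // copy back into the array
}

static void sort_range(std::vector<int>& a, std::vector<int>& tmp,
                       std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;                          // 0 or 1 element: already sorted
    std::size_t mid = lo + (hi - lo) / 2;
    sort_range(a, tmp, lo, mid);
    sort_range(a, tmp, mid, hi);
    merge_ranges(a, tmp, lo, mid, hi);
}

void merge_sort(std::vector<int>& a) {
    std::vector<int> tmp(a.size());                   // the single O(n) buffer
    sort_range(a, tmp, 0, a.size());
}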
Note: This answer is wrong, as was pointed out to me in the comments. I leave it here as I believe it is helpful to most people who want to understand these things, but remember that this algorithm is actually called in-place mergesort and can have a different runtime complexity than pure mergesort.
Merge sort is easy to implement to use the same array for everything, without creating new arrays. Just pass the bounds in each recursive call. So something like this (in pseudocode):
mergesort(array) ->
    mergesort'(array, 0, length of array - 1)

mergesort'(array, start, end) ->
    if start >= end: return
    mid = (start + end) / 2
    mergesort'(array, start, mid)
    mergesort'(array, mid + 1, end)
    merge(array, start, mid, mid + 1, end)

merge(array, start1, end1, start2, end2) ->
    // This function merges the two partitions
    // by just moving elements inside array
In merge sort, space complexity is always Ω(n), as you have to store the elements somewhere. Additional space complexity can be O(n) in an implementation using arrays and O(1) in linked list implementations. In practice, implementations using lists need additional space for list pointers, so unless you already have the list in memory it shouldn't matter. Edit: if you count stack frames, then it's O(n) + O(log n), so still O(n) in the case of arrays. In the case of lists it's O(log n) additional memory.
That's why in merge-sort complexity analysis people mention 'additional space requirement' or things like that. It's obvious that you have to store the elements somewhere, but it's always better to mention 'additional memory' to keep purists at bay.

Data structure for efficiently returning the top-K entries of a hash table (map, dictionary)

Here's a description:
It operates like a regular map with get, put, and remove methods, but has a getTopKEntries(int k) method to get the top-K elements, sorted by the key:
For my specific use case, I'm adding, removing, and adjusting a lot of values in the structure, but at any one time there's approximately 500-1000 elements; I want to return the entries for the top 10 keys efficiently.
I call the put and remove methods many times.
I call the getTopKEntries method.
I call the put and remove methods some more times.
I call the getTopKEntries method.
...
I'm hoping for O(1) get, put, and remove operations, and for getTopKEntries to be dependent only on K, not on the size of the map.
So what's a data structure for efficiently returning the top-K elements of a map?
My other question is similar, but is for the case of returning all elements of a map, sorted by the key.
If it helps, both the keys and values are 4-byte integers.
A binary search tree (i.e. std::map in C++) sounds like the perfect structure: it’s already lexicographically ordered, i.e. a simple in-order traversal will yield the elements in ascending order. Hence, iterating over the first k elements will yield the top k elements directly.
Additionally, since you foresee a lot of “remove” operations, a hash table won’t be well-suited anyway: remove operations destroy the load factor characteristics of hash tables which leads to a rapid deterioration of the runtime.
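As a minimal usage sketch of that suggestion (my own code, assuming 4-byte integer keys and values as stated in the question): with std::map the entries are already kept in key order, so the K entries with the largest keys are just the first K positions of a reverse in-order walk.

#include <map>
#include <vector>
#include <utility>
#include <cstdint>
#include <cstddef>

// K entries with the largest keys, in descending key order: O(K) per call,
// on top of the O(log n) already paid by each put/remove on the map.
std::vector<std::pair<std::int32_t, std::int32_t>>
top_k_entries(const std::map<std::int32_t, std::int32_t>& m, std::size_t k) {
    std::vector<std::pair<std::int32_t, std::int32_t>> out;
    for (auto it = m.rbegin(); it != m.rend() && out.size() < k; ++it)
        out.push_back(*it);
    return out;
}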
I'm not sure I fully accept Konrad's view that lots of remove operations would destroy the structure of a hash table.
Without remove operations, you could keep all the objects in a hash table and keep the top K in a priority heap that is incrementally updated. This would make insert O(1 + log K), i.e. constant-time in N, assuming K is constant and not dependent on N (N = number of objects in the table). However, this doesn't work when you have the remove operation available. The proposed Fibonacci heap has an O(log N) amortized delete operation, so it doesn't give a good solution either, as all the objects would need to be kept in the heap, and if you eventually remove every object that you insert, you get O(log N) behavior in general per pair of insert+delete.
I would maybe try the following approach:
Store the objects in a hash table, assuming you need the whole table for purposes other than returning the top objects. Maintain a priority heap (a standard binary heap will do) that contains K * C objects, where C is a constant whose value you need to determine experimentally. Whenever you add a new object, try to insert it into the heap; if it fits in the KC space (the heap is not at capacity yet, or it pushes another object out), insert it and set a bit in the hash table to denote that the object is in the heap; when you push an object out of the heap, clear the bit. When you remove an object, check the bit; if the bit is 1, i.e. the object was in the heap, remove it from there (you need to search for it unless you keep a pointer to it from the hash table; it's best to maintain the pointer). What happens now is that the heap shrinks. The key thing is that as long as the heap still has at least K objects, it is guaranteed to contain all of the top K objects. This is where the factor C comes in: it provides the "leeway" for the heap. When the heap size drops below K, you run a linear scan over the whole hash table and fill the heap back to KC capacity.
Setting C is empirical because it depends on how your objects come and go; but tuning it should be easy as you can tune it just based on runtime profiling.
Complexity: Insert is O(1 + log (KC)). Remove is O(1 + p log (KC) + q N) where p is the probability that a removed object was in the heap, and q is the probability that the heap needs to be rebuilt. p is dependent on the characteristics of how objects come and go. For a simple analysis we can set p=(KC/N), i.e. assume uniform probability. q is even more sensitive to the "flow" of the objects. For example, if new objects in general increase in their value over time and you always delete older objects, q tends towards zero.
Note that funnily enough p is inversely proportional to N so actually this part speeds up when N grows :)
An alternative would be just to sort the items.
In your usage scenario there are only 1000 items – sorting them is just incredibly fast (keep in mind that log2 1000 ≈ 10, so the log factor is practically negligible), and it seems not to occur too often.
You can even adapt the selection algorithm to return the K smallest items. Unfortunately, this will still depend on n, not only on k as you’d hoped for: O(n + k log k).
(I’ve added this as a new answer because it’s actually completely unrelated to my first entry.)
I would recommend a fibonacci heap.
You might want a heap (although deletion may be a problem).
Unless I'm being severely un-creative today, you just can't do it all at O(1).
If you are maintaining a sort order, then adds and deletes will probably be at O(log n). If you are not, then your search will have to be O(n).
Hash tables just don't do sorting. I suggest you live with the O(log n) for inserts and deletes and use one of the suggested data structures (a heap is probably the best). If you need O(1) lookups, you could combine it with a hash table, but then you are maintaining two data structures in parallel and might as well use a TreeMap.
If the sort key is a simple integer or decimal number, a trie will be quite fast. It will use up memory, and technically finding an element in a trie is O(log n). But in practice it'll be something like log_256 n, so the constant factor is very small (log_256 of 2 billion ≈ 4).
I feel a heap is the best data structure for this problem: put and remove run in O(log N), and the top K elements can be returned in O(K log N) time. Use a max-heap if you want the maximum elements.
Here I am assuming that the top K elements means the K elements having the largest values.
