Finding an appropriate data structure - data-structures

I have N keys.
I need to find a data structure that supports the following operations:
1. building it in O(N)
2. finding the minimum in O(1)
3. deleting the median in O(log n)
4. finding the (n/2+7)-th biggest number
I thought about using a minimum heap (building is O(n), and the minimum is O(1): it's the root).
However, I'm having a hard time finding a way to do 3 and 4.
I think the median is supposed to be one of the leaves, but that's as far as I got.
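For reference, the part I think I already have, sketched in Python with heapq (the values are just an example):

    import heapq

    keys = [9, 4, 7, 1, 8, 5, 2]

    heap = list(keys)
    heapq.heapify(heap)   # builds a min-heap in O(n)
    print(heap[0])        # the minimum is the root, read in O(1) -> 1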

A popular question in Data Structures 1 exams, homework, and tutorials.
I'll try to give you some hints; if they don't suffice, comment and I'll give you more.
Remember that you don't have to use just one data structure; you can combine several.
Recall the definition of a median: n/2 of the numbers are larger, and n/2 of the numbers are smaller.
What data structures do you know that can be built in O(n), with complex operations on them taking O(log n) or less? Reread the tutorial slides on these data structures.
It might be easier for you to solve 1+3 separately from 1+2, and then think about merging the two.

When you say building in O(n), do you mean that addition has to be O(n), or that you have to build a collection of elements in O(n) such that addition has to be O(1)?
You could augment pretty much any data structure with an extra reference to retrieve the minimal element in constant time.
For #3, it sounds like you need to be able to find the median in O(lg n) and delete in O(1), or vice versa.
For #4, you didn't specify the time complexity.
To other posters - this is marked as homework. Please give hints rather than posting the answer.

A simple sorted array would solve #2, #3 and #4, but constructing it would take O(n log n). However, there are no restrictions on space complexity. I am thinking about how to use hashing during construction of the data structure to bring the build time down to O(n).
Hope this helps. I will get back if I find a better solution.

Related

Heapsort and building heaps using linked list

I know that a linked list is not an appropriate data structure for building heaps.
One of the answers here (https://stackoverflow.com/a/14584517/5841727) says that heap sort can be done in O(n log n) using a linked list, which is the same as with arrays.
I think that a heapify operation would cost O(n) time on a linked list, and we would need n/2 heapify operations, leading to a time complexity of O(n^2).
Can someone please explain how to achieve O(n log n) complexity for heap sort using a linked list?
The Stack Overflow answer you linked is merely someone's claim (at least at the time I'm writing this), so my answer is based on that assumption. Mostly, when people mention "time complexity", they mean asymptotic analysis: working out how the time taken by the algorithm grows with increasing input size, ignoring all constants.
To reason about the time complexity with a linked list, let's assume there is a function that returns the value at a given index (I know linked lists don't support access by index). For efficiency this function would also need to be passed the level, but we can ignore that for now since it has no impact on the time complexity.
So now it comes down to analyzing how this function's running time grows with the input size. You can imagine that to fix (heapify) one node you may have to traverse the list at most three times: 1. one traversal to compare the two possible children and find out which one to swap with, 2. going back up to the parent for the swap, 3. coming back down to the node you just swapped. Even though it may seem that you are doing up to n/2 traversals three times over, for asymptotic analysis this is still just O(n). You then have to do this log n times, exactly the same way as for an array, so the time complexity is O(n log n). See the time-complexity table for heaps on Wikipedia: https://en.wikipedia.org/wiki/Binary_heap#Summary_of_running_times
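If all you actually need is an O(n log n) heap sort of data that happens to live in a linked list, rather than heapifying the list nodes in place as described above, a simpler practical route is to copy the values into an array, heapsort there, and rebuild the list; the copies are O(n), so the total stays O(n log n). A minimal Python sketch of that idea (the deque here just stands in for a linked list; names are illustrative):

    import heapq
    from collections import deque

    def heapsort_linked(lst):
        """Sort a linked-list-like sequence in O(n log n) by heapsorting a copy."""
        heap = list(lst)          # O(n) copy into an array
        heapq.heapify(heap)       # O(n) bottom-up heapify
        out = deque()
        while heap:
            out.append(heapq.heappop(heap))   # n pops, O(log n) each
        return out

    print(heapsort_linked(deque([5, 1, 4, 2, 3])))   # deque([1, 2, 3, 4, 5])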

LCA in binary tree O(n) space O(1) time

I need to find a data structure that preprocesses a binary tree and answers LCA queries in O(1) time using O(n) space.
Which data structure helps in this case?
Thank you
There are a couple of data structures that solve this problem. The most common approach is to use a data structure for the range minimum query problem, which can be solved with O(n) preprocessing time and O(1) query time, and then using that to solve LCA. I put together a two-part series of lecture slides on this topic for a class I taught earlier this year.
Part one talks about the range minimum query problem and motivates a couple common solution strategies.
Part two talks about the Fischer-Heun RMQ structure, which meets the requisite time bounds, and how to use it to solve LCA.
For what it's worth, in practice, there are solutions that use O(n) preprocessing time and O(log n) query time but are faster than the equivalent worst-case efficient structures. The ⟨O(n), O(log n)⟩ hybrid described in the first set of lecture slides is, for most practical purposes, good enough.
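To make the RMQ reduction concrete, here is a Python sketch of the classic Euler-tour-plus-sparse-table version. It uses O(n log n) preprocessing rather than the O(n) of Fischer-Heun, but answers each LCA query in O(1), and as noted above it is usually good enough in practice. The tree encoding and the names below are illustrative assumptions, not part of the lecture slides.

    def preprocess_lca(tree, root):
        """tree: dict node -> list of children. Returns a query function lca(u, v)."""
        euler, depth, first = [], [], {}

        def dfs(node, d):                      # Euler tour: record the node on every visit
            first.setdefault(node, len(euler))
            euler.append(node)
            depth.append(d)
            for child in tree.get(node, []):
                dfs(child, d + 1)
                euler.append(node)
                depth.append(d)

        dfs(root, 0)

        # Sparse table over depths: sparse[k][i] holds the index of the
        # minimum-depth entry in euler[i : i + 2**k].
        m = len(euler)
        sparse = [list(range(m))]
        for k in range(1, m.bit_length()):
            prev, half = sparse[-1], 1 << (k - 1)
            row = []
            for i in range(m - (1 << k) + 1):
                a, b = prev[i], prev[i + half]
                row.append(a if depth[a] <= depth[b] else b)
            sparse.append(row)

        def lca(u, v):
            lo, hi = sorted((first[u], first[v]))
            k = (hi - lo + 1).bit_length() - 1
            a, b = sparse[k][lo], sparse[k][hi - (1 << k) + 1]
            return euler[a if depth[a] <= depth[b] else b]

        return lca

    # Usage on a small binary tree: a has children b, c; b has children d, e.
    lca = preprocess_lca({'a': ['b', 'c'], 'b': ['d', 'e']}, 'a')
    print(lca('d', 'e'))   # 'b'
    print(lca('d', 'c'))   # 'a'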

Given a situation, how to decide on a data structure?

I'm preparing for technical interviews and have mostly faced questions that are situation based. Often the situation involves a big dataset and I'm asked to decide which data structure would be the most appropriate to use.
I'm familiar with most data structures, their implementation and their performance. But I struggle to be decisive about which structure to choose when given a concrete situation.
I'm looking for steps/an algorithm to follow in a given situation that can help me arrive at the optimal data structure within the time limits of an interview.
It depends on what operations you need to support efficiently.
Let's start from the simplest example: you have a large list of elements and you have to find a given element. Let's consider various candidates.
You can use a sorted array to find an element in O(log N) time using binary search. What if you want to support insertion and deletion along with that? Inserting an element into a sorted array takes O(N) time in the worst case (think of adding an element at the beginning: you have to shift all the elements one place to the right). This is where binary search trees (BSTs) come in. They can support insertion, deletion and searching for an element in O(log N) time.
Now suppose you also need to support two more operations, finding the minimum and the maximum. In a sorted array this is just returning the first and the last element respectively, so the complexity is O(1). In a BST, assuming it is balanced like a red-black tree or an AVL tree, finding the min and max takes O(log N) time. Consider another situation where you need to return the kth order statistic. Again, the sorted array wins. As you can see there is a tradeoff, and it really depends on the problem you are given.
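As a small illustration of that tradeoff, here is a sketch with Python's bisect module on a sorted list: search, min, max and the kth order statistic are cheap, while insertion has to shift elements and is O(n). The values are made up for the example.

    import bisect

    data = [3, 8, 15, 23, 42]                # kept sorted

    # O(log n) membership test via binary search
    i = bisect.bisect_left(data, 15)
    print(i < len(data) and data[i] == 15)   # True

    print(data[0], data[-1])                 # min and max in O(1): 3 42
    print(data[2])                           # kth order statistic (k=3) in O(1): 15

    bisect.insort(data, 10)                  # O(log n) to find the spot, O(n) to shift
    print(data)                              # [3, 8, 10, 15, 23, 42]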
Let's take another example. You are given a graph of V vertices and E edges and you have to find the number of connected components in the graph. It can be done in O(V+E) time using Depth first search (assuming adjacency list representation). Consider another situation where edges are added incrementally and the number of connected components can be asked at any point of time in the process. In that situation, Disjoint Set Union data structure with rank and path compression heuristics can be used and it is extremely fast for this situation.
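A sketch of that union-find idea (path compression plus union by rank), counting components as edges arrive; the class and field names are just one common way to write it.

    class DSU:
        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n
            self.components = n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path compression
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return
            if self.rank[ra] < self.rank[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra                               # union by rank
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1
            self.components -= 1

    dsu = DSU(5)                    # 5 vertices, edges arrive one at a time
    for u, v in [(0, 1), (1, 2), (3, 4)]:
        dsu.union(u, v)
    print(dsu.components)           # 2 connected components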
One more example: you need to support range updates and finding the sum of a subarray efficiently, and no new elements are inserted into the array. If you have an array of N elements and Q queries, then there are two choices. If the range-sum queries come only after all of the update operations, which are Q' in number, then you can preprocess the array in O(N+Q') time and answer any query in O(1) time (store prefix sums). What if there is no such order enforced? You can use a segment tree with lazy propagation for that. It can be built in O(N log N) time and each query can be performed in O(log N) time, so you need O((N+Q) log N) time in total. Again, what if insertion and deletion are supported along with all these operations? You can use a data structure called a treap, which is a probabilistic data structure, and all these operations can be performed in O(log N) time (using an implicit treap).
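For the "all updates first, then all queries" case, the prefix-sum preprocessing could look like the following sketch: a difference array absorbs the Q' range updates in O(N+Q'), and the frozen prefix sums then answer each range-sum query in O(1). The function name and sample data are illustrative.

    from itertools import accumulate

    def build_range_sum(arr, updates):
        """Apply all range-add updates first (difference array), then freeze prefix sums."""
        n = len(arr)
        diff = [0] * (n + 1)
        for l, r, delta in updates:            # add delta to arr[l..r], inclusive
            diff[l] += delta
            diff[r + 1] -= delta
        final = [a + d for a, d in zip(arr, accumulate(diff))]
        prefix = [0] + list(accumulate(final)) # prefix[i] = sum of final[:i]
        return lambda l, r: prefix[r + 1] - prefix[l]   # O(1) sum of final[l..r]

    query = build_range_sum([1, 2, 3, 4, 5], [(0, 2, 10), (1, 4, 1)])
    print(query(0, 4))   # (1+10) + (2+10+1) + (3+10+1) + (4+1) + (5+1) = 49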
Note: constants are omitted in Big-O notation; some of these structures have large constants hidden in their complexities.
Start with common data structures. Can the problem be solved efficiently with arrays, hashtables, lists or trees (or a simple combination of them, e.g. an array of hashtables or similar)?
If there are multiple options, just iterate the runtimes for common operations. Typically one data structure is a clear winner in the scenario set up for the interview. If not, just tell the interviewer your findings, e.g. "A takes O(n^2) to build but then queries can be handled in O(1), whereas for B build and query time are both O(n). So for one-time usage, I'd use B, otherwise A". Space consumption might be relevant in some cases, too.
Highly specialized data structures (e.g. prefix trees, aka "tries") are often just that: highly specialized for one particular case. The interviewer will usually be more interested in your ability to build useful stuff out of an existing general-purpose library than in your knowing all kinds of exotic data structures that may not have much real-world usage. That said, extra knowledge never hurts; just be prepared to discuss the pros and cons of whatever you mention (the interviewer may probe whether you are just "name dropping").

Complexity in using Binary search and Trie

Given a large list of alphabetically sorted words in a file, I need to write a program that, given a word x, determines whether x is in the list. Preprocessing is ok since I will be calling this function many times over different inputs.
Priorities: 1. speed, 2. memory.
I already know I can use (n is the number of words, m is the average length of a word):
1. a trie: time is O(log(n)), space (best case) is O(log(nm)), space (worst case) is O(nm).
2. loading the complete list into memory and then binary searching: time is O(log(n)), space is O(n*m).
I'm not sure about the complexities for the trie; please correct me if they are wrong. Also, are there other good approaches?
It is O(m) time for the trie, and up to O(m log(n)) for the binary search. The space is asymptotically O(nm) for any reasonable method, which you can probably reduce in some cases using compression. The trie structure is, in theory, somewhat better on memory, but in practice it has devils hiding in the implementation details: the memory needed to store pointers, and potentially poor cache behavior.
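To make the O(m) lookup concrete, here is a minimal dict-based trie sketch in Python; the word list is made up, and a production version would have to worry about the pointer and cache issues mentioned above.

    def build_trie(words):
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})   # one level per character
            node['$'] = True                     # end-of-word marker
        return root

    def contains(trie, word):
        node = trie
        for ch in word:                          # O(m), m = len(word)
            if ch not in node:
                return False
            node = node[ch]
        return '$' in node

    trie = build_trie(["cat", "car", "dog"])
    print(contains(trie, "car"))   # True
    print(contains(trie, "ca"))    # False (prefix only)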
There are other options for implementing a set structure - hashset and treeset are easy choices in most languages. I'd go for the hash set as it is efficient and simple.
I think a HashMap is perfectly fine for your case, since the time complexity of both put and get operations is O(1). It works fine even if you don't have a sorted list.
> Preprocessing is ok since I will be calling this function many times over different inputs.
As food for thought, have you considered creating a set from the input data and then searching using a hash? It will take more time up front to build the set, but if the number of distinct inputs is limited and you may query them again, a set could be a good idea, with an O(1) "contains" operation given a good hash function.
I'd recommend a hashmap. You can find an extension to C++ for this in both VC and GCC.
Use a bloom filter. It is space efficient even for very large data and it is a fast rejection technique.
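In case a sketch helps, here is one simple way a Bloom filter over the word list could look in Python; the bit-array size, number of hashes, and salted-hash scheme are illustrative choices, and the structure can return false positives but never false negatives.

    class BloomFilter:
        def __init__(self, m=1 << 20, k=5):
            self.m, self.k = m, k
            self.bits = bytearray(m)             # one byte per bit slot, kept simple

        def _positions(self, word):
            for i in range(self.k):              # k salted hashes of the same word
                yield hash((i, word)) % self.m

        def add(self, word):
            for p in self._positions(word):
                self.bits[p] = 1

        def might_contain(self, word):
            return all(self.bits[p] for p in self._positions(word))

    bf = BloomFilter()
    for w in ["apple", "banana", "cherry"]:
        bf.add(w)
    print(bf.might_contain("banana"))   # True
    print(bf.might_contain("durian"))   # almost certainly False (fast rejection)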

How do I efficiently keep track of the smallest element in a collection?

In the vein of programming questions: suppose there's a collection of objects that can be compared to each other and sorted. What's the most efficient way to keep track of the smallest element in the collection as objects are added and the current smallest occasionally removed?
Using a min-heap is the best way.
http://en.wikipedia.org/wiki/Heap_(data_structure)
It is tailor made for this application.
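A minimal sketch of that with Python's heapq module, which implements a binary min-heap on top of a plain list:

    import heapq

    heap = []
    for x in [7, 3, 9, 1, 4]:
        heapq.heappush(heap, x)      # O(log n) per insert

    print(heap[0])                   # current smallest in O(1): 1
    heapq.heappop(heap)              # remove the smallest in O(log n)
    print(heap[0])                   # new smallest: 3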
If you need random insert and removal, the best way is probably a sorted array. Inserts and removals should be O(log(n)).
#Harpreet
That is not optimal. When an object is removed, erickson will have to scan the entire collection to find the new smallest.
You want to read up on binary search trees. MS has a good site to start down that path, but you may want to get a book like Introduction to Algorithms (Cormen, Leiserson, Rivest, Stein) if you want to dive deep.
For occasional removes, a Fibonacci heap is even faster than the min-heap. Insertion is O(1), and finding the min is also O(1). Removal is O(log(n)).
> If you need random insert and removal, the best way is probably a sorted array. Inserts and removals should be O(log(n)).
Yes, but you will need to re-sort on each insert and (maybe) each deletion, which, as you stated, is O(log(n)).
With the solution proposed by Harpreet:
- you have one O(n) pass at the beginning to find the smallest element
- inserts are O(1) thereafter (only one comparison against the already-known smallest element is needed)
- deletes will be O(n), because you will need to re-find the smallest element (keep in mind Big O notation is worst case); you can optimize by checking whether the element being deleted is the (known) smallest, and if not, skipping the re-scan entirely
So, it depends. One of these algorithms will be better for an insert-heavy use case with few deletes, but the other is overall more consistent. I think I would default to Harpreet's mechanism unless I knew that the smallest number would be removed often, because that exposes a weak point in that algorithm.
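For comparison, a sketch of that reference-tracking approach in Python: O(1) insert, and a full O(n) re-scan only when the tracked minimum itself is removed. The class name and details are made up for illustration.

    class MinTracker:
        def __init__(self, items=()):
            self.items = list(items)
            self.smallest = min(self.items) if self.items else None   # one O(n) pass

        def add(self, x):
            self.items.append(x)                 # O(1); one comparison keeps the min current
            if self.smallest is None or x < self.smallest:
                self.smallest = x

        def remove(self, x):
            self.items.remove(x)                 # O(n) list removal
            if x == self.smallest:               # re-scan only when the min was removed
                self.smallest = min(self.items) if self.items else None

    t = MinTracker([5, 2, 8])
    t.add(1)
    print(t.smallest)    # 1
    t.remove(1)
    print(t.smallest)    # 2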
Harpreet:
> the inserts into that would be linear since you have to move items for an insert.
Doesn't that depend on the implementation of the collection? If it acts like a linked list, inserts would be O(1), while if it were implemented as an array it would be linear, as you stated.
It depends on which operations you need your container to support. A min-heap is best if you might need to remove the min element at any given time, although several operations are nontrivial (amortized O(log n) time in some cases).
However, if you only need to push/pop from the front/back, you can use a min-deque, which achieves amortized constant time for all operations (including find-min). You can do a scholar.google.com search to learn more about this structure. A friend and I recently worked out a much easier-to-understand and easier-to-implement version of a min-deque as well; if this is what you're looking for, I can post the details.

Resources