Data structure to hold strings, with certain demands

Build a data structure that holds strings, such that:
Create(X1,...,Xn) - O(n) complexity. Receives a list of n strings and builds the data structure.
Insert(X) - O(log n) complexity. Receives a new string and inserts it into the data structure.
Median() - O(log n) complexity. Returns the ceil(n/2)-th string in lexicographic order. Assume each comparison between two strings takes O(1).
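A standard way to meet the Insert and Median bounds is the two-heap trick. Below is a minimal Java sketch (class and method names are illustrative, not from the question). Note that a true O(n) Create would quickselect the median, partition around it, and heapify each half; for brevity the sketch replaces that with a simple O(n log n) insertion loop.

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Two-heap median structure: `lower` is a max-heap holding the smaller
// ceil(k/2) strings, `upper` a min-heap holding the rest, so the
// ceil(k/2)-th smallest string is always lower.peek().
class MedianStrings {
    private final PriorityQueue<String> lower = new PriorityQueue<>(Comparator.reverseOrder());
    private final PriorityQueue<String> upper = new PriorityQueue<>();

    // O(log n): push into the appropriate heap, then rebalance so that
    // lower always holds exactly ceil(total/2) elements.
    void insert(String x) {
        if (lower.isEmpty() || x.compareTo(lower.peek()) <= 0) lower.add(x);
        else upper.add(x);
        if (lower.size() > upper.size() + 1) upper.add(lower.poll());
        else if (upper.size() > lower.size()) lower.add(upper.poll());
    }

    // O(1), comfortably within the O(log n) bound asked for.
    String median() { return lower.peek(); }

    // Simple O(n log n) Create for brevity; see the lead-in for how a
    // true O(n) Create (quickselect + partition + heapify) would work.
    static MedianStrings create(List<String> xs) {
        MedianStrings m = new MedianStrings();
        for (String x : xs) m.insert(x);
        return m;
    }
}
```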

Related

How to design a data structure with getMedian and insert in O(1)?

I thought about using a sorted array and saving the index of the median, so that getMedian takes O(1). But I couldn't think of any way to do the insert in O(1) while keeping the array sorted.
I'd really appreciate it if someone could help me with this problem.
What you are asking for is impossible, because it would allow comparison-based sorting in O(n) time:
Suppose you have an unsorted array of length n.
Find the minimum element and maximum element in O(n) time.
Insert all n elements into the data structure, each insertion takes O(1) time so this takes O(n) time.
Insert n-1 extra copies of the minimum element. This also takes O(n) time. (There are now 2n-1 elements, so the median, i.e. the n-th smallest, is the smallest element of the original array.)
Initialise an output array of length n.
Do this n times:
Read off the median of the elements currently in the data structure, and write it at the next position in the output array. This takes O(1) time.
Insert two copies of the maximum element into the data structure. This takes O(1) time, and it shifts the median one position to the right in the sorted order of the original elements.
The above algorithm supposedly runs in O(n) time, and the result is a sorted array of the elements from the input array. But this is impossible, because comparison-sorting takes Ω(n log n) time. Therefore, the supposed data structure cannot exist.
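To make the reduction concrete, here is a hedged Java sketch written against a hypothetical MedianBag interface, i.e. the impossible structure the question asks for; median() is taken to return the ceil(k/2)-th smallest of the k items inserted so far.

```java
// Hypothetical interface: the structure the question asks for.
interface MedianBag {
    void insert(int x); // claimed O(1)
    int median();       // claimed O(1): the ceil(k/2)-th smallest of k items
}

class MedianSortReduction {
    // Any such structure would sort n elements in O(n) total work,
    // contradicting the Omega(n log n) comparison-sorting lower bound.
    static int[] impossibleSort(int[] a, MedianBag bag) {
        int n = a.length, min = a[0], max = a[0];
        for (int x : a) { min = Math.min(min, x); max = Math.max(max, x); } // O(n)
        for (int x : a) bag.insert(x);               // n inserts
        for (int i = 1; i < n; i++) bag.insert(min); // n-1 copies of the minimum
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = bag.median(); // the (n+i)-th smallest of the 2n-1+2i items
            bag.insert(max);       // two maxima shift the median one step right
            bag.insert(max);
        }
        return out; // sorted ascending, in O(n) total -- contradiction
    }
}
```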

Binary Search Tree with 2 keys

I have a database of users with their usernames and ids. These are the operations the program will process:
insert, delete (by username), search (by username), print (prints all users' info, sorted by id)
The time complexity of the first three operations shouldn't be more than O(log n), and print should be O(n). The solution should be implemented with a balanced BST.
My idea is to keep two BSTs, one keyed by id and the other by username, so we can access an element by either username or id in O(log n) time. But this doubles the memory use and the time of the operations.
Is there a way to access elements by both username and id in O(log n) time in a better way than what I described?
What you propose will indeed double the memory and time requirements of your data structure. (Only insertions and deletions take double time; the other operations take no extra time.) However, recall that O(2 log n) is the same as O(log n), and both are much less than O(n). If you graph 2 log n against n, they are equal at n = 2 and n = 4, and beyond that log n is essentially a flat line compared to n.
I claim that you cannot do better than this using balanced BSTs (or at all, for that matter). Since you need to search by username in O(log n) time, username must be a key for a tree. However, you also need to retrieve the users sorted by id in O(n) time. That essentially forbids sorting them after retrieval, because you won't be able to sort them faster than O(n log n). Thus, they must already be stored sorted by id, so id must also be a key for a tree. Hence, you need two trees.
While 2 trees are fine, you can also use a hash table for lookup and delete plus a sorted index for printing. A red-black tree will be fine for the sorted index.
However, if IDs are consecutive non-negative integers, it is even more efficient to maintain a simple array, where position i contains the object with ID i. Now you can print by just traversing the array, and the hash table's values can be IDs, which "point" to the respective objects in the array.
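As a concrete sketch of the two-tree idea, here is a minimal Java version using TreeMap, which is backed by a red-black tree; class and field names are illustrative.

```java
import java.util.TreeMap;

// Two red-black trees: one keyed by username for search/delete,
// one keyed by id so printing is a single in-order traversal.
class UserIndex {
    static final class User {
        final int id; final String username;
        User(int id, String username) { this.id = id; this.username = username; }
        @Override public String toString() { return id + " " + username; }
    }

    private final TreeMap<String, User> byName = new TreeMap<>();
    private final TreeMap<Integer, User> byId = new TreeMap<>();

    void insert(User u) {            // two O(log n) insertions
        byName.put(u.username, u);
        byId.put(u.id, u);
    }
    void delete(String username) {   // two O(log n) deletions
        User u = byName.remove(username);
        if (u != null) byId.remove(u.id);
    }
    User search(String username) {   // O(log n)
        return byName.get(username);
    }
    void print() {                   // O(n): in-order traversal, sorted by id
        byId.values().forEach(System.out::println);
    }
}
```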

For faster searching, shouldn't one apply merge sort to the data before doing binary search, or just jump straight to linear search?

I'm learning about algorithms and have doubts about their application in certain situations. There is divide-and-conquer merge sort, and there is binary search. Both are faster than their linear-time counterparts.
Say I want to search for some value in a large list of data, and I don't know whether the data is sorted. Instead of doing a linear search, why not first merge-sort the data and then do a binary search? Would that be faster? Or would merge sort followed by binary search be even slower than a linear search? Why? Does it depend on the size of the data?
There's a flaw in the premise of your question. Merge sort has O(n log n) complexity, which is the best any comparison-based sorting algorithm can do, but that's still a lot slower than a single linear scan. Note that log2(1000) ≈ 10. (Obviously, the constant factors matter a lot, especially for smallish problem sizes. Linear search of an array is one of the most efficient things a CPU can do. Copying things around for merge sort is not bad, because the loads and stores hit sequential addresses (so caches and prefetching are effective), but it's still far more work than ~10 passes over the array.)
If you need to support a mix of insert/delete and query operations, all with good time complexity, pick the right data structure for the task. A binary search tree is probably appropriate (or a Red-Black tree or some other variant that does some kind of rebalancing to prevent O(n) worst-case behaviour). That'll give you O(log n) query, and O(log n) insert/delete.
A sorted array gives you O(n) insert/delete (you have to shuffle the remaining elements over to make or close gaps), but O(log n) query, with lower time and space overhead than a tree.
An unsorted array gives O(n) query (linear search), O(1) insert (append at the end), and O(n) delete (an O(n) query, then shuffling elements to close the gap). Deleting near the end is cheap, and if order doesn't matter, delete can be O(1): swap the element with the last one and shrink.
A linked list, sorted or unsorted, has few advantages other than simplicity.
A hash table gives O(1) amortized average-case insert/delete, and O(1) present/not-present queries. But querying which two elements a non-present value falls between takes an O(n) linear scan, tracking the smallest element greater than x and the largest element less than x.
If your inserts/deletes happen in large chunks, sorting the new batch and merging it into the sorted array is much more efficient than adding elements one at a time to a sorted array (which is effectively insertion sort); see the sketch below. Appending the chunk at the end and re-running quicksort is also an option, and may modify less memory.
So the best choice depends on the access pattern you're optimizing for.
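For illustration, here is a small Java sketch of the batch idea: sort the incoming chunk, then fold it in with one linear merge pass (the function name is illustrative).

```java
import java.util.Arrays;

class BatchMerge {
    // O(k log k + n + k) to absorb a batch of k items into a sorted
    // array of n, versus O(k * n) for k one-at-a-time insertions.
    static int[] mergeBatch(int[] sorted, int[] batch) {
        Arrays.sort(batch); // O(k log k)
        int[] out = new int[sorted.length + batch.length];
        int i = 0, j = 0, k = 0;
        // Single merge pass over both sorted sequences.
        while (i < sorted.length && j < batch.length)
            out[k++] = (sorted[i] <= batch[j]) ? sorted[i++] : batch[j++];
        while (i < sorted.length) out[k++] = sorted[i++];
        while (j < batch.length) out[k++] = batch[j++];
        return out;
    }
}
```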
If the list is of size n, then
TimeOfMergeSort(list) + TimeOfBinarySearch(list) = O(n log n) + O(log n) = O(n log n)
TimeOfLinearSearch(list) = O(n)
and since n grows strictly slower than n log n, for large enough n
TimeOfLinearSearch(list) < TimeOfMergeSort(list) + TimeOfBinarySearch(list)
Of course, as mentioned in the comments, the frequency of sorting and the frequency of searching play a huge role in the amortized cost: if you perform k searches on the same data, sorting once costs O(n log n + k log n) in total versus O(kn) for k linear searches, so sorting pays off roughly once k exceeds log n.

Number of occurrences of words in a file - Complexity?

Given a file containing a set of words:
1) If I choose a hash table to store word -> count, what would be the time complexity to find the occurrences of a particular word?
2) How could I return those words alphabetically ordered?
If I choose a hash table, I know that the time complexity for 1) would be O(n) to parse all the words and O(1) to get the count of a particular word.
I fail to see how I could order the hash table, and what the time complexity would be. Any help?
A sortable hash map becomes, essentially, a binary tree. In Java, see TreeMap, which implements the SortedMap interface with O(log n) lookup and insert.
If you want the best theoretical performance, you'd use a HashMap with O(1) lookup and insert, and then sort the keys for display/iteration with a bucket/radix sort, which is O(n) in the number of words (times the key length).
In practice, radix-sorting strings often performs worse than an O(n log n) quicksort.
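For concreteness, a minimal Java sketch of the practical route: count with a HashMap, then hand the result to a TreeMap for alphabetical iteration (class and method names are illustrative).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class WordCounts {
    // O(n) expected time to count, then O(n log n) to build the sorted view.
    static SortedMap<String, Integer> countSorted(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum); // O(1) average
        return new TreeMap<>(counts); // red-black tree: in-order = alphabetical
    }
}
```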
Your analysis of (1) is correct.
Most hash table implementations (that I know of) have no implicit ordering.
To get an ordered list you'd have to sort the keys (O(n log n)); queries on the sorted list would then take O(log n).
You could theoretically define a hash operation and implementation that sorts, but making it well-distributed (for it to be efficient) would be difficult and just sorting would be a lot simpler.
If it's a file containing lots of duplicates, the best idea may be to use hashing first to eliminate duplicates, then iterate through the hash table to get a list of non-duplicates and sort that.
Working with hash tables has two drawbacks: (1) they do not store data in sorted order, and (2) computing the hash value is usually time-consuming. They also have linear worst-case complexity for insert/delete/lookup.
My suggestion is to use a trie to store the words. It offers guaranteed O(|S|) insert/lookup, where |S| is the word's length, independent of the number of words stored. A pre-order traversal of the trie yields the words in sorted order.
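A minimal sketch of that suggestion, assuming lowercase ASCII words (the alphabet size and method names are illustrative):

```java
// Trie over lowercase ASCII words: insert/lookup cost O(|S|), independent
// of how many words are stored; a pre-order walk visits words in sorted order.
class Trie {
    private final Trie[] child = new Trie[26];
    private int count; // occurrences of the word ending at this node

    void insert(String w) {
        Trie node = this;
        for (char c : w.toCharArray()) {
            int i = c - 'a';
            if (node.child[i] == null) node.child[i] = new Trie();
            node = node.child[i];
        }
        node.count++;
    }

    // Pre-order walk (node first, then children a..z) prints the words
    // alphabetically along with their counts.
    void print(StringBuilder prefix) {
        if (count > 0) System.out.println(prefix + " " + count);
        for (int i = 0; i < 26; i++) {
            if (child[i] != null) {
                prefix.append((char) ('a' + i));
                child[i].print(prefix);
                prefix.deleteCharAt(prefix.length() - 1);
            }
        }
    }
}
```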

Ordered array of strings - worst case?

Would O(log n) be the worst-case search time for an ordered array of n strings?
I just took a test today and I'm wondering whether I was right or wrong in selecting it out of these:
O(n)
O(log n)
O(n/2)
O(√n)
Sequential search: best case O(1) (the first element matches), worst case O(n).
Binary search: best case O(1), worst case O(log n).
Searching for a string in a sorted array of strings with binary search is O(|S| * log n), where |S| is the average string length and n is the number of strings: there are O(log n) compare operations, and each compare is O(|S|), since it has to read the strings.
If you regard the length of the strings as a constant, it is O(log n); this assumption is generally not made when talking about strings.
Note that there are other data structures, such as the trie, that allow better complexity: a trie supports O(|S|) search regardless of how many strings are in the collection.
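For illustration, a short Java sketch that makes the per-comparison cost visible (names are illustrative):

```java
class SortedStringSearch {
    // Binary search over a sorted String[]: O(log n) iterations, but each
    // compareTo may read up to |S| characters, hence O(|S| * log n) overall.
    static int find(String[] sorted, String target) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = sorted[mid].compareTo(target); // O(|S|) per comparison
            if (cmp == 0) return mid;
            if (cmp < 0) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1; // not found
    }
}
```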
P.S.
Mathematically speaking, since big-O notation is an upper bound and not a tight bound, all of the listed answers are correct(1) for binary search: it is O(n), O(n/2), O(log n), and O(sqrt(n)), since all of them are asymptotic upper bounds on binary search's running time.
(1) Assuming binary search, and that all strings have bounded length, so each compare op is O(1).
