Number of occurrences of words in a file - Complexity? - algorithm

Given that I have a file with a set of words:
1) If I choose a hash table to store word -> count, what would be the time complexity to find the occurrences of a particular word?
2) How could I return those words alphabetically ordered?
If I choose a hash table, I know that the time complexity for 1) would be O(n) to parse all the words and O(1) to get the count of a particular word.
I fail to see how I could order the hash table and what the time complexity would be. Any help?

A sortable hash map becomes, essentially, a binary tree. In Java you can see TreeMap implementing the SortedMap interface with O(log n) look-up and insert.
If you want the best theoretical performance you'd use a HashMap with O(1) look-up and insert and then you'd use a bucket/radix sort with O(n) for display/iteration.
In reality, using a radix sort on strings will often perform worse than an O(n log n) quicksort.
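For example, here is a minimal Java sketch of that approach (the word list is illustrative): count with a HashMap and sort the keys only when an alphabetical listing is actually needed.

    import java.util.*;

    public class WordCounts {
        public static void main(String[] args) {
            String[] words = {"pear", "apple", "pear", "banana", "apple", "pear"};

            // O(1) expected insert/lookup per word while counting.
            Map<String, Integer> counts = new HashMap<>();
            for (String w : words) {
                counts.merge(w, 1, Integer::sum);
            }

            // O(1) expected lookup of a single word's count.
            System.out.println("pear -> " + counts.get("pear"));

            // Sort the keys only when an alphabetical listing is needed:
            // O(k log k) for k distinct words (a TreeMap would instead pay
            // O(log k) per insert).
            List<String> keys = new ArrayList<>(counts.keySet());
            Collections.sort(keys);
            for (String k : keys) {
                System.out.println(k + " -> " + counts.get(k));
            }
        }
    }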

Your analysis of (1) is correct.
Most hash table implementations (that I know of) have no implicit ordering.
To get an ordered list you'd have to sort the entries (O(n log n)); queries on the sorted list would then take O(log n).
You could theoretically define a hash operation and implementation that sorts, but making it well-distributed (for it to be efficient) would be difficult and just sorting would be a lot simpler.
If it's a file containing lots of duplicates, the best idea may be to use hashing first to eliminate duplicates, then iterate through the hash table to get a list of non-duplicates and sort that.

Working with hash tables has two drawbacks: 1) they do not store data in sorted order, and 2) computing the hash value is usually time-consuming. They also have linear worst-case complexity for insert/delete/lookup.
My suggestion is to use a trie to store your words. A trie gives guaranteed O(L) insert/lookup, where L is the word length, independent of the number of words stored. A pre-order traversal over the trie gives a sorted list of the words it contains, as in the sketch below.
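Here is a minimal sketch of that idea (the node layout and the lowercase a-z restriction are assumptions, not part of the question): each node stores a count, and a pre-order walk emits the words alphabetically together with their counts.

    class TrieWordCounter {
        private static class Node {
            Node[] child = new Node[26]; // assumes lowercase a-z only
            int count;                   // occurrences of the word ending here
        }

        private final Node root = new Node();

        // O(L) insert, where L is the word length.
        void add(String word) {
            Node cur = root;
            for (char c : word.toCharArray()) {
                int i = c - 'a';
                if (cur.child[i] == null) cur.child[i] = new Node();
                cur = cur.child[i];
            }
            cur.count++;
        }

        // Pre-order traversal yields the words in alphabetical order.
        void printSorted() {
            walk(root, new StringBuilder());
        }

        private void walk(Node node, StringBuilder prefix) {
            if (node.count > 0) System.out.println(prefix + " -> " + node.count);
            for (int i = 0; i < 26; i++) {
                if (node.child[i] != null) {
                    prefix.append((char) ('a' + i));
                    walk(node.child[i], prefix);
                    prefix.deleteCharAt(prefix.length() - 1);
                }
            }
        }
    }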

Related

What datastructure is effective for minimizing the cost of look ups in hash table buckets?

Given a hash table with collisions, the generic hash table implementation will cause look-ups within a bucket to run in O(n), assuming that a linked list is used.
If we switch the linked list for a binary search tree, we go down to O(log n). Is this the best we can do, or is there a better data structure for this use case?
Using hash tables for the buckets themselves would bring the look up time to O(1), but that would require clever revisions of the hash function.
There is a trade-off between insertion time and look-up time in your solution (keeping each bucket sorted).
If you want to keep every bucket sorted, you will get O(log n) look-up time using binary search. However, when you insert a new element, you will have to place it in the right location so that the bucket remains sorted - O(log n) search time to find where the new element goes.
So in your solution, you get total complexity O(log n) for both insertion and look-up.
(In contrast to the traditional solution, which takes O(n) for look-up in the worst case, and O(1) for insertion.)
EDIT :
If you choose to use a sorted bucket, of course you can't use LinkedList any more. You can switch to any other suitable data structure.
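As a rough illustration (a sketch, not a production hash table), each bucket can simply be an ordered map such as Java's TreeMap, giving O(log b) look-up and insert within a bucket of size b:

    import java.util.TreeMap;

    // Sketch: a fixed-capacity hash table whose buckets are ordered maps,
    // so work within a bucket of size b costs O(log b) instead of O(b).
    class TreeBucketTable<K extends Comparable<K>, V> {
        private final TreeMap<K, V>[] buckets;

        @SuppressWarnings("unchecked")
        TreeBucketTable(int capacity) {
            buckets = new TreeMap[capacity];
            for (int i = 0; i < capacity; i++) buckets[i] = new TreeMap<>();
        }

        private int index(K key) {
            return Math.floorMod(key.hashCode(), buckets.length);
        }

        void put(K key, V value) { buckets[index(key)].put(key, value); }

        V get(K key) { return buckets[index(key)].get(key); }
    }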
Perfect hashing is known to achieve collision-free O(1) hashing of a limited set of keys known at the time the hash function is constructed. The Wikipedia article mentions several approaches to apply those ideas to a dynamic set of keys, like dynamic perfect hashing and cuckoo hashing, which might be of interest to you.
You've pretty much answered your own question. Since a hash table is just an array of other data structures, your lookup time is just dependent on the lookup time of the secondary data structure and how well your hash function distributes items across the buckets.

find index of element inside a collection, which collection to use?

I have a problem choosing the right data structure(s); these are the requirements:
I must be able to insert and delete elements
I must also be able to get the index of the element in the collection (order in the collection)
Elements have a unique identifier number
I can sort (if necessary) the elements using any criterion
Ordering is not really a must; the important thing is getting the index of the element, no matter how it is internally implemented, but anyway I think that the best approach is ordering.
The index of the element is the order inside the collection. So some kind of order has to be used. When I delete an element, the other elements from that to the end change their order/index.
First approach is using a linked list, but I don't want O(n).
I have also thought about using an ordered dictionary; that would give O(log n) for lookup/insert/delete, wouldn't it?
Is there a better approach? I know a TRIE would give O(1) for common operations, but I don't see how to get the index of an element; I would have to iterate over the trie, which would give O(n). Am I wrong?
Sounds like you want an ordered data structure, i.e. a (balanced) BST. Insertion and deletion would indeed be O(lg n), which suffices for many applications. If you also want elements to have an index in the structure, then you'd want an order statistic tree (see e.g., CLR, Introduction to Algorithms, chapter 14) which provides this operation in O(lg n). Dynamically re-sorting the entire collection would be O(n lg n).
If by "order in the collection" you mean any random order is good enough, then just use a dynamic array (vector): amortized O(1) append and delete, O(n lg n) in-place sort, but O(n) lookup until you do the sort, after which lookup becomes O(lg n) with binary search. Deletion would be O(n) if the data is to remain sorted, though.
If your data is string-like, you might be able to extend a trie in the same way that a BST is extended to become an order statistic tree.
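To make the order statistic tree idea concrete, here is a minimal, unbalanced Java sketch (rebalancing is omitted; with a red-black or AVL variant the operations are O(lg n)): each node stores its subtree size, which lets you answer "what is this element's index?" in time proportional to the tree height.

    // Sketch of a size-augmented BST (no rebalancing shown).
    class OrderStatisticTree {
        private static class Node {
            int key, size = 1;
            Node left, right;
            Node(int key) { this.key = key; }
        }

        private Node root;

        private static int size(Node n) { return n == null ? 0 : n.size; }

        void insert(int key) { root = insert(root, key); }

        private Node insert(Node n, int key) {
            if (n == null) return new Node(key);
            if (key < n.key) n.left = insert(n.left, key);
            else n.right = insert(n.right, key);
            n.size = 1 + size(n.left) + size(n.right);
            return n;
        }

        // 0-based index (rank) of key in sorted order, or -1 if absent.
        int indexOf(int key) {
            Node n = root;
            int acc = 0;
            while (n != null) {
                if (key < n.key) n = n.left;
                else if (key > n.key) { acc += size(n.left) + 1; n = n.right; }
                else return acc + size(n.left);
            }
            return -1;
        }
    }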
You don't mention an array/vector here, but it meets most of these criteria.
(Note that "Elements have a unique identifier number" is really independent of the data structure; does this mean the same thing as the index? Or is it an immutable key, which is more a function of the data you're putting into the structure...)
There are going to be timing tradeoffs in any scenario: you say linked list is O(n), but O(n) for what? You don't really get into your performance requirements for additions vs. deletes vs. searches; which is more important?
Well, if your collection is sorted, you don't need O(n) to find elements. It's possible to use binary search, for example, to determine the index of an element. It's also possible to write a simple wrapper around each entry inside your array that remembers its index inside the collection.
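A small sketch of the binary-search part of that suggestion, using the standard library (the sample values are arbitrary):

    import java.util.*;

    class IndexLookup {
        public static void main(String[] args) {
            List<Integer> ids = new ArrayList<>(List.of(3, 8, 15, 42));
            Collections.sort(ids);                          // keep the collection sorted
            int index = Collections.binarySearch(ids, 15);  // O(log n)
            System.out.println(index);                      // prints 2
        }
    }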

Fastest data structure for inserting/sorting

I need a data structure that can insert elements and sort itself as quickly as possible. I will be inserting a lot more than sorting. Deleting is not much of a concern and neither is space. My specific implementation will additionally store nodes in an array, so lookup will be O(1), i.e. you don't have to worry about it.
If you're inserting a lot more than sorting, then it may be best to use an unsorted list/vector, and quicksort it when you need it sorted. This keeps inserts very fast. The one¹ drawback is that sorting is a comparatively lengthy operation, since it's not amortized over the many inserts. If you depend on relatively constant time, this can be bad.
¹ Come to think of it, there's a second drawback. If you underestimate your sort frequency, this could quickly end up being overall slower than a tree or a sorted list. If you sort after every insert, for instance, then the insert+quicksort cycle would be a bad idea.
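A sketch of that approach, assuming it is acceptable to defer sorting until the data is actually read:

    import java.util.*;

    // Append in amortized O(1); sort lazily only when a sorted view is needed.
    class LazySortedList<T extends Comparable<T>> {
        private final List<T> items = new ArrayList<>();
        private boolean dirty = false;

        void add(T item) {
            items.add(item);
            dirty = true;
        }

        List<T> sorted() {
            if (dirty) {            // O(n log n), paid once per batch of inserts
                Collections.sort(items);
                dirty = false;
            }
            return items;
        }
    }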
Just use one of the self-balanced binary search trees, such as a red-black tree.
Use any of the Balanced binary trees like AVL trees. It should give O(lg N) time complexity for both of the operations you are looking for.
If you don't need random access into the array, you could use a Heap.
Worst and average time complexity:
O(log N) insertion
O(1) read largest value
O(log N) to remove the largest value
Can be reconfigured to give smallest value instead of largest. By repeatedly removing the largest/smallest value you get a sorted list in O(N log N).
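In Java that maps directly onto PriorityQueue (a binary min-heap by default); repeatedly polling it yields the elements in sorted order:

    import java.util.PriorityQueue;

    class HeapSortDemo {
        public static void main(String[] args) {
            PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
            for (int x : new int[] {5, 1, 4, 2, 3}) {
                heap.offer(x);                        // O(log n) per insert
            }
            while (!heap.isEmpty()) {
                System.out.print(heap.poll() + " ");  // O(log n) per removal; prints 1 2 3 4 5
            }
        }
    }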
If you can do a lot of inserts before each sort then obviously you should just append the items and sort no sooner than you need to. My favorite is merge sort. That is O(N log N), is well behaved, and has a minimum of storage manipulation (new, malloc, tree balancing, etc.)
HOWEVER, if the values in the collection are integers and reasonably dense, you can use an O(N) sort, where you just use each value as an index into a big-enough array, and set a boolean TRUE at that index. Then you just scan the whole array and collect the indices that are TRUE.
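A sketch of that presence-array idea, assuming non-negative integers below a known bound (note that duplicates collapse to a single entry):

    // Assumes values are integers in [0, max) and duplicates can be dropped.
    class PresenceSort {
        static int[] sortDistinct(int[] values, int max) {
            boolean[] present = new boolean[max];
            for (int v : values) present[v] = true;   // O(N) marking pass

            int count = 0;
            for (boolean p : present) if (p) count++;

            int[] result = new int[count];
            int i = 0;
            for (int v = 0; v < max; v++) {           // O(max) collection pass
                if (present[v]) result[i++] = v;
            }
            return result;
        }
    }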
You say you're storing items in an array where lookup is O(1). Unless you're using a hash table, that suggests your items may be dense integers, so I'm not sure if you even have a problem.
Regardless, memory allocation/deallocation is expensive, and you should avoid it by pre-allocating or pooling if you can.
I had good experience using a skip list for that kind of task.
At least in my case it was about 5 times faster compared to adding everything to a list first and then running a sort over it at the end.
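In Java the standard library's skip list lives in the concurrent package; a minimal sketch (note the set semantics, so duplicates are dropped):

    import java.util.concurrent.ConcurrentSkipListSet;

    class SkipListDemo {
        public static void main(String[] args) {
            ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<>();
            for (int x : new int[] {7, 3, 9, 1}) {
                set.add(x);              // expected O(log n) per insert
            }
            System.out.println(set);     // iterates in sorted order: [1, 3, 7, 9]
        }
    }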

Is partitioning easier than sorting?

This is a question that's been lingering in my mind for some time ...
Suppose I have a list of items and an equivalence relation on them, and comparing two items takes constant time.
I want to return a partition of the items, e.g. a list of linked lists, each containing all equivalent items.
One way of doing this is to extend the equivalence to an ordering on the items and order them (with a sorting algorithm); then all equivalent items will be adjacent.
But can it be done more efficiently than with sorting? Is the time complexity of this problem lower than that of sorting? If not, why not?
You seem to be asking two different questions at one go here.
1) If we allow only equality checks, is partitioning easier than if we had some ordering? The answer is no: you require Ω(n²) comparisons to determine the partitioning in the worst case (for instance, when all items are different).
2) If we allow ordering, is partitioning easier than sorting? The answer again is no, because of the element distinctness problem, which says that in order to even determine whether all objects are distinct, you require Ω(n log n) comparisons. Since sorting can be done in O(n log n) time (and also has an Ω(n log n) lower bound) and solves the partition problem, asymptotically they are equally hard.
If you pick an arbitrary hash function, equal objects need not have the same hash, in which case you haven't done any useful work by putting them in a hashtable.
Even if you do come up with such a hash (equal objects guaranteed to have the same hash), the time complexity is expected O(n) for good hashes, and the worst case is Ω(n²).
Whether to use hashing or sorting completely depends on other constraints not available in the question.
The other answers also seem to be forgetting that your question is (mainly) about comparing partitioning and sorting!
If you can define a hash function for the items as well as an equivalence relation, then you should be able to do the partition in linear time -- assuming computing the hash is constant time. The hash function must map equivalent items to the same hash value.
Without a hash function, you would have to compare every new item to be inserted into the partitioned lists against the head of each existing list. The efficiency of that strategy depends on how many partitions there will eventually be.
Let's say you have 100 items, and they will eventually be partitioned into 3 lists. Then each item would have to be compared against at most 3 other items before inserting it into one of the lists.
However, if those 100 items would eventually be partitioned into 90 lists (i.e., very few equivalent items), it's a different story. Now your runtime is closer to quadratic than linear.
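A sketch of the hash-based partition described above, assuming hashCode() and equals() agree with the equivalence relation:

    import java.util.*;

    class Partitioner {
        // Groups items into equivalence classes in expected O(n),
        // assuming hashCode() is consistent with the equivalence (equals()).
        static <T> Collection<List<T>> partition(Collection<T> items) {
            Map<T, List<T>> classes = new HashMap<>();
            for (T item : items) {
                classes.computeIfAbsent(item, k -> new ArrayList<>()).add(item);
            }
            return classes.values();
        }

        public static void main(String[] args) {
            System.out.println(partition(List.of("a", "b", "a", "c", "b", "a")));
            // e.g. [[a, a, a], [b, b], [c]]
        }
    }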
If you don't care about the final ordering of the equivalence sets, then partitioning into equivalence sets could be quicker. However, it depends on the algorithm and the numbers of elements in each set.
If there are very few items in each set, then you might as well just sort the elements and then find the adjacent equal elements. A good sorting algorithm is O(n log n) for n elements.
If there are a few sets with lots of elements in each then you can take each element, and compare it to the existing sets. If it belongs in one of them then add it, otherwise create a new set. This will be O(n*m) where n is the number of elements, and m is the number of equivalence sets, which is less than O(n log n) for large n and small m, but worse as m tends to n.
A combined sorting/partitioning algorithm may be quicker.
If a comparator must be used, then the lower bound is Ω(n log n) comparisons for sorting or partitioning. The reason is all elements must be inspected Ω(n), and a comparator must perform log n comparisons for each element to uniquely identify or place that element in relation to the others (each comparison divides the space in 2, and so for a space of size n, log n comparisons are needed.)
If each element can be associated with a unique key which is derived in constant time, then the lower bound is Ω(n) for both sorting and partitioning (cf. radix sort).
Comparison-based sorting generally has a lower bound of Ω(n log n).
Assume you iterate over your set of items and put them in buckets with items with the same comparative value, for example in a set of lists (say using a hash set). This operation is clearly O(n), even after retrieving the list of lists from the set.
--- EDIT: ---
This of course requires two assumptions:
There exists a constant time hash-algorithm for each element to be partitioned.
The number of buckets does not depend on the amount of input.
Thus, under these assumptions, partitioning can be done in O(n).
Partitioning is faster than sorting, in general, because you don't have to compare each element to each potentially-equivalent already-sorted element, you only have to compare it to the already-established keys of your partitioning. Take a close look at radix sort. The first step of radix sort is to partition the input based on some part of the key. Radix sort is O(kN). If your data set has keys bounded by a given length k, you can radix sort it in O(n). If your data are comparable and don't have a bounded key, but you choose a bounded key with which to partition the set, the complexity of sorting the set would be O(n log n) and the partitioning would be O(n).
This is a classic problem in data structures, and yes, it is easier than sorting. If you want to also quickly be able to look up which set each element belongs to, what you want is the disjoint set data structure, together with the union-find operation. See here: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
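A minimal union-find sketch with path compression and union by size (amortized time per operation is effectively constant):

    // Disjoint-set (union-find) with path compression and union by size.
    class DisjointSet {
        private final int[] parent, size;

        DisjointSet(int n) {
            parent = new int[n];
            size = new int[n];
            for (int i = 0; i < n; i++) { parent[i] = i; size[i] = 1; }
        }

        int find(int x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]]; // path halving
                x = parent[x];
            }
            return x;
        }

        void union(int a, int b) {
            int ra = find(a), rb = find(b);
            if (ra == rb) return;
            if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
            parent[rb] = ra;
            size[ra] += size[rb];
        }
    }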
The time required to perform a possibly-imperfect partition using a hash function will be O(n+bucketcount) [not O(n*bucketcount)]. Making the bucket count large enough to avoid all collisions will be expensive, but if the hash function works at all well there should be a small number of distinct values in each bucket. If one can easily generate multiple statistically-independent hash functions, one could take each bucket whose keys don't all match the first one and use another hash function to partition the contents of that bucket.
Assuming a constant number of buckets on each step, the time is going to be O(N lg N), but if one sets the number of buckets to something like sqrt(N), the average number of passes should be O(1) and the work in each pass O(N).

Time complexity for Search and Insert operations in sorted and unsorted arrays that include duplicate values

1) For a sorted array I have used binary search.
We know that the worst-case complexity for the SEARCH operation in a sorted array is O(lg N), if we use binary search, where N is the number of items in the array.
What is the worst-case complexity for the search operation in an array that includes duplicate values, using binary search?
Will it be the same O(lg N)? Please correct me if I am wrong!
Also, what is the worst case for the INSERT operation in a sorted array using binary search?
My guess is O(N)... is that right?
2) For an unsorted array I have used linear search.
Now we have an unsorted array that also accepts duplicate elements/values.
What is the best worst-case complexity for both the SEARCH and INSERT operations?
I think that we can use linear search, which will give us O(N) worst-case time for both search and delete operations.
Can we do better than this for an unsorted array, and does the complexity change if we accept duplicates in the array?
Yes.
The best case is uninteresting. (Think about why that might be.) The worst case is O(N), except for inserts. Inserts into an unsorted array are fast, one operation. (Again, think about it if you don't see it.)
In general, duplicates make little difference, except for extremely pathological distributions.
Some help on the way - but not the entire solution.
A best case for a binary search is if the item searched for is the first pivot element. The worst case is when you have to drill down all the way to two adjacent elements and still not find what you are looking for. Does this change if there are duplicates in the data? Inserting data into a sorted array involves shuffling all data with a higher sort order "one step to the right". The worst case is that you insert an item that has a lower sort order than any existing item.
To search an unsorted array there is no choice but linear search, as you suggest yourself. If you don't care about the sort order there is a much quicker, simpler way to perform the insert. Delete can be thought of as a search followed by a removal.
We can do better at deleting from an unordered array! As order doesn't matter in this case, we can swap the element to be deleted with the last element, which avoids the unnecessary shifting of the elements in the array. Thus we can delete in O(1) time (once the element has been found).
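A sketch of that O(1) swap-with-last delete from an unsorted array (backed here by a plain array and a logical length):

    // O(1) delete from an unsorted array: overwrite the removed slot with the
    // last element and shrink the logical length. Element order is not preserved.
    class UnsortedArray {
        private int[] data = new int[16];
        private int length = 0;

        void insert(int value) {
            if (length == data.length) data = java.util.Arrays.copyOf(data, length * 2);
            data[length++] = value;          // O(1) amortized insert at the end
        }

        void deleteAt(int index) {
            data[index] = data[--length];    // swap with the last element, O(1)
        }
    }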
