Confusion about Hash Map vs Trie time complexity - algorithm

Let's say we're comparing the time complexity of the search operation in a hashmap vs a trie.
On a lot of resources I can find, the time complexities are described as
Hashmap get: O(1)
vs
Trie search: O(k), where k is the number of characters in the string you want to search for.
However, I find this a bit confusing. To me, this looks like the input size "n" is defined differently in the two scenarios.
If we define n as the number of characters, and thus are interested in what's the complexity of this algorithm as the number of characters grow to infinity, wouldn't hashmap get also have a time complexity of O(k) due to its hash function?
On the other hand, if we define n as the number of words in the data structure, wouldn't the time complexity of Trie search also be O(1) since the search of the word doesn't depend on the number of words already stored in the Trie?
In the end, if we're doing an apples-to-apples comparison of time complexity, it looks to me like the time complexities of hashmap get and trie search would be the same.
What am I missing here?

Yes, you are absolutely correct.
What you are missing is that statements about an algorithm's complexity can be based on whatever input terms you like. Outside of school, such statements are made to communicate, and you can make them to communicate whatever you want.
It's important to make sure that you are understood, though, so if there is a chance for confusion about how the n in O(n) is measured, or any assumed constraints on the input (like bounded string size), then you should just specify that explicitly.
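To make that concrete, here is a minimal sketch (toy code, not any particular library's internals): both a typical string hash function and a trie search walk over all k characters of the key, so both hide a factor of O(k).

// Both operations below touch every one of the key's k characters.
import java.util.HashMap;
import java.util.Map;

class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isWord;
}

class KeyLengthCost {
    // Toy polynomial hash: reads all k characters, so it is O(k).
    static int toyHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    // Trie search: follows one edge per character, so it is also O(k).
    static boolean trieSearch(TrieNode root, String key) {
        TrieNode node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.get(key.charAt(i));
            if (node == null) return false;
        }
        return node.isWord;
    }
}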


Why doesn't the complexity of hashing and BSTs factor in the time required to process the bytes of the keys?

I have a basic question on the time complexity of basic operations when using hash tables as opposed to binary search trees (or balanced ones).
In basic algorithm courses, which are unfortunately the only kind I have studied, I learned that ideally the time complexity of look-up/insert using hash tables is O(1). For binary (search) trees, it is O(log(n)), where "n" is the number of input objects. So far, the hash table is the winner (I guess) in terms of asymptotic access time.
Now take "n" as the size of the data structure array, and "m" as the number of distinct input objects (values) to be stored in the DS.
For me, there is an ambiguity in the time complexity of data structure operations (e.g., lookup). Is it really possible to do hashing with a calculation/evaluation cost that is constant in "n"? Specifically, if we know we have "m" distinct values for the objects being stored, can the hash function still run in less than Omega(log(m)) time?
If not, then I would claim that the complexity for nontrivial applications has to be Omega(log(n)), since in practice "n" and "m" are not drastically different.
I can't see how such a function can be found. For example, take m = 2^O(k) to be the total number of distinct strings of "k" bytes each. A hash function has to go over all "k" bytes, and even if it takes only constant time to process each byte, the overall time needed to hash the input is Omega(k) = Omega(log(m)).
Having said that, for cases where the number of potential inputs is comparable to the size of the table, e.g., "m" is almost equal to "n", the hashing complexity does not look like constant time to me.
Your concern is valid, though I think there's a secondary point you're missing. If you factor the time required to look through all the bytes of the input into the time complexity of a BST, you would take the existing O(log n) and multiply it by the time required for each comparison, which is O(log m). You'd then get O(log n log m) time for searches in a BST.
Typically, the time complexities stated for BSTs and hash tables are not the real time complexities, but the number of "elementary operations" on the underlying data types. For example, a hash table does, in expectation, O(1) hashes and comparisons of the underlying data types. A BST will do O(log n) comparisons of the underlying data types. If those comparisons or hashes don't take O(1) time, then the time required to do the lookups won't be O(1) (for hash tables) or O(log n) (for BSTs).
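As a minimal illustration of counting elementary operations versus real time (hypothetical node layout, string keys assumed): a BST lookup does O(log n) comparisons, but each comparison may read the whole key.

class BstNode {
    String key;
    BstNode left, right;
}

class BstLookup {
    static boolean contains(BstNode root, String key) {
        BstNode node = root;
        while (node != null) {                    // O(log n) iterations if the tree is balanced
            int cmp = key.compareTo(node.key);    // each compareTo may cost O(key length)
            if (cmp == 0) return true;
            node = (cmp < 0) ? node.left : node.right;
        }
        return false;
    }
}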
In some cases, we make assumptions about how the machine works that let us conveniently ignore the time required to process the bits of the input. For example, suppose that we're hashing numbers between 0 and 2^k. If we assume a transdichotomous machine model, then by assumption each machine word is at least Ω(k) bits and we can perform operations on machine words in O(1) time. This means that we can hash k bits in time O(1) rather than time O(k), since we're assuming that the word size grows as a function of the problem size.
Hope this helps!
That's a fair point. If your container's keys are arbitrarily large objects, you need a different analysis. However, in the end the result will be roughly the same.
In classic algorithmic analysis, we usually just assume that certain operations (like incrementing a counter, or comparing two values) take constant time, and that certain objects (like integers) occupy constant space. These two assumptions go hand in hand, because when we say that an algorithm is O(f(N)), the N refers to "the 'size' of the problem", and if individual components of the problem have non-constant size, then the total size of the problem will have an additional non-constant multiplier.
More importantly, we generally make the assumption that we can index a contiguous array in constant time; this is the so-called "RAM" or "von Neumann" model, and it underlies most computational analysis in the last four decades or so (see here for a potted history).
For simple problems, like binary addition, it really doesn't matter whether we count the size of the objects as one object or k bits. In either case, the cost of doing a set of additions of size n is O(n), whether we're counting objects-of-a-constant-size or bits in variable-size objects. By the same token, the cost of a hash-table lookup (sketched in code after this list) consists of:
Compute the hash (time proportional to key size)
Find the hash bucket (assumed to be constant time since the hash is a fixed size)
Compare the target with each object in the bucket (time proportional to key size, assuming that the bucket length is constant)
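A rough sketch of those three steps for a chained hash table (hypothetical layout, not any real library's internals):

import java.util.List;

class ChainedLookup {
    static class Entry<V> {
        String key;
        V value;
    }

    static <V> V get(List<Entry<V>>[] buckets, String key) {
        // 1. Compute the hash: proportional to the key length
        //    (Java caches String.hashCode(), but it must be computed at least once).
        int h = key.hashCode();
        // 2. Find the bucket: constant time, since the hash is a fixed-size int.
        List<Entry<V>> bucket = buckets[Math.floorMod(h, buckets.length)];
        // 3. Compare the target with each entry in the bucket: each equals() is
        //    proportional to the key length, and the bucket is expected to be short.
        for (Entry<V> e : bucket) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}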
Similarly, we usually analyze the cost of a binary search by counting comparisons. If each object takes constant space, and each comparison takes constant time, then we can say that a problem of size N (which is n objects multiplied by some constant) can be solved with a binary search tree in log n comparisons. Again, the comparisons might take non-constant time, but then the problem size will also be multiplied by the same constant.
There is a lengthy discussion on a similar issue (sorting) in the comments in this blog post, also from the Computational Complexity blog, which you might well enjoy if you're looking for something beyond the basics.

Algorithm to find the top 3 occurring words in a book of 1000 pages [duplicate]

Algorithm to find the top 3 occurring words in a book of 1000 pages. Is there a better solution than using a hashtable?
A potentially better solution is to use a trie-based dictionary. With a trie, you can perform the task in worst-case O(n × N) time where N is the number of words and n is their average length. The difference with a hash table is that the complexity for a trie is independent of any hash function or the book's vocabulary.
There's no way to do better than O(n × N) for arbitrary input since you'll have to scan through all the words.
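A minimal sketch of the counting part with a trie (hypothetical node layout): inserting a word of length n costs O(n) regardless of how many distinct words are already stored, so counting N words is worst-case O(n × N).

import java.util.HashMap;
import java.util.Map;

class CountingTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int count;                               // occurrences of the word ending here
    }

    private final Node root = new Node();

    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.count++;
    }
}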
It is strange that everyone has concentrated on going through the word list and forgotten the main issue: taking the k most frequent items. A hash map is good enough to count occurrences, but that implementation still needs sorting, which is O(n*logn) at best.
So a hash map implementation needs one pass to count the words (expected, not guaranteed, O(n)) and O(n*logn) to sort the counts. The tries mentioned here may be a better solution for counting, but sorting is still the issue: again, one pass plus a sort.
What you actually need is a heap, i.e., a tree-based data structure that keeps the largest (or smallest) elements close to the root. Simple implementations of a heap (e.g., a binary heap) need O(logn) time to insert new elements and O(1) to get the highest element(s), so the resulting algorithm takes O(n*logn) and only one pass. More sophisticated implementations (e.g., a Fibonacci heap) take amortized O(1) time per insertion, so the resulting algorithm takes O(n) time, which is better than any of the other suggested solutions.
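One common variant of the heap idea, sketched below with java.util.PriorityQueue (the size-k min-heap is my own substitution for the full heap described above): counting is a single pass, and the heap step costs O(log k) per distinct word, which for k = 3 is effectively constant.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKWords {
    static List<String> topK(Iterable<String> words, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);        // one pass to count
        }
        // Min-heap ordered by count; evict the smallest once it exceeds k entries.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll();
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        return result;                               // least frequent of the top k comes first
    }
}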
You're going to have to go through all of the pages word by word to get an exact answer.
So a linked list implementation that also uses a hashtable to store pointers to the nodes of the linked list would do very well.
You need the linked list to grow dynamically and the hashtable to give quick access to the right node so you can update the count.
A simple approach is to use a Dictionary<string, int> (.NET) or Hashtable and count the occurrence of each word while scanning the whole book.
Wikipedia says this:
"For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case."
I would also have guessed a hash tree.
This can be solved with a complexity of n + lg(n) - 2, where n = 3 here.
http://www.seeingwithc.org/topic3html.html

What is big-O notation? How do you come up with figures like O(n)? [duplicate]

I'd imagine this is probably something taught in classes, but as a self-taught programmer, I've only seen it rarely.
I've gathered it is something to do with the time, and O(1) is the best, while stuff like O(n^n) is very bad, but could someone point me to a basic explanation of what it actually represents, and where these numbers come from?
Big O refers to the worst case run-time order. It is used to show how well an algorithm scales based on the size of the data set (n->number of items).
Since we are only concerned with the order, constant multipliers are ignored, and any terms which increase less quickly than the dominant term are also removed. Some examples (with a code sketch after this list):
A single operation or set of operations is O(1), since it takes some constant time (does not vary based on data set size).
A loop is O(n). Each element in the data set is looped over.
A nested loop is O(n^2). A nested nested loop is O(n^3), and onward.
Things like binary tree searching are O(log(n)), which is more difficult to show, but at every level in the tree the possible number of solutions is halved, so the number of levels is log(n) (provided the tree is balanced).
Something like finding the subset of numbers whose sum is closest to a given value is O(2^n) by brute force, since the sum of each of the 2^n subsets needs to be calculated. This is very bad.
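A few toy methods to make those orders concrete (illustrative only; the array a is assumed to have n elements):

class BigOExamples {
    static int constant(int[] a) {
        return a[0];                          // O(1): one operation, independent of n
    }

    static long linear(int[] a) {
        long sum = 0;
        for (int x : a) sum += x;             // O(n): touches each element once
        return sum;
    }

    static long quadratic(int[] a) {
        long pairs = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a.length; j++)
                pairs += (long) a[i] * a[j];  // O(n^2): nested loop over all pairs
        return pairs;
    }

    static int binarySearch(int[] sorted, int target) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {                    // O(log n): halves the range each step
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] == target) return mid;
            if (sorted[mid] < target) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }
}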
It's a way of expressing time complexity.
O(n) means that for n elements in a list, it takes on the order of n computations to process the list (say, one pass over it). Which isn't bad at all: each increase in n increases the time linearly.
O(n^n) is bad, because the amount of computation required will blow up (even faster than exponentially) as you increase n.
O(1) is the best, as it means a constant amount of computation regardless of input size; think of hash tables: looking up a value in a hash table has O(1) expected time complexity.
Big O notation as applied to an algorithm refers to how the run time of the algorithm depends on the amount of input data. For example, a sorting algorithm will take longer to sort a large data set than a small data set. If for the sorting algorithm example you graph the run time (vertical-axis) vs the number of values to sort (horizontal-axis), for numbers of values from zero to a large number, the nature of the line or curve that results will depend on the sorting algorithm used. Big O notation is a shorthand method for describing the line or curve.
In big O notation, the expression in the brackets is the function that is graphed. If a variable (say n) is included in the expression, this variable refers to the size of the input data set. You say O(1) is the best. This is true because the graph f(n) = 1 does not vary with n. An O(1) algorithm takes the same amount of time to complete regardless of the size of the input data set. By contrast, the run time of an O(n^2) algorithm increases with the square of the size of the input data set.
That is the basic idea, for a detailed explanation, consult the wikipedia page titled 'Big O Notation'.

Complexity in using Binary search and Trie

Given a large list of alphabetically sorted words in a file, I need to write a program that, given a word x, determines whether x is in the list. Preprocessing is OK since I will be calling this function many times over different inputs.
Priorities: 1. speed, 2. memory.
I already know I can use (n is the number of words, m is the average length of the words):
1. a trie: time is O(log(n)), space (best case) is O(log(nm)), space (worst case) is O(nm).
2. load the complete list into memory, then binary search: time is O(log(n)), space is O(n*m).
I'm not sure about the complexities for the trie, so please correct me if they are wrong. Also, are there other good approaches?
It is O(m) time for the trie, and up to O(m log(n)) for the binary search. The space is asymptotically O(nm) for any reasonable method, which you can probably reduce in some cases using compression. The trie structure is, in theory, somewhat better on memory, but in practice it has devils hiding in the implementation details: the memory needed to store pointers and potentially bad cache locality.
There are other options for implementing a set structure - hashset and treeset are easy choices in most languages. I'd go for the hash set as it is efficient and simple.
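For reference, a sketch of the binary-search option on a sorted String[] (illustrative only): O(log n) probes, each compareTo reading up to m characters, which is where the O(m log(n)) bound above comes from.

class SortedWordList {
    static boolean contains(String[] sortedWords, String x) {
        int lo = 0, hi = sortedWords.length - 1;
        while (lo <= hi) {                           // O(log n) probes
            int mid = (lo + hi) >>> 1;
            int cmp = x.compareTo(sortedWords[mid]); // up to O(m) per probe
            if (cmp == 0) return true;
            if (cmp < 0) hi = mid - 1; else lo = mid + 1;
        }
        return false;
    }
}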
I think a HashMap is perfectly fine for your case, since the time complexity for both put and get operations is expected O(1). It works fine even if you don't have a sorted list.
Preprocessing is ok since I will be calling this function many times over different inputs.
As food for thought, have you considered creating a set from the input data and then searching it by hash? It will take more time up front to build the set, but if the number of inputs is limited and you may return to them, then a set might be a good idea, with O(1) for the "contains" operation given a good hash function.
I'd recommend a hashmap. You can find an extension to C++ for this in both VC and GCC.
Use a bloom filter. It is space efficient even for very large data and it is a fast rejection technique.
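A minimal Bloom-filter sketch (toy hash functions, bit-array size picked arbitrarily): a "false" answer is definite, a "true" answer may be a false positive, and the space used is a fixed-size bit array rather than the full word list.

import java.util.BitSet;

class BloomFilter {
    private final BitSet bits;
    private final int size;

    BloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int hash1(String s) { return Math.floorMod(s.hashCode(), size); }

    private int hash2(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = 131 * h + s.charAt(i);
        return Math.floorMod(h, size);
    }

    void add(String s) {
        bits.set(hash1(s));
        bits.set(hash2(s));
    }

    // false = definitely absent; true = possibly present (false positives possible).
    boolean mightContain(String s) {
        return bits.get(hash1(s)) && bits.get(hash2(s));
    }
}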

Run time to insert n elements into an empty hash table

People say it takes amortized O(1) to put into a hash table. Therefore, putting n elements must be O(n). That's not true for large n, however, since as an answerer said, "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision."
So: what is the average running-time of inserting n elements into a hash table? I realize this is probably implementation-dependent, so mention what type of implementation you're talking about.
For example, if there are (log n) equally spaced collisions, and each collision takes O(k) to resolve, where k is the current size of the hashtable, then you'd have this recurrence relation:
T(n) = T(n/2) + n/2 + n/2
(that is, you take the time to insert n/2 elements, then you have a collision, taking n/2 to resolve, then you do the remaining n/2 inserts without a collision). This still ends up being O(n), so yay. But is this reasonable?
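A quick unrolling check of that recurrence: T(n) = T(n/2) + n = n + n/2 + n/4 + ... + O(1) <= 2n, which is indeed O(n).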
It completely depends on how inefficient your rehashing is. Specifically, if you can properly estimate the expected size of your hashtable the second time, your runtime still approaches O(n). Effectively, you have to specify how inefficient your rehash size calculation is before you can determine the expected order.
People say it takes amortized O(1) to put into a hash table.
From a theoretical standpoint, it is expected amortized O(1).
Hash tables are fundamentally a randomized data structure, in the same sense that quicksort is a randomized algorithm. You need to generate your hash functions with some randomness, or else there exist pathological inputs which are not O(1).
You can achieve expected amortized O(1) using dynamic perfect hashing:
The naive idea I originally posted was to rehash with a new random hash function on every collision (see also perfect hash functions). The problem with this is that it requires O(n^2) space, by the birthday paradox.
The solution is to have two hash tables, with the second table for collisions; resolve collisions on that second table by rebuilding it. That table will have O(sqrt(n)) elements, so it would grow to O(n) size.
In practice you often just use a fixed hash function because you can assume (or don't care if) your input is pathological, much like you often quicksort without prerandomizing the input.
All O(1) is saying is that the operation is performed in constant time, and it's not dependent on the number of elements in your data structure.
In simple words, this means that you'll have to pay the same cost no matter how big your data structure is.
In practical terms this means that simple data structures such as trees are generally more effective when you don't have to store a lot of data. In my experience I find trees faster up to ~1k elements (32-bit integers), then hash tables take over. But as usual, YMMV.
Why not just run a few tests on your system? If you post the source, we can test it on our systems and really shape this into a very useful discussion.
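In that spirit, here is a deliberately naive timing sketch one could adapt (no JMH, no warm-up control; treat any numbers as rough indications only):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class LookupBenchmark {
    public static void main(String[] args) {
        int n = 1_000_000;
        Random rng = new Random(42);
        Map<Integer, Integer> tree = new TreeMap<>();
        Map<Integer, Integer> hash = new HashMap<>();
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) {
            keys[i] = rng.nextInt();
            tree.put(keys[i], i);
            hash.put(keys[i], i);
        }
        long sink = 0;                       // prevents the lookups from being optimized away
        long t0 = System.nanoTime();
        for (int k : keys) sink += tree.get(k);
        long t1 = System.nanoTime();
        for (int k : keys) sink += hash.get(k);
        long t2 = System.nanoTime();
        System.out.printf("TreeMap: %d ms, HashMap: %d ms (sink=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}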
It is not just the implementation but the environment as well that decides how much time the algorithm actually takes. You can, however, look for available benchmark samples. The problem is that posting my results would be of little use, since people have no idea what else is running on my system, how much RAM is free right now, and so on. You can only ever get a broad idea, and that is about as much as big-O gives you.
