Complexity in using Binary search and Trie - algorithm

Given a large list of alphabetically sorted words in a file, I need to write a program that, given a word x, determines if x is in the list. Preprocessing is OK since I will be calling this function many times over different inputs.
Priorities: 1. speed, 2. memory.
I already know I can use (where n is the number of words and m is the average word length):
1. a trie: time is O(log(n)), space (best case) is O(log(nm)), space (worst case) is O(nm).
2. loading the complete list into memory, then binary search: time is O(log(n)), space is O(n*m).
I'm not sure about the complexities for the trie, so please correct me if they are wrong. Also, are there other good approaches?

It is O(m) time for the trie, and up to O(m log(n)) for the binary search, since each string comparison can cost O(m). The space is asymptotically O(nm) for any reasonable method, which you can probably reduce in some cases using compression. The trie structure is, in theory, somewhat better on memory, but in practice it has devils hiding in the implementation details: the memory needed to store pointers, and potentially bad cache access.
There are other options for implementing a set structure - a hash set and a tree set are easy choices in most languages. I'd go for the hash set, as it is efficient and simple.
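The hash-set approach above can be sketched as follows. This is a minimal illustration, not the asker's actual program; the in-memory word list stands in for the file mentioned in the question.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Load the word list once (preprocessing), then answer membership
// queries in expected O(1) time per lookup (O(m) counting the hash
// of the queried word itself).
class WordLookup {
    private final Set<String> words = new HashSet<>();

    WordLookup(List<String> wordList) {
        words.addAll(wordList); // one-time preprocessing, O(n*m) total
    }

    boolean contains(String x) {
        return words.contains(x); // expected constant number of bucket probes
    }

    public static void main(String[] args) {
        WordLookup lookup = new WordLookup(List.of("apple", "banana", "cherry"));
        System.out.println(lookup.contains("banana")); // true
        System.out.println(lookup.contains("durian")); // false
    }
}
```

Note that the sorted order of the input is simply ignored here; a hash set does not need it.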

I think a HashMap is perfectly fine for your case, since the time complexity for both put and get operations is O(1). It works perfectly well even if you don't have a sorted list.

> Preprocessing is ok since I will be calling this function many times over different inputs.
As food for thought, have you considered creating a set from the input data and then searching it by hash? It will take more time up front to build the set, but if the number of inputs is limited and you may return to them, a set is a good idea, with O(1) "contains" given a good hash function.

I'd recommend a hash map. You can find hash map extensions for C++ in both VC and GCC.

Use a Bloom filter. It is space-efficient even for very large data sets, and it is a fast rejection technique: it can quickly report that a word is definitely not in the list.
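Here is a minimal Bloom filter sketch; the bit-array size, number of hashes, and the double-hashing scheme are illustrative choices, not tuned values.

```java
import java.util.BitSet;

// A Bloom filter answers "definitely not present" or "possibly present".
// False positives are possible; false negatives are not.
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k probe indices from two base hashes (double hashing).
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = (h1 >>> 16) | 1; // force the second hash to be odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String s) {
        for (int i = 0; i < hashes; i++) bits.set(index(s, i));
    }

    // false = definitely absent; true = possibly present.
    boolean mightContain(String s) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(index(s, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomFilter f = new BloomFilter(1 << 16, 3);
        f.add("apple");
        System.out.println(f.mightContain("apple")); // true
    }
}
```

In practice you would size the filter from the expected element count and target false-positive rate; queries that pass the filter still need to be confirmed against the real word list.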

Confusion about Hash Map vs Trie time complexity

Let's say we're comparing the time complexity of the search operation in a hash map vs. a trie.
In many resources, the time complexities are described as:
Hashmap get: O(1)
vs
Trie search: O(k), where k is the number of characters in the string you want to search.
However, I find this a bit confusing. To me, this looks like the sample size "n" is defined differently in the two scenarios.
If we define n as the number of characters, and thus are interested in what's the complexity of this algorithm as the number of characters grow to infinity, wouldn't hashmap get also have a time complexity of O(k) due to its hash function?
On the other hand, if we define n as the number of words in the data structure, wouldn't the time complexity of Trie search also be O(1) since the search of the word doesn't depend on the number of words already stored in the Trie?
In the end, if we're doing an apples-to-apples comparison of time complexity, it looks to me like the time complexity of hash map get and trie search would be the same.
What am I missing here?
Yes, you are absolutely correct.
What you are missing is that statements about an algorithm's complexity can be based on whatever input terms you like. Outside of school, such statements are made to communicate, and you can make them to communicate whatever you want.
It's important to make sure that you are understood, though, so if there is a chance for confusion about how the n in O(n) is measured, or any assumed constraints on the input (like bounded string size), then you should just specify that explicitly.
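To make the O(k) trie bound concrete, here is a minimal trie sketch: search walks one node per character, so its cost depends only on the key's length, not on how many words are stored.

```java
import java.util.HashMap;
import java.util.Map;

// A trie keyed on characters. search(word) visits at most one node
// per character of the query, i.e. O(k) for a k-character word.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray())
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        cur.isWord = true; // mark the end of a complete word
    }

    boolean search(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children.get(c);
            if (cur == null) return false; // path breaks: not stored
        }
        return cur.isWord; // a prefix alone does not count
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        t.insert("car");
        t.insert("card");
        System.out.println(t.search("car"));  // true
        System.out.println(t.search("ca"));   // false (prefix, not a word)
    }
}
```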

Perfect List Structure?

Is it theoretically possible to have a data-structure that has
O(1) access, insertion, deletion times
and dynamic length?
I'm guessing one hasn't yet been invented, or we would entirely forgo the use of arrays and linked lists (separately) and instead opt to use one of these.
Is there a proof that this cannot happen? Is there some relationship between access time, insertion time, and deletion time (like a conservation of energy) suggesting that if one of the times becomes constant, another has to be linear, or something along those lines?
No such data structure exists on current architectures.
Informal reasoning:
To get better than O(n) time for insertion/deletion, you need a tree data structure of some sort
To get O(1) random access, you can't afford to traverse a tree
The best you can do is get O(log n) for all these operations. That's a fairly good compromise, and there are plenty of data structures that achieve this (e.g. a Skip List).
You can also get "close to O(1)" by using trees with high branching factors. For example, Clojure's persistent data structures use 32-way trees, which gives you O(log32 n) operations. For practical purposes, that's fairly close to O(1) (i.e. for realistic sizes of n that you are likely to encounter in real-world collections).
If you are willing to settle for amortized constant time, it is called a hash table.
The closest such data structure is a B+-tree, which can easily answer questions like "what is the kth item?", but performs the requisite operations in O(log(n)) time. Notably, iteration (and access to nearby elements), especially with a cursor implementation, can come very close to array speeds.
Throw in an extra factor C as our "block size" (which should be a multiple of a cache line), and we can get something like insertion time ~ log_C(n) + log_2(C) + C. For C = 256 and 32-bit integers, log_C(n) = 3 implies our structure is 64GB. Beyond this point you're probably looking for a hybrid data structure and are more worried about network cache effects than local ones.
Let's enumerate your requirements instead of mentioning a single possible data structure first.
Basically, you want constant operation time for...
Access
If you know exactly where the entity that you're looking for is, this is easily accomplished. A hashed value or an indexed location is something that can be used to uniquely identify entities, and provide constant access time. The chief drawback with this approach is that you will not be able to have truly identical entities placed into the same data structure.
Insertion
If you can insert at the very end of a list without having to traverse it, then you can achieve constant insertion time. The chief drawback with this approach is that you have to keep a reference pointing to the end of your list at all times, which must be updated on every insertion (which, in theory, should be a constant-time operation as well). If you decide to hash every value for fast access later, then there's a cost both for computing the hash and for adding it to some backing structure for quick indexing.
Deletion
The main principle here is that there can't be too many moving parts; I'm deleting from a fixed, well-defined location. Something like a Stack, Queue, or Deque can provide that for the most part, in that they're deleting only one element, either in LIFO or FIFO order. The chief drawback with this approach is that you can't scan the collection to find any elements in it, since that would take O(n) time. If you were going about the route of using a hash, you could probably do it in O(1) time at the cost of some multiple of O(n) storage space (for the hashes).
Dynamic Length
If you're chaining references, then that shouldn't be such a big deal; LinkedList already has an internal Node class. The chief drawback to this approach is that your memory is not infinite. If you were going the hashing route, then the more items you have to hash, the higher the probability of a collision (which takes you out of O(1) time and puts you into amortized O(1) time).
Given all this, there's really no single, perfect data structure that gives you absolutely constant runtime performance with dynamic length. I'm also unsure of any value that would be provided by writing a proof for such a thing, since the general use of data structures is to make use of their positives and live with their negatives (in the case of hashed collections: love the access time, no duplicates is an ouchie).
That said, if you are willing to live with some amortized performance, a set is likely your best option.

What is the performance (Big-O) for removeAll() of a treeset?

I'm taking a Java data structures course at the moment. One of my assignments asks me to choose a data structure of my choice and write a spell-checker program. I am in the process of checking the performance of the different data structures.
I went to the API docs for TreeSet, and this is what it says:
"This implementation provides guaranteed log(n) time cost for the basic operations (add, remove and contains)."
Would that include removeAll()?
How else could I figure this out?
Thank you in advance.
It would not include removeAll(), but I have to disagree with polkageist's answer. It is possible that removeAll() could be executed in constant time depending on the implementation, although it seems most likely that the execution would happen in linear time.
I think N log N would be the cost only if it were implemented in pretty much the worst way. If you are removing every element, there is no need to search for them: any element you have needs to be removed, so there's no need to search.
Nope. For an argument collection of size k, the worst-case upper bound of removeAll() is, of course, O(k log n), because each of the elements contained in the argument collection has to be removed from the tree set (which requires at least searching for it), and each of these searches costs log n.
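A small usage sketch of the O(k log n) case: each element of the argument collection triggers its own O(log n) removal (and an absent element still costs a failed lookup). One caveat worth knowing: the inherited AbstractSet.removeAll iterates whichever side is smaller, so if the TreeSet is smaller than the argument, it instead calls contains() on the argument collection, which can be slow for a List.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// removeAll with a small argument collection: one O(log n) remove
// per element of the argument.
class RemoveAllDemo {
    static Set<Integer> afterRemoval() {
        Set<Integer> set = new TreeSet<>(List.of(1, 2, 3, 4, 5));
        set.removeAll(List.of(2, 4, 6)); // 6 is absent: just a failed lookup
        return set;
    }

    public static void main(String[] args) {
        System.out.println(afterRemoval()); // [1, 3, 5]
    }
}
```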

Which search data structure works best for sorted integer data?

I have over a billion sorted integers. Which data structure do you think can exploit the sorted property? The main goal is to search items faster...
Options I can think of --
1) A regular binary search tree, built by recursively splitting at the middle.
2) Any other balanced binary search tree should work well, but it does not exploit the sorted input...
Thanks in advance...
[Edit]
Insertions and deletions are very rare...
Also, apart from the integers I have to store some other information in the nodes. I think plain arrays can't do that unless it's a list, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bit vector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, this will increase the locality of the binary searches and can improve performance as well.
This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. To handle the auxiliary data, you'd need to store some secondary structure (perhaps a hash table?) mapping each integer to its associated data. This isn't too bad, since you can use the sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
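A minimal sketch of the bit-vector idea for non-negative integers, using java.util.BitSet: membership, insertion, and deletion are O(1) bit operations, and nextSetBit gives you successor queries over the sorted universe. (At a billion-element scale you'd size and shard this more carefully; this just shows the operations.)

```java
import java.util.BitSet;

// A set of non-negative ints represented as one bit per possible value.
class BitVectorSet {
    private final BitSet bits = new BitSet();

    void add(int x) { bits.set(x); }                 // O(1)
    void remove(int x) { bits.clear(x); }            // O(1)
    boolean contains(int x) { return bits.get(x); }  // O(1)

    // Smallest stored value >= x, or -1 if none (a "successor" query).
    int successor(int x) { return bits.nextSetBit(x); }

    public static void main(String[] args) {
        BitVectorSet s = new BitVectorSet();
        s.add(7); s.add(10); s.add(42);
        System.out.println(s.contains(42));  // true
        System.out.println(s.successor(11)); // 42
    }
}
```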
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log n) time, which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
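The same standard-library route exists in Java via Arrays.binarySearch, which returns the index of the key if found and a negative value (encoding the insertion point) otherwise:

```java
import java.util.Arrays;

// Membership test over a sorted array using the standard library:
// Arrays.binarySearch returns >= 0 iff the key is present.
class SortedArraySearch {
    static boolean contains(int[] sorted, int key) {
        return Arrays.binarySearch(sorted, key) >= 0; // O(log n)
    }

    public static void main(String[] args) {
        int[] data = {2, 3, 5, 7, 11, 13};
        System.out.println(contains(data, 7)); // true
        System.out.println(contains(data, 8)); // false
    }
}
```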
With a sorted array, the best you can achieve is interpolation search, which gives you O(log(log(n))) average time. It is essentially a binary search, but it doesn't divide the array into two subarrays of the same size.
It's really fast and extraordinarily easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you, because with 1 billion integers it's practically impossible to hit.
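A minimal interpolation search sketch: instead of always probing the midpoint, the probe index is estimated by linear interpolation between the endpoint values, which is what yields the O(log log n) average on uniformly distributed data.

```java
// Interpolation search over a sorted int array.
// Returns the index of key, or -1 if absent.
class InterpolationSearch {
    static int find(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && key >= a[lo] && key <= a[hi]) {
            if (a[hi] == a[lo]) return a[lo] == key ? lo : -1;
            // Estimate the position from the key's value relative
            // to the current endpoints (use long to avoid overflow).
            long offset = (long) (key - a[lo]) * (hi - lo) / (a[hi] - a[lo]);
            int mid = lo + (int) offset;
            if (a[mid] == key) return mid;
            if (a[mid] < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = {10, 20, 30, 40, 50, 60};
        System.out.println(find(data, 40)); // 3
        System.out.println(find(data, 35)); // -1
    }
}
```

On uniform data the first probe often lands on or very near the target; on skewed data it degrades toward linear, which is the O(n) worst case mentioned above.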
O(1) solutions:
Assuming 32-bit integers and a lot of ram:
A lookup table of size 2³² (roughly 4 billion entries), where each index stores the count of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function is appropriate if you have a decent distribution of values; if not, you might want to combine the 32-bit strategy with a hash lookup.

Is a data structure implementation with O(1) search possible without using arrays?

I am currently taking a university course in data structures, and this topic has been bothering me for a while now (this is not a homework assignment, just a purely theoretical question).
Let's assume you want to implement a dictionary. The dictionary should, of course, have a search function, accepting a key and returning a value.
Right now, I can only imagine 2 very general methods of implementing such a thing:
Using some kind of search tree, which would (always?) give an O(log n) worst case running time for finding the value by the key, or,
Hashing the key, which essentially returns a natural number which corresponds to an index in an array of values, giving an O(1) worst case running time.
Is O(1) worst case running time possible for a search function, without the use of arrays?
Is random access available only through the use of arrays?
Is it possible through the use of a pointer-based data structure (such as linked lists, search trees, etc.)?
Is it possible when making some specific assumptions, for example, the keys being in some order?
In other words, can you think of an implementation (if one is possible) for the search function and the dictionary that will receive any key in the dictionary and return its value in O(1) time, without using arrays for random access?
Here's another answer I made on that general subject.
Essentially, algorithms reach their results by processing a certain number of bits of information. The length of time they take depends on how quickly they can do that.
A decision point having only 2 branches cannot process more than 1 bit of information. However, a decision point having n branches can process up to log(n) bits (base 2).
The only mechanism I'm aware of, in computers, that can process more than 1 bit of information, in a single operation, is indexing, whether it is indexing an array or executing a jump table (which is indexing an array).
It is not the use of an array that makes the lookup O(1); it's the fact that the lookup time does not depend on the size of the data storage. Hence any method that accesses data directly, without a search proportional in some way to the storage size, is O(1).
You could implement the dictionary with a trie. The complexity is O(max(length(string))); if your strings have bounded length, you can call that O(1), since it doesn't depend on the number of strings in the structure. http://en.wikipedia.org/wiki/Trie
