Please mention time complexity and best data structure to store these values, when values are:
Integers
Strings (dictionary like sorting)
I know Counting sort is preferred when integers are in a small range.
Thanks.
Edit:
Sorry, I asked the question a bit differently at first. The actual question is: what would be the best data structure to store these values if the integers are phone numbers (and the strings are names), and then what would be the best sorting algorithm?
Have a look at:
B-trees and red-black trees.
You should be able to find open source implementations of each of these. (Note, I'm assuming that you want to maintain a sorted structure, rather than just sorting once and forgetting.)
Sorting algorithms wiki link: Sorting Algorithm Wiki
Merge sort and quicksort are pretty good: merge sort is O(n log n) in every case, and quicksort is O(n log n) on average (its worst case is O(n²)).
How about a heap? Relatively easy to implement and pretty fast. For strings, you could use a Trie along with something like Burst sort which is supposedly the fastest string sorting algorithm in its class.
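If you go the heap route in C++, you don't even need to implement it by hand; here is a minimal heapsort sketch using the standard library's heap primitives (the sample values are made up):

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> values = {42, 7, 19, 3, 88, 23};   // sample data, made up

        // Build a max-heap in place, then repeatedly move the max to the back:
        // this is heapsort, O(n log n) worst case, with no extra memory needed.
        std::make_heap(values.begin(), values.end());
        std::sort_heap(values.begin(), values.end());       // leaves the range sorted ascending

        for (int v : values) std::cout << v << ' ';
        std::cout << '\n';
    }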
For most sorting algorithms there is an in-place version, so a simple array may be sufficient. For strings you may consider a trie (http://en.wikipedia.org/wiki/Trie), which could save space. The right sorting algorithm depends on a lot of factors, e.g. whether the input may already be sorted or partially sorted. Of course, if you have just a few distinct values, counting sort, bucket sort, etc. can be used.
On a 32-bit machine, a million integers can fit in an array of 4 million bytes. 4MB isn't all that much; it'll fit in this system's memory 500 times over (and it's not that beefy by modern standards). A million strings will be the same size, except for the storage space for those strings; for short strings it's still no problem, so slurp it all in. You can even have an array of pointers to structures holding an integer and a reference to a string; it will all fit just fine. It's only when you're dealing with much more data than that (e.g., a billion items) that you need to take special measures, data-structure-wise.
For sorting that many things, choose an algorithm that is O(n log n) instead of one that is O(n²). The O(n) algorithms are only useful when you've got particularly compact key spaces, which is pretty rare in practice. Choosing among the O(n log n) algorithms is a matter of balancing speed against other good properties such as stability.
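Since the question mentions phone numbers paired with names, here is a rough sketch of the simple approach under those assumptions (the Record type and its fields are placeholders I made up): store plain structs in an array and sort once with std::sort, an O(n log n) comparison sort.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical record type for the phone-number/name case from the question.
    struct Record {
        std::uint64_t phone;   // phone numbers can exceed 32 bits, so use 64-bit
        std::string   name;
    };

    int main() {
        std::vector<Record> records = {
            {15551234567ULL, "Alice"},
            {15550001111ULL, "Bob"},
            {15559998888ULL, "Carol"},
        };

        // Sort by phone number; std::sort is O(n log n) (introsort in practice).
        std::sort(records.begin(), records.end(),
                  [](const Record& a, const Record& b) { return a.phone < b.phone; });

        // For dictionary-like ordering of the names instead, compare the strings:
        // std::sort(records.begin(), records.end(),
        //           [](const Record& a, const Record& b) { return a.name < b.name; });
    }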
If you're doing this for real, use a database with appropriate indices instead of futzing around with doing this all by hand.
Related
Suppose we have 1 million entries of an object 'Person' with two fields 'Name', 'Age'. The problem was to sort the entries based on the 'Age' of the person.
I was asked this question in an interview. I answered that we could use an array to store the objects and use quicksort, as that would save us from using additional space, but the interviewer told me that memory was not a factor.
My question is what would be the factor that would decide which sort to use?
Also what would be the preferred way to store this?
In this scenario does any sorting algorithm have an advantage over another sorting algorithm and would result in a better complexity?
This Stack Overflow link may be useful to you. The answers above are sufficient, but I would like to add some more information from that link; I am copying parts of its answers here.
We should note that even if the fields in the object are very big (e.g. long names), you do not need a file-system sort; you can use an in-memory sort, because:
#elements * 8 bytes ≈ 762 MB (most modern systems have enough memory for that)
where the 8 bytes is the key (age) plus a pointer to the struct on a 32-bit system.
It is important to minimize disk accesses, because disks are not random access and disk accesses are MUCH slower than RAM accesses.
Now, use a sort of your choice on that - and avoid using disk for the sorting process.
Some possibilities of sorts (on RAM) for this case are:
Standard quicksort or merge-sort (Which you had already thought of)
Bucket sort can also be applied here, since the range is limited to [0, 150] (which others have specified here under the name counting sort)
Radix sort (for the same reason: radix sort will need ceil(log_2(150)) ≈ 8 iterations)
I wanted to point out the memory aspect in case you encounter the same question but need to answer it with memory constraints in mind. In fact your constraints are even smaller (10^6 elements compared to the 10^8 in the other question).
As for the matter of storing it -
The quickest way to sort it would be to allocate 151 linked lists/vectors (call them buckets, or whatever you prefer depending on the language) and put each person's data structure into the bucket corresponding to his/her age (all ages are between 0 and 150):
bucket[person->age].add(person)
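A minimal C++ sketch of that idea (the Person type is a placeholder; the 0-150 age range comes from the text above):

    #include <string>
    #include <vector>

    struct Person {                 // hypothetical record from the question
        std::string name;
        int age;                    // assumed to be in [0, 150]
    };

    std::vector<Person> bucketSortByAge(const std::vector<Person>& people) {
        std::vector<std::vector<Person>> buckets(151);    // one bucket per possible age
        for (const Person& p : people)
            buckets[p.age].push_back(p);                  // bucket[person->age].add(person)

        std::vector<Person> sorted;
        sorted.reserve(people.size());
        for (const auto& bucket : buckets)                // concatenate buckets in age order
            sorted.insert(sorted.end(), bucket.begin(), bucket.end());
        return sorted;
    }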
As others have pointed out Bucket Sort is going to be the better option for you.
In fact, the beauty of bucket sort is that if you have to perform operations on ranges of ages (say, 10 to 50 years of age), you can partition your buckets according to your requirements (e.g. a different age range per bucket).
Again, I have copied this information from the answers in the link given above, but I believe it might be useful to you.
If the array has n elements, then quicksort (or, actually, any comparison-based sort) is Ω(n log(n)).
Here, though, it looks like you have an alternative to comparison-based sorting, since you need to sort only on age. Suppose there are m distinct ages. In this case, counting sort will be Θ(m + n). For the specifics of your question, assuming that age is in years, m is much smaller than n, and you can do this in linear time.
The implementation is trivial. Simply create an array of, say, 200 entries (200 being an upper bound on the age). The array is of linked lists. Scan over the people, and place each person in the linked list in the appropriate entry. Now, just concatenate the lists according to the positions in the array.
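The same idea can also be written as the textbook array-based counting sort instead of linked lists (the Person type is again a placeholder; 200 is the age bound used above). Counting the ages and turning the counts into starting offsets keeps it Θ(m + n) and stable:

    #include <string>
    #include <vector>

    struct Person {
        std::string name;
        int age;                                           // assumed in [0, 200]
    };

    std::vector<Person> countingSortByAge(const std::vector<Person>& people) {
        const int maxAge = 200;                            // upper bound from the answer above
        std::vector<int> count(maxAge + 1, 0);
        for (const Person& p : people) ++count[p.age];     // histogram of ages

        // Prefix sums: start[a] is the index where the first person of age a goes.
        std::vector<int> start(maxAge + 1, 0);
        for (int a = 1; a <= maxAge; ++a)
            start[a] = start[a - 1] + count[a - 1];

        std::vector<Person> sorted(people.size());
        for (const Person& p : people)                     // stable: equal ages keep input order
            sorted[start[p.age]++] = p;
        return sorted;
    }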
Different sorting algorithms perform at different complexities, yes. Some use different amounts of space. And in practice, real performance with the same complexity varies too.
http://www.cprogramming.com/tutorial/computersciencetheory/sortcomp.html
There are different ways to set up quicksort's partition method that could have an effect for data like ages. Shell sort can use different gap sequences that perform better for certain types of input. But maybe your interviewer was more interested in you thinking about 1 million people having a lot of duplicate ages, which might mean you want a 3-way quicksort or, as suggested in the comments, a counting sort.
This is an interview question, so I guess the interviewee's reasoning is more important than naming the correct sorting algorithm. Your problem is sorting an array of objects whose age field is an integer. Age has some special properties:
integer: there are sorting algorithms specially designed for integers.
finite: you know the maximum age of a person, right? Say, 200.
I will list some sorting algorithms for this problem, with advantages and disadvantages, that are suitable for an interview session:
Quicksort: average complexity is O(N log N) and it can be applied to any data set. Quicksort is typically the fastest comparison-based sort in practice. Its biggest disadvantage is that it isn't stable: two objects with equal ages don't keep their relative order after sorting.
Merge sort: complexity is O(N log N). It is a little slower than quicksort in practice, but it is a stable sort, and it can also be applied to any data set.
Radix sort: complexity is O(w*n), where n is the size of your list and w is the maximum number of digits in your dataset. For example, the length of 12 is 2 and the length of 154 is 3. So if the maximum age is 99, the complexity is O(2*n). This algorithm applies only to integers or strings (a sketch appears after this list).
Counting sort: complexity is O(m + n), where n is the size of your list and m is the number of distinct ages. This algorithm applies only to integers.
Because we are sorting a million entries and all values are integers in the range 0..200, there will be a ton of duplicate values. So counting sort is the best fit, with complexity O(200 + N), where N ≈ 1,000,000 and 200 is negligible.
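For reference, here is a minimal sketch of the LSD radix sort mentioned in the list above, for non-negative integers in base 10, using one stable counting pass per digit:

    #include <algorithm>
    #include <vector>

    // LSD radix sort for non-negative integers: one stable counting pass per decimal digit.
    void radixSort(std::vector<int>& a) {
        if (a.empty()) return;
        int maxVal = *std::max_element(a.begin(), a.end());
        std::vector<int> out(a.size());

        for (long long exp = 1; maxVal / exp > 0; exp *= 10) {
            int count[10] = {0};
            for (int x : a) ++count[(x / exp) % 10];       // histogram of this digit
            for (int d = 1; d < 10; ++d) count[d] += count[d - 1];
            for (int i = static_cast<int>(a.size()) - 1; i >= 0; --i)   // backwards keeps it stable
                out[--count[(a[i] / exp) % 10]] = a[i];
            a.swap(out);
        }
    }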
If you assume that there is a finite number of different age values (usually people are not older than 100), then you could use counting sort (https://en.wikipedia.org/wiki/Counting_sort). You would be able to sort in linear time.
I have many lists of sorted integers, all of them less than 3600 items each. I'd like to keep them in memory as much as possible, so I'm looking for a space efficient datastructure.
Most common operations will be inserts, membership testing and range queries.
The integers will mostly be in the 1 through 10 billion range, though in theory there could be some corner cases where the integers are much lower.
I've been looking at skiplists, which are quite nice, but I feel there might be more efficient structures out there.
This really depends on the access pattern and the proportion of lookups with respect to modifications. When lookups are much more common than modifications (in your case, inserts apparently), which is quite common, you can actually get away with sorted arrays which will give you optimal memory efficiency.
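A sketch of that sorted-array approach, assuming 64-bit values since your integers go up to about 10 billion: std::lower_bound gives O(log n) membership tests and range queries, while inserts stay O(n) because of the shifting.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sorted array of 64-bit integers (values up to ~10 billion don't fit in 32 bits).
    struct SortedSet {
        std::vector<std::uint64_t> data;                        // kept sorted at all times

        void insert(std::uint64_t x) {                          // O(n) worst case due to the shift
            auto it = std::lower_bound(data.begin(), data.end(), x);
            if (it == data.end() || *it != x) data.insert(it, x);
        }

        bool contains(std::uint64_t x) const {                  // O(log n)
            return std::binary_search(data.begin(), data.end(), x);
        }

        // Number of stored values in [lo, hi], also O(log n).
        std::size_t countInRange(std::uint64_t lo, std::uint64_t hi) const {
            auto first = std::lower_bound(data.begin(), data.end(), lo);
            auto last  = std::upper_bound(data.begin(), data.end(), hi);
            return static_cast<std::size_t>(last - first);
        }
    };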
If the inserts are actually more common, sorted arrays probably won't do, and you will have to resort to more complicated data structures. B-trees sound like a possible candidate given that they pack many nodes together, and thus do not suffer from the linkage overhead as much as AVLs, skip-lists or red-black trees.
I think it would be similarly interesting to investigate radix trees, especially if there happens to be a lot of successive integers in your lists, because such ranges would get "compressed" by the radix tree.
It is worth noting that a bloom filter can help further optimize your membership queries. They are, in a way, the most space-efficient data structures for membership queries, but being probabilistic, you can only use them in combination with some other deterministic data structure, unless of course you are allowed to return incorrect answers :-).
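For illustration only, here is a toy Bloom filter you could layer in front of whichever exact structure you keep (the bit-array size and the two-hash scheme are arbitrary, not tuned):

    #include <bitset>
    #include <cstdint>
    #include <functional>

    // Tiny illustrative Bloom filter: can return false positives, never false negatives.
    class BloomFilter {
        static constexpr std::size_t kBits = 1 << 16;       // 65536 bits; illustrative size only
        std::bitset<kBits> bits_;

        static std::size_t h1(std::uint64_t x) { return std::hash<std::uint64_t>{}(x) % kBits; }
        static std::size_t h2(std::uint64_t x) { return std::hash<std::uint64_t>{}(x ^ 0x9e3779b97f4a7c15ULL) % kBits; }

    public:
        void add(std::uint64_t x) { bits_.set(h1(x)); bits_.set(h2(x)); }

        // false means "definitely not present"; true means "probably present".
        bool mightContain(std::uint64_t x) const { return bits_.test(h1(x)) && bits_.test(h2(x)); }
    };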
I was curious to know how to select a sorting algorithm based on the input, so that I can get the best efficiency.
Should it be based on the size of the input, how the input is arranged (ascending/descending), the data structure used, etc.?
The things that matter for algorithms in general, and for sorting algorithms as well, are the following:
(*) Correctness - This is the most important thing. It is worth nothing if your algorithm is super fast and efficient but wrong. In sorting, even if you have two candidates that both sort correctly but you need a stable sort, you will choose the stable sort algorithm, even if it is less efficient, because it is correct for your purpose and the other is not.
Next come the trade-offs between running time, needed space and implementation time (if you would need to implement something from scratch rather than use a library, for a minor performance enhancement, it probably isn't worth it).
Some things to take into consideration when thinking about the trade off mentioned above:
Size of the input (for example: for small inputs, insertion sort is empirically faster than more advanced algorithms, though it takes O(n^2)).
Location of the input (sorting algorithms on disk are different from algorithms on RAM, because disk reads are much less efficient when not sequential. The algorithm which is usually used to sort on disk is a variation of merge-sort).
How is the data distributed? If the data is likely to be "almost sorted", even a usually terrible bubble sort can finish it in just 2-3 passes and be super fast compared to other algorithms (a short sketch of this appears after this list).
What libraries do you have already implemented? How much work will it take to implement something new? Will it be worth it?
Type (and range) of the input - for enumerable data (integers, for example) an integer-specific algorithm (like radix sort) might be more efficient than a general-purpose algorithm.
Latency requirements - if you are designing a missile head and the result must be returned within a specific amount of time, quicksort, which might decay to quadratic running time in the worst case, might not be a good choice, and you might want to use a different algorithm with a strict O(n log n) worst case instead.
Your hardware - if, for example, you are using a huge cluster and have huge data, a distributed sorting algorithm will probably be better than trying to do all the work on one machine.
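To make the "almost sorted" point above concrete, here is a bubble sort with an early-exit flag: its general complexity is still O(n²), but on nearly sorted input it stops after a pass or two, because a pass with no swaps means the array is already sorted.

    #include <utility>
    #include <vector>

    // Bubble sort with early exit: O(n^2) in general, but close to O(n) on almost-sorted input.
    void bubbleSort(std::vector<int>& a) {
        if (a.empty()) return;
        bool swapped = true;
        std::size_t n = a.size();
        while (swapped) {
            swapped = false;
            for (std::size_t i = 1; i < n; ++i) {
                if (a[i - 1] > a[i]) {
                    std::swap(a[i - 1], a[i]);
                    swapped = true;
                }
            }
            --n;                       // the largest remaining element has bubbled to the end
        }
    }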
It should be based on all those things.
You need to take into account the size of your data, as insertion sort can be faster than quicksort for small data sets, etc.
You need to know the arrangement of your data, because the algorithms have differing worst/average/best-case asymptotic runtimes (some have the same worst and average cases, whereas others have a worst case significantly worse than their average).
And you obviously need to know the data structure used, as there are some very specialized sorting algorithms if your data is already in a special format, or you may even be able to put it efficiently into a new data structure that automatically does the sorting for you (a la BSTs or heaps).
The 2 main things that determine your choice of a sorting algorithm are time complexity and space complexity. Depending on your scenario, and the resources (time and memory) available to you, you might need to choose between sorting algorithms, based on what each sorting algorithm has to offer.
The actual performance of a sorting algorithm depends on the input data too, and it helps if we know certain characteristics of the input data beforehand, such as the size of the input and how sorted the array already is.
For example,
If you know beforehand that the input data consists of, say, 1000 non-negative integers drawn from a small known range, you can very well use counting sort to sort such an array in linear time.
The choice of a sorting algorithm depends on the constraints of space and time, and also the size/characteristics of the input data.
At a very high level you need to consider the ratio of insertions vs compares with each algorithm.
For integers in a file, this isn't going to be hugely relevant but if say you're sorting files based on contents, you'll naturally want to do as few comparisons as possible.
I have over a billion sorted integers; which data structure do you think can exploit the sorted property? The main goal is to search items faster...
Options I can think of --
1) A regular binary search tree built by recursively splitting in the middle.
2) Any other balanced binary search tree should work well, but it does not exploit the sorted property.
Thanks in advance..
[Edit]
Insertions and deletions are very rare...
Also, apart from the integers I have to store some other information in the nodes; I think plain arrays can't do that unless it is a list, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bit vector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't that much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, this will increase the locality of the binary searches and can improve performance as well. This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. In order to do this, you'd need to store some secondary structure (perhaps a hash table?) containing the data associated with each of the integers. This isn't too bad, since you can use this sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log U) time (U being the size of the integer universe), which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
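In C++ the equivalent of bsearch is std::binary_search / std::lower_bound over the sorted array; a minimal sketch (the values are a tiny stand-in for your billion integers):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        // The already-sorted data from the question (tiny stand-in for a billion integers).
        std::vector<std::int64_t> sorted = {3, 8, 15, 42, 99, 127};

        std::int64_t target = 42;
        auto it = std::lower_bound(sorted.begin(), sorted.end(), target);   // O(log n)
        if (it != sorted.end() && *it == target)
            std::cout << "found at index " << (it - sorted.begin()) << '\n';
        else
            std::cout << "not present\n";
    }

Any per-value payload can live in a parallel array at the same index, so the sorted integers themselves stay compact.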
With a sorted array, the best you can achieve is with an interpolation search, which gives you O(log log n) average time. It is essentially a binary search, but it doesn't divide the array into two subarrays of the same size.
It's really fast and extraordinarily easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you, because with 1 billion integers it's practically impossible to hit.
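A sketch of interpolation search over a sorted array of 64-bit integers (it assumes reasonably uniform keys; on skewed data it degrades toward the O(n) case mentioned above):

    #include <cstdint>
    #include <vector>

    // Returns the index of target in the sorted vector, or -1 if it is absent.
    // Roughly O(log log n) on uniformly distributed keys, O(n) in the worst case.
    long long interpolationSearch(const std::vector<std::int64_t>& a, std::int64_t target) {
        long long lo = 0, hi = static_cast<long long>(a.size()) - 1;
        while (lo <= hi && target >= a[lo] && target <= a[hi]) {
            if (a[hi] == a[lo])                        // all remaining keys equal: avoid dividing by zero
                return (a[lo] == target) ? lo : -1;
            // Estimate the position by linear interpolation between a[lo] and a[hi]
            // (done in double to avoid 64-bit overflow in the intermediate product).
            double frac = double(target - a[lo]) / double(a[hi] - a[lo]);
            long long pos = lo + static_cast<long long>(frac * double(hi - lo));
            if (a[pos] == target) return pos;
            if (a[pos] < target)  lo = pos + 1;
            else                  hi = pos - 1;
        }
        return -1;
    }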
O(1) solutions:
Assuming 32-bit integers and a lot of ram:
A lookup table of roughly 2³² entries (about 4 billion), where each index holds the count of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function would be appropriate if you have a decent distribution of the values; if not, you might want to combine the 32-bit strategy with a hash lookup.
I'm building a symbol table for a project I'm working on. I was wondering what people's opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.
I've done a fair bit of searching and the most commonly recommended are binary trees, linked lists, or hash tables. What are the advantages and/or disadvantages of each of the above? (I'm working in C++.)
The standard trade offs between these data structures apply.
Binary Trees
medium complexity to implement (assuming you can't get them from a library)
inserts are O(logN)
lookups are O(logN)
Linked lists (unsorted)
low complexity to implement
inserts are O(1)
lookups are O(N)
Hash tables
high complexity to implement
inserts are O(1) on average
lookups are O(1) on average
Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".
Therefore you need to use an algorithm that is fast for looking up the information that you need.
I'd therefore think the HashTable was the most suitable algorithm to use, as it is simply generating a hash of your key object and using that to access the target data - it is O(1). The others are O(N) (Linked Lists of size N - you have to iterate through the list one at a time, an average of N/2 times) and O(log N) (Binary Tree - you halve the search space with each iteration - only if the tree is balanced, so this depends on your implementation, an unbalanced tree can have significantly worse performance).
Just make sure that there are enough spaces (buckets) in the hash table for your data (re: Soraz's comment on this post). Most framework implementations (Java, .NET, etc.) are of a quality such that you won't need to worry about the implementation.
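In C++ that usually just means std::unordered_map; a minimal symbol-table sketch built on it (the SymbolInfo payload is a placeholder I made up):

    #include <string>
    #include <unordered_map>

    // Placeholder payload; a real symbol table would store type, scope, address, etc.
    struct SymbolInfo {
        std::string type;
        int         line = 0;
    };

    int main() {
        std::unordered_map<std::string, SymbolInfo> symbols;
        symbols.reserve(1024);                        // pre-size to limit rehashing/collisions

        symbols["main"]    = {"function", 1};         // inserts: O(1) on average
        symbols["counter"] = {"int", 7};

        auto it = symbols.find("counter");            // lookups: O(1) on average
        if (it != symbols.end()) {
            // it->second.type == "int", it->second.line == 7
        }
    }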
Did you do a course on data structures and algorithms at university?
What everybody seems to forget is that for small N, i.e. few symbols in your table, a linked list can be much faster than a hash table, although in theory its asymptotic complexity is indeed higher.
There is a famous quote from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html
I can't tell from your post whether you will be dealing with a small N or not, but always remember that the best algorithms for large N are not necessarily good for small N.
It sounds like the following may all be true:
Your keys are strings.
Inserts are done once.
Lookups are done frequently.
The number of key-value pairs is relatively small (say, fewer than a K or so).
If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table and O(log2 N) for a balanced binary tree. But lookups in a sorted list may be faster than in any of these other structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.
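A rough sketch of that build-then-sort-once pattern in C++ (the keys and payload are made up): append everything, sort once, then use std::lower_bound for lookups over the contiguous array.

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Build phase: append everything unsorted (O(1) per insert), then sort once at the end.
        std::vector<std::pair<std::string, int>> table;     // key -> placeholder payload
        table.push_back({"printf", 1});
        table.push_back({"counter", 2});
        table.push_back({"main", 3});
        std::sort(table.begin(), table.end());              // one O(N log N) sort after all inserts

        // Lookup phase: binary search over a contiguous, cache-friendly array.
        std::string key = "main";
        auto it = std::lower_bound(table.begin(), table.end(), key,
                                   [](const std::pair<std::string, int>& entry, const std::string& k) {
                                       return entry.first < k;
                                   });
        bool found = (it != table.end() && it->first == key);
        (void)found;                                         // the payload would be it->second
    }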
Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated[1] node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.
As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:
A hash table can degenerate to O(N) in the worst case.
The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
As in linked lists and binary trees, each entry is a node storing more than just key and value, also separately-allocated in some implementations, so you use more memory and increase chances of a cache miss.
Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.
[1] It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
I like Bill's answer, but it doesn't really synthesize things.
From the three choices:
Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.
Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.
That leaves trees. You have an option here though: To balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (eg: C++'s STL), then you might as well go ahead and use the balanced tree.
A couple of things to watch out for.
Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.
Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.
I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.
Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.
For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.
For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.
But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!
This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.
This question goes through the different containers in C#, but they are similar in any language you use.
Unless you expect your symbol table to be small, I should steer clear of linked lists. A list of 1000 items will on average take 500 iterations to find any item within it.
A binary tree can be much faster, so long as it's balanced. If you're persisting the contents, the serialised form will likely be sorted, and when it's re-loaded, the resulting tree will be wholly un-balanced as a consequence, and it'll behave the same as the linked list - because that's basically what it has become. Balanced tree algorithms solve this matter, but make the whole shebang more complex.
A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.