How to sort (million/billion/...) integers? - algorithm

Sometimes interviewers ask how to sort million/billion 32-bit integers (e.g. here and here). I guess they expect the candidates to compare O(NLog(N)) sort with radix sort. For million integers O(NLog(N)) sort is probably better but for billion they are probably the same. Does it make sense ?

If you get a question like this, they are not looking for the answer. What they are trying to do is see how you think through a problem. Do you jump right in, or do you ask questions about the project requirements?
One question you had better ask is, "How optimal of solution does the problem require?" Maybe a bubble sort of records stored in a file is good enough, but you have to ask. Ask questions about what if the input changes to 64 bit numbers, should the sort process be easily updated? Ask how long does the programmer have to develop the program.
Those types of questions show me that the candidate is wise enough to see there is more to the problem than just sorting numbers.

I expect they're looking for you to expand on the difference between internal sorting and external sorting. Apparently people don't read Knuth nowadays

As aaaa bbbb said, it depends on the situation. You would ask questions about the project requirements. For example, if they want to count the ages of the employees, you probably use the Counting sort, I can sort the data in the memory. But when the data are totally random, you probably use the external sorting. For example, you can divide the data of the source file into the different files, every file has a unique range(File1 is from 0-1m, File2 is from 1m+1 - 2m , ect ), then you sort every single file, and lastly merge them into a new file.

Use bit map. You need some 500 Mb to represent whole 32-bit integer range. For every integer in given array just set coresponding bit. Then simply scan your bit map from left to right and get your integer array sorted.

It depends on the data structure they're stored in. Radix sort beats N-log-N sort on fairly small problem sizes if the input is in a linked list, because it doesn't need to allocate any scratch memory, and if you can afford to allocate a scratch buffer the size of the input at the beginning of the sort, the same is true for arrays. It's really only the wrong choice (for integer keys) when you have very limited additional storage space and your input is in an array.
I would expect the crossover point to be well below a million regardless.

Related

Selection Sort vs. Insertion Sort

I'm just wondering, would the speed of selection sort change depending on the TYPE of data? So like would the speed change depending on if it were words or numbers? why or why not?. Also, would the speed of insertion sort change depending on the TYPE of data?
I personally don't think they do change the speed of the sorting but I'm not 100% sure.
A comparison-based sort algorithm needs to know, among other things:
how to tell if element a is "smaller" than element b or not
how to swap two elements in the array,
ie. moving one element to the position of another and vice-versa.
Other than this two things, no part of the algorithm depends on the actual type and/or content of the data elements. The only possible speed difference is in this parts.
Yes, comparing eg. strings is usually slower than comparing integer, and the longer the strings get, the more noticable the difference will be. The same is valid for swapping, altough there are some C++ things to remedy that (depending on the implementation of the data type, swapping could be done by copying everything (slow) or just by swapping eg. a pointer (faster))

Find common words from two files

Given two files containing list of words(around million), We need to find out the words that are in common.
Use Some efficient algorithm, also not enough memory availble(1 million, certainly not).. Some basic C Programming code, if possible, would help.
The files are not sorted.. We can use some sort of algorithm... Please support it with basic code...
Sorting the external file...... with minimum memory available,, how can it be implement with C programming.
Anybody game for external sorting of a file... Please share some code for this.
Yet another approach.
General. first, notice that doing this sequentially takes O(N^2). With N=1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2N*log(N) + 2N) = O(N*log(N)).
Sorting a file. Now let's address the fact that working with files is much slower than with memory, especially when sorting where you need to move things around. One way to solve this is - decide the size of the chunk that can be loaded into memory. Load the file one chunk at a time, sort it efficiently and save into a separate temporary file. The sorted chunks can be merged (again, see below) into one sorted file in one pass.
Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry in each list. In each step, compare the values the pointers point to. Move the smaller value to the merged list (the one you are constructing) and advance its pointer.
You can modify the merge algorithm easily to make it find the intersection - if pointed values are equal move it to the results (consider how do you want to deal with duplicates).
For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm for using k pointers.
If you had enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word ), loop over the words of the second file and test if the word is contained in that dictionary. Memory for a million words is not much today.
If you have not enough memory, split the first file into chunks that fit into memory and do as I said above for each of that chunk. For example, fill the dictionary with the first 100.000 words, find every common word for that, then read the file a second time extracting word 100.001 up to 200.000, find the common words for that part, and so on.
And now the hard part: you need a dictionary structure, and you said "basic C". When you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors. In basic C, you should also try to use a ready-made library for that, read this SO post to find a link to a free library which seems to support that.
Your problem is: Given two sets of items, find the intersaction (items common to both), while staying within the constraints of inadequate RAM (less than the size of any set).
Since finding an intersaction requires comparing/searching each item in another set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.
Assume that you know for a fact that the intersaction is much smaller than both sets and fits completely inside available memory -- otherwise you'll have to do further work in flushing the results to disk.
If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts the fit the second 1/3. The remaining 1/3 memory is used to store the results.
Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.
First find the intersaction of both partitions through a double-loop, storing common items to the results set and removing them from the original sets to save on comparisons further down the loop.
Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.
After running through the entire smaller set, repeat with the next partition of the larger set.
Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.
I would give prefix trees (aka tries) a shot.
My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.
Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.
Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory -- and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long -- then you have to process even the first file by parts. In that case, I would actually recommend using the trie for the larger file, not the smaller. Just partition it in parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.
Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.
If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in python, but should be relatively easy to implement in any language.
with open(file1) as file_1:
current_word_1 = read_to_delim(file_1, delim)
while current_word_1:
with open(file2) as file_2:
current_word_2 = read_to_delim(file_2, delim)
while current_word_2:
if current_word_2 == current_word_1:
print current_word_2
current_word_2 = read_to_delim(file_2, delim)
current_word_1 = read_to_delim(file_1, delim)
I leave read_to_delim to you, but this is the extreme case that is memory-optimal but time-least-optimal.
depending on your application of course you could load the two files in a database, perform a left outer join, and discard the rows for which one of the two columns is null

Choosing a Data structure for very large data

I have x (millions) positive integers, where their values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup intensive program.
So far i thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However am not to sure whether i can implement such large keys and in such large quantity with a hash table (wouldn't that create a >0.8 load factor in addition to be prone for collisions?)
Could i get some advise on which data structure might be suitable for my situation
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find nearest and similar operations, but they're an annoyingly amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n) so if you choose m to be around 8-10 or so you should be able to keep your search depth to below 10.
Bit Vector , with the index set if the number is present. You can tweak it to have the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
If memory isn't an issue a map is probably your best bet. Maps are O(1) meaning that as you scale up the number of items to be looked up the time is takes to find a value is the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-bytes records plus some constant overhead and a little slack to avoid being 100%. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
edit: just saw in your question you talk about some 'mapped data (a name)'. are those names unique? do they also have to be in memory? if yes, they would definitely dominate the memory requirements. Even so, if the names are the typical english words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.

I was asked this in a recent interview

I was asked to stay away from HashMap or any sort of Hashing.
The question went something like this -
Lets say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?
Why is using Maps a bad idea for such a scenario?
What changes would you make to sell your solution to Amazon?
A map is good to use when insert/remove/lookup operations are interleaved. Every operations are amortized in O(log n).
In your exemple you are only doing search operation. You may consider that any database update (inserting/removing a product) won't happen so much time. Therefore probably the interviewer want you to get the best data structure for lookup operations.
In this case I can see only some as already proposed in other answers:
Sorted array (doing a binary search)
Hasmap
trie
With a trie , if product ids do not share a common prefix, there is good chance to find the product description only looking at the first character of the prefix (or only the very first characters). For instance, let's take that product id list , with 125 products:
"1"
"2"
"3"
...
"123"
"124"
"1234567"
Let's assume you are looking for the product id titled "1234567" in your trie, only looking to the first letters: "1" then "2" then "3" then "4" will lead to the good product description. No need to read the remaining of the product id as there is no other possibilities.
Considering the product id length as n , your lookup will be in O(n). But as in the exemple explained it above it could be even faster to retreive the product description. As the procduct ID is limited in size (20 characters) the trie height will be limited to 20 levels. That actually means you can consider the look up operations will never goes beyond a constant time, as your search will never goes beyong the trie height => O(1). While any BST lookups are at best amortized O(log N), N being the number of items in your tree .
While an hashmap could lead you to slower lookup as you'll need to compute an index with an hash function that is probably implemented reading the whole product id length. Plus browsing a list in case of collision with other product ids.
Doing a binary search on a sorted array, and performance in lookup operations will depends on the number of items in your database.
A B-Tree in my opinion. Does that still count as a Map?
Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.
Consecutive integer numbers give perfect choice for the hash map but it only has one problem, as it does not have multithreaded access by default. Also since Amazon was mentioned in your question I may think that you need to take into account concurency and RAM limitation issues.
What you might do in the response to such question is to explain that since
you are dissallowed to use any built-in data storage schemes, all you can do is to "emulate" one.
So, let's say you have M = 10^20 products with their numbers and descriptions.
You can partition this set to the groups of N subsets.
Then you can organize M/N containers which have sugnificantly reduced number of elements. Using this idea recursively will give you a way to store the whole set in containers with such property that access to them would have accepted performance rate.
To illustrate this idea, consider a smaller example of only 20 elements.
I would like you to imagive the file system with directories "1", "2", "3", "4".
In each directory you store the product descriptions as files in the following way:
folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20
Then your search would only need two steps to find the file.
First, you search for a correct folder by dividing 20 / 5 (your M/N).
Then, you use the given ID to read the product description stored in a file.
This is just a very rough description, however, the idea is very intuitive.
So, perhaps this is what your interviewer wanted to hear.
As for myself, when I face such questions on interview, even if I fail to get the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.
Best/efficient for what? Would have been my answer.
E.g. for storing them, probably the fast thing to do are two arrays with 20 elements each. One for the ids, on for the description. Iterating over those is pretty fast to. And it is efficient memory wise.
Of course the solution is pretty useless for any real application, but so is the question.
There is an interesting alternative to B-Tree: Radix Tree
I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.
You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.
I wouldn't do it this way, but perhaps that is the answer they were looking for.
Asaf
I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.
The hashmaps work really well if the hashing function gives you a very uniform distribution of the hashvalues of the existing keys. With really bad hash function it can happen so that hash values of your 20 values will be the same, which will push the retrieval time to O(n). The binary search on the other hand guaranties you O(log n), but inserting data is more expensive.
All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.
If the size is limited sometimes it's faster to use a sorted list.
When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.
On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
But as Artyom has pointed out maybe a simple linear search would be much faster in this case.
On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.
There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.
Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters".... because if there's only 20 of them, there's no value in considering the algorithm. If you can't use hashing or Btrees code a linear search and move on. If you like, sort your array, and use a binary search.
But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)
Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.
20 decimal PRODUCT IDs, along with Product Description
Simple linear search would be very good...
I would create one simple array with ids. And other array with data.
Linear search for small amount of keys (20!) is much more efficient then any binary-tree or hash.
I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.
They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.
Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.
I think what he wanted you to do, and
I'm not saying it's a good idea, is to
use the computer memory space.
If you use a 64-bit (virtual) memory
address, and assuming you have all the
address space for your data (which is
never the case) you can store a
one-byte value.
Unfortunately 2^64 =approx= 1.8 * 10^19. Just slightly below 10^20. Coincidence?
log2(10^20) = 66.43.
Here's a slightly evil proposal.
OK, 2^64 bits can fit inside a memory space.
Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?)
Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.
Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.
Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)
Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.
Of course paging will start to be a bitch at some point!
Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.

Log combing algorithm

We get these ~50GB data files consisting of 16 byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon as best answer, because I think the best algorithm is a modification of the majority element he linked to. The majority algorithm should be modifiable to only use a tiny amount of memory - like 201 codes to get 1/2% I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two pass algorithm, not one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least 2 passes (the first batch of elements you load could all be different and one of those codes could end up being exactly 1/2%)
Thanks for the help!
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
this will depend on the distribution of the codes. if there are a small enough number of distinct codes you can build a http://en.wikipedia.org/wiki/Frequency_distribution in core with a map. otherwise you probably will have to build a http://en.wikipedia.org/wiki/Histogram and then make multiple passes over the data examining frequencies of codes in each bucket.
Sort chunks of the file in memory, as if you were performing and external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
If every code appears just once, that's no problem; you just need to be able to count them.
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code, and counting the overall number. Finally, filter all keys with counters that exceed (* overall-counter 1/200).
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline then into the next process earlier.

Resources