"In-place" iterative LSD n-radix sort - algorithm

Is it possible to implement an "in-place" iterative LSD n-radix sort? To clarify: I've read the Wikipedia article on in-place MSD radix sort. There it says that:
Counting sort is used to determine the size of each bin and their starting index.
and therefore an auxiliary array is needed to store the indexes. If that is all it needs, I still consider it an in-place algorithm (since this array would only be of size n for an n-radix sort). I have also read this answer, where once again a recursive MSD radix sort is implemented; that one also has no general implementation for an n-radix sort.

EDIT: I've finally described my algorithm in the answer here: https://cs.stackexchange.com/questions/93563/fast-stable-almost-in-place-radix-and-merge-sorts
Yes. Virtually split the input data into pages of some size (4 KB will serve well), then reuse these pages as you consume their data. You will need some extra memory, though: up to n pages for the initial buckets, next-page pointers (one pointer per page), and 3*n pointers for the head_page/current_page/current_write_ptr of each bucket.
The algorithm:
1. Allocate n pages and put them into a list of free pages.
2. Process the input data, moving entries into their buckets. Once an entire page of input data has been processed, add that page to the list of free pages. Once a bucket page is filled, add it to that bucket's page list and allocate a new one from the free list. You will end up with a list of pages per bucket, whose last page is partially filled.
3. Move the data around to fit it back into the original array in the order of the buckets.
Of course, if you need multiple LSD passes, you can skip step 3 on all but the last pass and start each sorting pass directly from the lists built in step 2. That will require n extra pages, though.
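To make the bookkeeping concrete, here is a rough Python sketch of one distribution pass (step 2) followed by the copy-back (step 3). It only models the page lists: PAGE_SIZE, RADIX and the helper name are my own, and in Python the pages are small lists rather than reused 4 KB blocks of raw memory, so the "in-place" reuse is simulated rather than real.

PAGE_SIZE = 1024                 # elements per "page"
RADIX = 256                      # number of buckets per LSD pass

def lsd_pass(data, shift):
    free_pages = [[] for _ in range(RADIX)]      # step 1: pre-allocated spare pages
    bucket_pages = [[[]] for _ in range(RADIX)]  # per bucket: list of pages, last one open

    # step 2: distribute keys, recycling each input page once it has been consumed
    for start in range(0, len(data), PAGE_SIZE):
        for key in data[start:start + PAGE_SIZE]:
            b = (key >> shift) & (RADIX - 1)
            page = bucket_pages[b][-1]
            page.append(key)
            if len(page) == PAGE_SIZE:           # bucket page full: open a new one
                bucket_pages[b].append(free_pages.pop() if free_pages else [])
        free_pages.append([])                    # this input page is now free

    # step 3: write the buckets back into the original array, in bucket order
    pos = 0
    for pages in bucket_pages:
        for page in pages:
            data[pos:pos + len(page)] = page
            pos += len(page)

Calling lsd_pass(data, 0), then lsd_pass(data, 8), and so on up to the key width sorts non-negative integer keys one byte at a time.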

Related

Variable length free list

The concept of a free list is commonly used for reusing space: if I have a file full of fixed-length values and I delete one, I put its slot on a free list. Then, when I need to insert a new value, I take a slot from the list and put the value in that spot.
However, I am a bit confused as to how to implement a free list for variable-length values. If I delete a value and put its position and length on the free list, how do I retrieve the "best" candidate for a new value?
Using a plain list would be O(n). Using a tree (with length as key) would make that O(log n). Is there anything better that would give O(1)?
Yes, a hash table! You have a big hash table whose keys are the sizes of the free blocks and whose values are arrays holding pointers to blocks of the corresponding size. Each time you free a block:
hash[block.size()].append(block.address())
And each time you allocate a free block:
block = hash[requested_size].pop()
The problem with this method is that there are too many possible block sizes, so the hash fills up with millions of keys, wasting enormous amounts of memory.
So instead you can have a list and iterate it to find a suitable block:
for block in blocks:
    if block.size() >= requested_size:
        blocks.remove(block)
        return block
Memory efficient but slow because you might have to scan through millions of blocks.
So what you do is combine these two methods. If you set your allocation quantum to 64 bytes, then a hash of 256 size classes covers all allocations up to 64 * 256 = 16 KB. Blocks larger than that are stored in a tree, which gives you O(log n) insertion and removal.
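A minimal Python sketch of that combined scheme, assuming block sizes are rounded up to multiples of the quantum; the class and method names are mine, and the sorted list kept with bisect stands in for the balanced tree a real allocator would use for the large blocks.

import bisect

QUANTUM = 64
NUM_CLASSES = 256                        # covers sizes up to 64 * 256 = 16 KB

def size_class(size):
    """Round up to the allocation quantum and map to a 0-based class index."""
    return (size + QUANTUM - 1) // QUANTUM - 1

class FreeList:
    def __init__(self):
        # one bucket of free-block addresses per size class
        self.classes = [[] for _ in range(NUM_CLASSES)]
        # (size, address) for blocks larger than 16 KB, kept sorted by size;
        # a real implementation would use a balanced tree here
        self.large = []

    def free(self, address, size):
        cls = size_class(size)           # block sizes are multiples of QUANTUM
        if cls < NUM_CLASSES:
            self.classes[cls].append(address)
        else:
            bisect.insort(self.large, (size, address))

    def allocate(self, size):
        cls = size_class(size)
        # at most NUM_CLASSES buckets to probe, so effectively O(1)
        for c in range(cls, NUM_CLASSES):
            if self.classes[c]:
                return self.classes[c].pop()
        # fall back to the large-block structure: first block of size >= size
        i = bisect.bisect_left(self.large, (size, 0))
        if i < len(self.large):
            return self.large.pop(i)[1]
        return None                      # no suitable free block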

Assembly sorting

I am working on a Task Schedule Simulator which needs to be programmed in Assembly language.
I've been struggling with task sorting:
I am allocating new memory for each Task (the user can insert a task; using the sbrk instruction I allocate 20 bytes that contain a word for the Task's numeric ID, another word for its priority expressed as an int, and another word for the number of cycles needed to finish the task), and I'm storing the address of each new Task on the stack.
My problem is: I need to sort these tasks, and the sorting can be based either on priority or on the number of cycles. When I pop these Tasks I can easily access the right field (since the structure is very rigid, I just need to use the right offset in the lw instruction and voilà), but then comparing and sorting get complicated.
I am working on the pseudocode for this part of the program and can't find any way to untie the knot.
Let me first try and paraphrase what you have indicated as the problem.
You have a stack that holds "records" of the structure
{ word : id, word : priority, word : cycle_count, dword : address}
Since the end objective is to "pop" these in the desired order, we have to execute an in-place sort. There are many algorithms to choose from, but to keep matters simple (and taking a cue from the underlying assumption that the number of tasks is not that large), I will explain using bubble sort. There exists a vast body of literature comparing the various sorting algorithms in detail; if relevant, Wikipedia is a perfectly good starting point.
Step 1: Set the data pointer = stack pointer + count_of_records * 20. Effectively, for the next few steps, the data pointer points to the top of the "table of records", which happens to be located on the stack. An advanced consideration, not required on MIPS, is to assert DS = SS.
Step 2: Identify which record pair needs to be swapped, using the appropriate offset within a record to pick the field that defines the sort order.
Step 3: Allocate 20 bytes as a temporary buffer and use it to hold the record being swapped. An advanced consideration here is whether the environment can take an interrupt while the swap is in progress; MIPS does not have an atomic lock for this, so the memory move needs to be done carefully.
Once the requisite number of passes is completed, the table will be sorted, and it will remain in place. The temporary buffer used to hold a record may then be released.
The vital statistics of bubble sort are that it is O(n^2), that it responds well to almost-sorted input (not very likely in your example), and that it copes well with the fact that, in the midst of sorting, some task may find the processor free to run and therefore has to be removed from the queue by a POP, forcing the sort to restart. That restart, however, will find the table almost sorted, so on a continuous basis the table will show fairly strong pre-sorted behavior. Most importantly, it has perhaps the smallest code footprint of all the in-place algorithms.
Trust this helps
You might want to introduce a level of indirection, and sort pointers to your structs based on comparing the pointed-to data.
If your sort keys are all integers of the same size at different offsets within the structs, your sort function could take an offset as a parameter. e.g. lw from base + off to get the integer that you're going to compare.
Insertion sort is probably easiest to code, and much better than BubbleSort in the not-almost-sorted case. If you care about having a result ready to pop ASAP, before the whole array is sorted, then use Selection Sort.
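To illustrate the indirection and the key-offset parameter in a language-neutral way, here is a small Python sketch (not MIPS); memory, the 20-byte record layout, and the helper names are stand-ins for the simulator's state.

import struct

RECORD_SIZE = 20                           # as allocated by sbrk in the question
OFF_ID, OFF_PRIORITY, OFF_CYCLES = 0, 4, 8

def load_word(memory, addr):
    """Rough equivalent of lw: read a 4-byte little-endian word at addr."""
    return struct.unpack_from("<i", memory, addr)[0]

def insertion_sort_tasks(task_addrs, memory, key_offset):
    """Sort the list of task addresses in place, comparing the word found at
    key_offset inside each pointed-to record (priority or cycle count)."""
    for i in range(1, len(task_addrs)):
        addr = task_addrs[i]
        key = load_word(memory, addr + key_offset)
        j = i - 1
        while j >= 0 and load_word(memory, task_addrs[j] + key_offset) > key:
            task_addrs[j + 1] = task_addrs[j]
            j -= 1
        task_addrs[j + 1] = addr

Sorting by priority is then insertion_sort_tasks(addrs, memory, OFF_PRIORITY), and by cycle count insertion_sort_tasks(addrs, memory, OFF_CYCLES).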
It wasn't clear whether your code is itself going to be multi-threaded, or if you can just write a normal sort function. #qasar66's answer seems to be suggesting a BubbleSort with atomic swaps, so other threads can safely look at the partially-sorted array while it's being sorted.
If you only ever need to pop the min element, one of the best data structures is a Heap. It takes more code to implement, so if ease of implementation is your top goal, use a simple sort. Heapifying an un-sorted array is cheaper than doing a full sort: the full O(n log n) cost of extracting all elements in order is amortized over the extracts. So it's great if you want to be able to change the sort key, since you don't have to do all the work of fully sorting.

Finding k most common words in a file - memory usage

Let's say you are given a huge file, say 1GB. The file contains a word on each line (total n words), and you want to find the k most frequent terms in the file.
Now, assuming you have enough memory to store these words, what is the better way to approach the question in terms of reducing the memory usage and the constant overhead in the Big-O complexity? I believe there are two basic algorithms one can use:
Use a hash table and a min-heap to store the occurrences and the top k words seen. This is O(n + n log k) ~ O(n).
Use a trie to store the words and occurrences and then traverse the trie to count the most frequent words. This is O(n*p) ~ O(n), where p is the length of the longest word.
Which is a better approach?
Also: if you didn't have enough memory for a hash table/trie (i.e. limited memory of 10MB or so), then what is the best approach?
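For reference, a minimal Python sketch of the first approach (a hash table of counts plus a size-k min-heap), assuming one word per line as in the question:

import heapq
from collections import Counter

def top_k_words(path, k):
    """Hash table (Counter) for the occurrences, then a size-k min-heap:
    O(n) to count plus O(m log k) to select, where m is the number of
    distinct words."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:
                counts[word] += 1
    # heapq.nlargest maintains a min-heap of size k internally
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])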
Which is more efficient in terms of constants depends heavily on the data. On one hand, a trie offers strict O(n) time for inserting all elements, while a hash table might decay to quadratic time in the worst case.
On the other hand, tries are not very efficient when it comes to cache: each lookup requires O(|S|) random memory accesses, which can hurt performance significantly.
Both approaches are valid, and I think there are multiple considerations that should be taken into account when choosing one over the other, like maximum latency (if it is a real-time system), throughput, and time to develop.
If average-case performance is all that matters, I'd suggest generating a bunch of files and running a statistical analysis of which approach is better. The Wilcoxon signed-rank test is the de facto standard hypothesis test for this.
Regarding embedded systems: both approaches are still valid, but here:
Each "node" (or bunch of nodes) of the trie will be on disk rather than in RAM. Note that this means O(|S|) random-access disk seeks per entry for the trie, which might be slow.
For the hashing solution, you have 10 MB; say 5 MB of it can be used for a hash table of pointers to disk, and assume (pessimistically) that you can store 500 different disk addresses in those 5 MB. That leaves 5 MB to load a bucket after each hash lookup. With 500 buckets and a load factor of 0.5, this covers 500 * 5 MB * 0.5 ≈ 1.25 GB > 1 GB of your data, so each lookup needs only O(1) random disk seeks to find the bucket containing the relevant string.
Note that if this is still not enough, we can rehash the pointer tables, very much like the multi-level page tables used by the virtual memory mechanism.
From this we can conclude that, for embedded systems, the hash solution is better in most cases (note that it might still suffer from high latency in the worst case; there is no silver bullet here).
PS: a radix tree is usually faster and more compact than a trie, but it suffers from the same drawbacks as a trie compared to hash tables (though less significantly, of course).
For the limited-memory option you could quicksort the list first, then simply populate a hash table with k items in it. You would then need one more counter to track how many occurrences the current word has; if it is higher, you replace the lowest item in the hash table with the current word.
This would probably work okay for the initial list, but it would be slower than just scanning the full list and populating a hash table with the counts.
Do you have a drive to store intermediate results? If so:
You can keep a small meta structure and a set of hash tables.
Read a part of the data and fill a hash table while its size stays below, say, 3 MB (if your limit is 10 MB, a 3 MB hash table is a reasonable example); once it grows past 3 MB, save it to disk and start a new one.
The meta structure describes your hash tables: for each one you can store the number of unique words, the total word count, and the maximum count of any single word.
After this, you can load the hash tables from disk and merge them, for example in ascending order of unique words or of the maximum single-word count per table.
At this step you may use some heuristics.
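A rough Python sketch of that spill-and-merge idea; the PARTIAL_LIMIT threshold (standing in for "the hash table grew past ~3 MB"), the file names, and the one-word-per-line input format are all illustrative assumptions.

import heapq
from collections import Counter
from itertools import groupby

PARTIAL_LIMIT = 200_000        # stand-in for "the hash table grew past ~3 MB"

def spill(counts, index):
    """Write one partial hash table to disk as sorted "word<TAB>count" lines."""
    path = f"partial_{index}.tsv"
    with open(path, "w") as f:
        for word in sorted(counts):
            f.write(f"{word}\t{counts[word]}\n")
    return path

def top_k_with_spills(input_path, k):
    partials, counts = [], Counter()
    with open(input_path) as f:
        for line in f:
            counts[line.strip()] += 1
            if len(counts) >= PARTIAL_LIMIT:       # table "too big": spill to disk
                partials.append(spill(counts, len(partials)))
                counts = Counter()
    partials.append(spill(counts, len(partials)))

    # merge the sorted partial tables, summing the counts of equal words,
    # and keep only a running top k so memory stays bounded
    def parse(f):
        for line in f:
            word, n = line.rstrip("\n").split("\t")
            yield word, int(n)

    files = [open(p) for p in partials]
    merged = heapq.merge(*(parse(f) for f in files), key=lambda wc: wc[0])
    top = []
    for word, group in groupby(merged, key=lambda wc: wc[0]):
        total = sum(n for _, n in group)
        heapq.heappush(top, (total, word))
        if len(top) > k:
            heapq.heappop(top)
    for f in files:
        f.close()
    return sorted(top, reverse=True)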

Most frequent words in a terabyte of data

I came across a problem where we have to find, say, the 10 most frequent words in a terabyte of file or string.
One solution I could think of was using a hash table (word, count) along with a max-heap. But fitting all the words in memory might cause a problem if the words are mostly unique.
I thought of another solution using Map-Reduce by splitting the chunks on different nodes.
Another solution would be to build a Trie for all the words and update the count of each word as we scan through the file or string.
Which one of the above would be a better solution? I think the first solution is pretty naive.
Split your available memory into two halves. Use one as a 4-bit counting Bloom filter and the other half as a fixed-size hash table with counts. The role of the counting Bloom filter is to filter out rarely occurring words with high memory efficiency.
Check your 1 TB of words against the initially empty Bloom filter; if a word is already in it and all of its buckets are at the maximum value of 15 (this may be partly or wholly a false positive), pass it through to the hash table. If not, increment its buckets.
Words that passed through get counted; for a majority of words, this is every time but the first 15 times you see them. A small percentage will start to get counted even sooner, bringing a potential inaccuracy of up to 15 occurrences per word into your results. That's a limitation of Bloom filters.
When the first pass is over, you can correct the inaccuracy with a second pass if desired. Deallocate the Bloom filter, and also deallocate all counts that are not within 15 occurrences of the tenth most frequent word. Go through the input again, this time counting words accurately (using a separate hash table), but ignoring words that were not retained as approximate winners from the first pass.
Notes
The hash table used in the first pass may theoretically overflow with certain statistical distributions of the input (e.g., each word exactly 16 times) or with extremely limited RAM. It is up to you to calculate or try out whether this can realistically happen to you or not.
Note also that the bucket width (4 bits in the above description) is just a parameter of the construction. A non-counting Bloom filter (bucket width of 1) would filter out most unique words nicely, but do nothing to filter out other very rarely occurring words. A wider bucket size will be more prone to cross-talk between words (because there will be fewer buckets), and it will also reduce the guaranteed accuracy level after the first pass (15 occurrences in the case of 4 bits). But these downsides will be quantitatively insignificant up to some point, while I imagine the more aggressive filtering effect is completely crucial for keeping the hash table at sub-gigabyte sizes with non-repetitive natural-language data.
As for the order-of-magnitude memory needs of the Bloom filter itself: these people are working way below 100 MB, and with a much more challenging application ("full" n-gram statistics, rather than threshold 1-gram statistics).
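A rough Python sketch of that first pass; the filter size, the number of hash functions, and the choice to spend one byte per 4-bit bucket (rather than packing two buckets per byte) are simplifications of mine.

import hashlib
from collections import Counter

NUM_BUCKETS = 8 * 1024 * 1024      # tune so the filter uses about half your memory
NUM_HASHES = 4
MAX_COUNT = 15                     # maximum value of a 4-bit bucket

class CountingBloomFilter:
    def __init__(self):
        self.buckets = bytearray(NUM_BUCKETS)   # one byte per 4-bit bucket, for simplicity

    def _indexes(self, word):
        digest = hashlib.blake2b(word.encode()).digest()
        for i in range(NUM_HASHES):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % NUM_BUCKETS

    def add_and_check(self, word):
        """Increment the word's buckets (saturating at 15) and report whether
        every bucket was already saturated, i.e. the word "passes through"."""
        saturated = True
        for idx in self._indexes(word):
            if self.buckets[idx] < MAX_COUNT:
                self.buckets[idx] += 1
                saturated = False
        return saturated

def approximate_counts(path):
    bloom, counts = CountingBloomFilter(), Counter()
    with open(path) as f:
        for line in f:
            word = line.strip()
            if bloom.add_and_check(word):
                counts[word] += 1          # counted only after roughly 15 sightings
    return counts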
Sort the terabyte file alphabetically using mergesort. In the initial pass, use quick sort using all available physical RAM to pre-sort long runs of words.
When doing so, represent a continuous sequence of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)
Then resort the file, again using mergesort with quick sort presorting, but this time by the counts rather than alphabetically.
This is slower but simpler to implement than my other answer.
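As a small illustration of the "one word and a count" representation, here is a Python sketch of the count-combining merge step, where run_a and run_b stand for two already alphabetically sorted runs of (word, count) pairs:

import heapq
from itertools import groupby

def merge_runs(run_a, run_b):
    """Merge two sorted runs of (word, count) pairs, adding the counts of
    identical words, as described for the mergesort passes above."""
    merged = heapq.merge(run_a, run_b, key=lambda wc: wc[0])
    for word, group in groupby(merged, key=lambda wc: wc[0]):
        yield word, sum(count for _, count in group)

For example, merging [("apple", 2), ("cat", 1)] with [("apple", 3), ("dog", 5)] yields [("apple", 5), ("cat", 1), ("dog", 5)].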
The best I could think of:
Split the data into parts you can store in memory.
For each part, get the N most frequent words; you will get N * partsNumber words.
Read all the data again, counting the words you got before.
It won't always give you the correct answer, but it will work in fixed memory and linear time.
And why do you think building a Trie structure is not the best decision? Mark all child nodes with a counter and that's it! The maximum memory complexity will be O(26 * longest_word_length), and the time complexity should be O(n); that's not bad, is it?
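For what it's worth, a minimal Python sketch of a counting trie along those lines; for a real terabyte input it would need the external-memory tricks from the other answers.

class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}      # next letter -> TrieNode
        self.count = 0          # how many times the word ending here was seen

def add_word(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def most_frequent(root, k):
    """Walk the trie and return the k (count, word) pairs with the highest counts."""
    results = []
    def walk(node, prefix):
        if node.count:
            results.append((node.count, prefix))
        for ch, child in node.children.items():
            walk(child, prefix + ch)
    walk(root, "")
    return sorted(results, reverse=True)[:k]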

Find common words from two files

Given two files containing lists of words (around a million each), we need to find the words that are in common.
Use some efficient algorithm; there is also not enough memory available (certainly not for a million words). Some basic C programming code, if possible, would help.
The files are not sorted. We can use some sort of algorithm... Please support it with basic code...
Sorting the file externally, with minimal memory available... how can it be implemented in C?
Anybody game for external sorting of a file? Please share some code for this.
Yet another approach.
General. First, notice that doing this naively, comparing every word of one list against every word of the other, takes O(N^2). With N = 1,000,000, this is a LOT. Sorting each list would take O(N*log(N)); then you can find the intersection in one pass by merging the files (see below). So the total is O(2N*log(N) + 2N) = O(N*log(N)).
Sorting a file. Now let's address the fact that working with files is much slower than working with memory, especially when sorting, where you need to move things around. One way to solve this is to decide the size of a chunk that can be loaded into memory, load the file one chunk at a time, sort each chunk efficiently, and save it into a separate temporary file. The sorted chunks can then be merged (again, see below) into one sorted file in a single pass.
Merging. When you have 2 sorted lists (files or not), you can merge them into one sorted list easily in one pass: have 2 "pointers", initially pointing to the first entry of each list. In each step, compare the values the pointers point to, move the smaller value to the merged list (the one you are constructing), and advance its pointer.
You can easily modify the merge algorithm to make it find the intersection: if the pointed-to values are equal, move one of them to the results and advance both pointers (consider how you want to deal with duplicates).
For merging more than 2 lists (as in sorting the file above) you can generalize the algorithm to use k pointers.
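A minimal Python sketch of the merge-style intersection over two already-sorted word files; the file paths and the one-word-per-line format are assumptions.

def sorted_intersection(path_a, path_b, out_path):
    """Two-pointer merge over two alphabetically sorted word files,
    writing the common words to out_path (each common word emitted once)."""
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        word_a, word_b = fa.readline(), fb.readline()
        last_written = None
        while word_a and word_b:
            if word_a < word_b:
                word_a = fa.readline()
            elif word_b < word_a:
                word_b = fb.readline()
            else:                          # equal: a common word
                if word_a != last_written:
                    out.write(word_a)
                    last_written = word_a
                word_a, word_b = fa.readline(), fb.readline()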
If you had enough memory to read the first file completely into RAM, I would suggest reading it into a dictionary (word -> index of that word ), loop over the words of the second file and test if the word is contained in that dictionary. Memory for a million words is not much today.
If you do not have enough memory, split the first file into chunks that fit into memory and do as I said above for each chunk. For example, fill the dictionary with the first 100,000 words, find every common word for those, then read the file a second time extracting words 100,001 up to 200,000, find the common words for that part, and so on.
And now the hard part: you need a dictionary structure, and you said "basic C". When you are willing to use "basic C++", there is the hash_map data structure provided as an extension to the standard library by common compiler vendors. In basic C, you should also try to use a ready-made library for that; read this SO post to find a link to a free library which seems to support it.
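The question asks for C, but as a language-neutral illustration of the chunked-dictionary idea, here is a Python sketch; the chunk size and the assumption that the set of common words itself fits in memory are mine.

CHUNK_SIZE = 100_000      # how many words of the first file to hold at a time

def common_words(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Load the first file in chunks, then stream the second file against
    each chunk, collecting the words present in both."""
    common = set()
    with open(path_a) as fa:
        while True:
            chunk = {fa.readline().strip() for _ in range(chunk_size)}
            chunk.discard("")             # readline() returns "" at end of file
            if not chunk:
                break
            with open(path_b) as fb:
                for line in fb:
                    word = line.strip()
                    if word in chunk:
                        common.add(word)
    return common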
Your problem is: given two sets of items, find the intersection (items common to both), while staying within the constraints of inadequate RAM (less than the size of either set).
Since finding an intersection requires comparing/searching each item against the other set, you must have enough RAM to store at least one of the sets (the smaller one) to have an efficient algorithm.
Assume that you know for a fact that the intersection is much smaller than both sets and fits completely inside available memory; otherwise you'll have to do further work in flushing the results to disk.
If you are working under memory constraints, partition the larger set into parts that fit inside 1/3 of the available memory. Then partition the smaller set into parts that fit into the second 1/3. The remaining 1/3 of memory is used to store the results.
Optimize by finding the max and min of the partition for the larger set. This is the set that you are comparing from. Then when loading the corresponding partition of the smaller set, skip all items outside the min-max range.
First find the intersection of both partitions through a double loop, storing common items in the results set and removing them from the original sets to save on comparisons further down the loop.
Then replace the partition in the smaller set with the second partition (skipping items outside the min-max). Repeat. Notice that the partition in the larger set is reduced -- with common items already removed.
After running through the entire smaller set, repeat with the next partition of the larger set.
Now, if you do not need to preserve the two original sets (e.g. you can overwrite both files), then you can further optimize by removing common items from disk as well. This way, those items no longer need to be compared in further partitions. You then partition the sets by skipping over removed ones.
I would give prefix trees (aka tries) a shot.
My initial approach would be to determine a maximum depth for the trie that would fit nicely within my RAM limits. Pick an arbitrary depth (say 3, you can tweak it later) and construct a trie up to that depth, for the smaller file. Each leaf would be a list of "file pointers" to words that start with the prefix encoded by the path you followed to reach the leaf. These "file pointers" would keep an offset into the file and the word length.
Then process the second file by reading each word from it and trying to find it in the first file using the trie you constructed. It would allow you to fail faster on words that don't match. The deeper your trie, the faster you can fail, but the more memory you would consume.
Of course, like Stephen Chung said, you still need RAM to store enough information to describe at least one of the files, if you really need an efficient algorithm. If you don't have enough memory -- and you probably don't, because I estimate my approach would require approximately the same amount of memory you would need to load a file whose words were 14-22 characters long -- then you have to process even the first file by parts. In that case, I would actually recommend using the trie for the larger file, not the smaller. Just partition it in parts that are no bigger than the smaller file (or no bigger than your RAM constraints allow, really) and do the whole process I described for each part.
Despite the length, this is sort of off the top of my head. I might be horribly wrong in some details, but this is how I would initially approach the problem and then see where it would take me.
If you're looking for memory efficiency with this sort of thing you'll be hard pushed to get time efficiency. My example will be written in python, but should be relatively easy to implement in any language.
with open(file1) as file_1:
    current_word_1 = read_to_delim(file_1, delim)
    while current_word_1:
        with open(file2) as file_2:
            current_word_2 = read_to_delim(file_2, delim)
            while current_word_2:
                if current_word_2 == current_word_1:
                    print(current_word_2)
                current_word_2 = read_to_delim(file_2, delim)
        current_word_1 = read_to_delim(file_1, delim)
I leave read_to_delim to you, but this is the extreme case: memory-optimal but least optimal in time.
Depending on your application, of course, you could load the two files into a database, perform a left outer join, and discard the rows for which one of the two columns is null.
