Efficient data structure for storing nongrammatical strings - data-structures

I need a data structure in which I can store objects of variable size and later modify an object's bytes or remove it, but not change its size (that would be done only by removing it and reinserting it with a new size). Objects do not need random access, only sequential. I need its memory efficiency to approach 1 as the total memory allocated approaches infinity (assuming all pointers magically require constant space, and we won't be questioning that).
I know about tries and that they are popular for storing strings, but given my needs a trie is just not what I am looking for. My strings will not have "common" morphemes; they will be technical and pseudo-random. I am not storing words.
Another option I came up with is to have a magic constant M and then M vectors, where the k-th vector stores chunks of k bytes plus one pointer to another chunk (the previous chunk in what follows). Additionally, the elements in the M-th vector will have 2 pointers: one for the previous and one for the next chunk. Then I would split the string I am about to insert into chunks of M bytes each and store them in the M-th vector as a linked list. The last chunk, with possibly fewer than M bytes, I would store in the appropriate other vector. When removing a string, I would remove all its chunks from the vectors and then move the remaining chunks and reconnect them so that the vectors again consist of consecutive chunks, i.e. have no holes.
This idea satisfies my needs except for how its efficiency converges. Additionally, there is the cost of M separate vectors, which can't be ignored on a real computer.
Is there an existing idea that explains how to build such a structure?

Related

Sorting a small array into a large sorted array

What is the best algorithm for merging a large sorted array with a small unsorted array?
I'll give examples of what I mean from my particular use case, but don't feel bound by them: I'm mostly trying to give a feel for the problem.
8 MB sorted array with 92 kB unsorted array (in-cache sort)
2.5 GB sorted array with 3.9 MB unsorted array (in-memory sort)
34 GB sorted array with 21 MB unsorted array (out-of-memory sort)
You can implement a chunk-based algorithm to solve this problem efficiently (whatever the input size of the arrays as long as one is much smaller than the other).
First of all, you need to sort the small array (possibly using a radix sort or a bitonic sort if you do not need a custom comparator).
Then the idea is to cut the big array into chunks that fit entirely in the CPU cache (e.g. 256 KiB).
For each chunk, use a binary search to find the index of the last item in the small array that is <= the last item of the chunk.
This is relatively fast because the small array likely fits in the cache and the same items of the binary search are fetched between consecutive chunks if the array is big.
This index tells you how many items need to be merged with the chunk before it is written.
For each value to be merged into the chunk, find the index of the value using a binary search in the chunk.
This is fast because the chunk fits in the cache.
Once you know the indices of the values to be inserted in the chunk, you can efficiently move the items block by block within each chunk (possibly in-place from the end to the beginning).
This implementation is much faster than the traditional merge algorithm since the number of comparisons needed is much smaller thanks to the binary search and the small number of items to be inserted per chunk.
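A minimal Python sketch of the chunk-based merge described above. The function name `chunked_merge` and the tiny default `chunk_size` are made up for illustration, and the sketch writes to a new list instead of moving blocks in place, so it only shows the control flow, not the cache behaviour.

```python
from bisect import bisect_right


def chunked_merge(big, small, chunk_size=4):
    """Merge a sorted list `big` with an unsorted list `small` (illustrative only)."""
    small = sorted(small)          # the small array must be sorted first
    out = []
    start = 0                      # first item of `small` not merged yet
    for lo in range(0, len(big), chunk_size):
        chunk = big[lo:lo + chunk_size]
        is_last = lo + chunk_size >= len(big)
        # Binary search in `small`: items <= the chunk's last item belong here.
        # The last chunk also takes everything that is left over.
        end = len(small) if is_last else bisect_right(small, chunk[-1], start)
        prev = 0
        for value in small[start:end]:
            # Binary search in the chunk for the insertion point of `value`.
            pos = bisect_right(chunk, value, prev)
            out.extend(chunk[prev:pos])   # copy a whole block of the chunk
            out.append(value)
            prev = pos
        out.extend(chunk[prev:])          # rest of the chunk
        start = end
    out.extend(small[start:])             # only non-empty when `big` is empty
    return out
```

For example, `chunked_merge(list(range(0, 100, 3)), [50, -1, 7, 200])` gives the same list as sorting the concatenation of both inputs.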
For relatively big inputs, you can use a parallel implementation. The idea is to work on a group of multiple chunks at the same time (i.e. super-chunks).
Super-chunks are much bigger than classical ones (e.g. >= 2 MiB).
Each thread works on one super-chunk at a time. A binary search is performed on the small array to know how many values are inserted in each super-chunk.
This number is shared between threads so that each thread knows where it can safely write its output independently of the other threads (one could use a parallel-scan algorithm to do that on massively parallel architectures). Each super-chunk is then split into classical chunks and the previous algorithm is used to solve the problem in each thread independently.
This method should be more efficient even when run sequentially if the small input array does not fit in the cache, since the number of binary search operations over the whole small array will be significantly reduced.
The (amortized) time complexity of the algorithm is O(n (1 + log(m) / c) + m (1 + log(c))) with n the length of the big array, m the length of the small array and c the chunk size (super-chunks are ignored here for the sake of clarity, but they only change the complexity by a constant factor, like the constant c does).
Alternative method / Optimization: If your comparison operator is cheap and can be vectorized using SIMD instructions, then you can optimize the traditional merge algorithm. The traditional method is quite slow because of branches (which can hardly be predicted in the general case) and also because it cannot be easily/efficiently vectorized. However, because the big array is much bigger than the small array, the traditional algorithm will pick a lot of consecutive values from the big array in between the ones from the small array. This means that you can pick SIMD chunks of the big array and compare their values with one item of the small array. If all SIMD items are smaller than the one picked from the small array, then you can write the whole SIMD chunk at once very efficiently. Otherwise, you need to write part of the SIMD chunk, then write the item of the small array and switch to the next one. This last operation is clearly less efficient, but it should happen rarely since the small array is much smaller than the big one. Note that the small array still needs to be sorted first.

algorithm to find duplicated byte sequences

Hello everyone. Please help me solve a task:
As input I have an array of bytes. I need to detect duplicated sequences of bytes in order to compress the duplicates. Can anyone help me find a suitable algorithm?
Usually this type of problem comes with CPU/memory tradeoffs.
The extreme solutions would be (1) and (2) below; you can improve from there:
1) High CPU / low memory: iterate over all possible combination sizes and letters and find duplicates (2 nested for statements).
2) Low CPU / high memory: create a lookup table (hash map) of all required combinations and lengths, traverse the array and add to the table, then traverse the table and find your candidates.
Improve from here: if you want the idea from (2) with lower memory, decrease the lookup table size, which gives more hits per entry, and deal with that problem later. If you want faster lookup by sequence length, create a separate lookup table per length.
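As a rough illustration of option (2) for one fixed sequence length, here is a small Python sketch. The name `find_repeats` and the choice of a single `length` parameter are assumptions made for the example; one such table per length would correspond to the "separate lookup table per length" idea.

```python
from collections import defaultdict


def find_repeats(data: bytes, length: int) -> dict:
    """Map every byte sequence of `length` bytes that occurs more than once
    to the list of offsets where it starts (illustrative sketch)."""
    positions = defaultdict(list)
    # First pass: add every window of `length` bytes to the lookup table.
    for i in range(len(data) - length + 1):
        positions[data[i:i + length]].append(i)
    # Second pass: keep only the sequences that were seen at least twice.
    return {seq: pos for seq, pos in positions.items() if len(pos) > 1}
```

The table holds one entry per distinct window, which is the memory cost this approach trades for speed.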
Build a tree whose branches are all possible byte values (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of positions where this sequence is found.
For example, say you are at node AC,40,2F. This sequence in the tree means: "Byte AC was found at position xx (one of the positions stored in that node). The next byte, 40, was at position yy=xx+1 (among others). The byte 2F was at position zz=yy+1."
Now suppose you want to "compress" only sequences of at least some size (e.g. 5). Traverse the tree and pay attention to depths of 5 or more. In the subnode at depth 5 of a node you have already stored all positions where such a sequence (or a longer one) is found in the array. Those positions are the ones you want to store in your compressed file.
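The tree idea could be sketched in Python roughly as follows. This is a simplified variant, not necessarily what the answer had in mind: each node stores the start offset of the sequence rather than the position of its last byte, child branches are created lazily in a dict instead of as 256 fixed slots, and `build_trie`, `positions_of` and `max_depth` are hypothetical names.

```python
def build_trie(data: bytes, max_depth: int) -> dict:
    """Build a byte trie up to `max_depth`; each node lists the start offsets
    of the sequence spelled by the path from the root to that node."""
    root = {"pos": [], "next": {}}
    for start in range(len(data)):
        node = root
        for byte in data[start:start + max_depth]:
            node = node["next"].setdefault(byte, {"pos": [], "next": {}})
            node["pos"].append(start)
    return root


def positions_of(root: dict, seq: bytes) -> list:
    """Return the start offsets stored at the node reached by `seq`."""
    node = root
    for byte in seq:
        node = node["next"].get(byte)
        if node is None:
            return []
    return node["pos"]
```

A node whose position list has more than one entry marks a repeated sequence. Limiting the depth keeps the tree from growing with every suffix of the input.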

External merge sort algorithm

I am having some trouble understanding the merge step in the external sort algorithm. I saw this example on Wikipedia but could not understand it.
One example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1) Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
2) Write the sorted data to disk.
3) Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
4) Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5) Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file, and empty it. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.
I am not able to understand the 4th step here. Why are we reading only the first 10 MB of each chunk when we have 100 MB of memory available? How do we decide the number of passes in an external merge? Will we sort each chunk and store them in 9 files?
Suppose that you've broken apart the range to be sorted into k sorted blocks of elements. If you can perform a k-way merge of these sorted blocks and write the result back to disk, then you'll have sorted the input.
To do a k-way merge, you store k read pointers, one per file, and repeatedly look at all k elements, take the smallest, then write that element to the output stream and advance the corresponding read pointer.
Now, since you have all the data stored in files on disk, you can't actually store pointers to the elements that you haven't yet read because you can't fit everything into main memory.
So let's start with a simple way to simulate what the normal merge algorithm would do. Suppose that you store an array of k elements in memory. You read one element from each file into each array slot. Then, you repeat the following:
Scan across the array slots and take the smallest.
Write that element to the output stream.
Replace that array element by reading the next value from the corresponding file.
This approach will work correctly, but it's going to be painfully slow. Remember that disk I/O operations take much, much longer than the corresponding operations in main memory. This merge algorithm ends up doing Θ(n) disk reads (I assume k is much less than n), since every time the next element is chosen, we need to do another read. This is going to be prohibitively expensive, so we need a better approach.
Let's consider a modification. Now, instead of storing an array of k elements, one per file, we store an array of k slots, each of which holds the first R elements from the corresponding file. To find the next element to output, we scan across the array and, for each array, look at the first element we haven't yet considered. We take that minimum value, write it to the output, then remove that element from the array. If this empties out one of the slots in the array, we replenish it by reading R more elements from the file.
This is more complicated, but it significantly cuts down on how many disk reads we need to do. Specifically, since the elements are read in blocks of size R, we only need to do Θ(n / R) disk reads.
We could take a similar approach for minimizing writes. Instead of writing every element to disk one at a time (requiring Θ(n) writes), we store a buffer of size W, accumulating elements into it as we go and only writing the buffer once it fills up. This requires Θ(n / W) disk writes.
Clearly, making R and W bigger will make this approach go a lot faster, but at the cost of more memory. Specifically, we need space for kR items to store k copies of the read buffers of size R, and we need space for W items to store the write buffer of size W. Therefore, we need to pick R and W so that kR + W items fit into main memory.
In the example given above, you have 100MB of main memory and 900MB to sort. If you split the array into 9 pieces, then you need to pick R and W so that (kR + W) · sizeof(record) ≤ 100MB. If every item is one byte, then picking R = 10MB and W = 10MB ensures that everything fits. This is also probably a pretty good distribution, since it keeps the number of reads and writes low.
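To make the buffered merge concrete, here is a small Python simulation. It is only a sketch under assumptions: the sorted runs are plain in-memory lists standing in for files, each slice of up to R elements counts as one "disk read", each flush of up to W elements counts as one "disk write", and the name `buffered_merge` is made up.

```python
def buffered_merge(runs, R, W):
    """k-way merge of sorted `runs` with read buffers of R elements each and a
    write buffer of W elements. Returns (output, reads, writes)."""
    k = len(runs)
    offsets = [0] * k                  # next unread element of each run
    buffers = [[] for _ in range(k)]   # one read buffer per run
    heads = [0] * k                    # next unconsumed element in each buffer
    out, write_buf = [], []
    reads = writes = 0

    def refill(i):
        nonlocal reads
        buffers[i] = runs[i][offsets[i]:offsets[i] + R]
        offsets[i] += len(buffers[i])
        heads[i] = 0
        if buffers[i]:
            reads += 1                 # one simulated block read

    for i in range(k):
        refill(i)
    while True:
        live = [i for i in range(k) if heads[i] < len(buffers[i])]
        if not live:
            break
        # Scan the front element of every non-empty buffer, take the smallest.
        i = min(live, key=lambda j: buffers[j][heads[j]])
        write_buf.append(buffers[i][heads[i]])
        heads[i] += 1
        if heads[i] == len(buffers[i]):
            refill(i)                  # buffer emptied: read the next block
        if len(write_buf) == W:
            out.extend(write_buf)      # one simulated block write
            write_buf = []
            writes += 1
    if write_buf:
        out.extend(write_buf)
        writes += 1
    return out, reads, writes
```

With k runs the buffers hold at most kR + W elements at any time, which is the quantity that has to fit into main memory.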

Sorting with limited memory and read-only disk

Imagine the following scenario: I have a 10 Mb array of integers stored on a read-only storage medium. I wish to print out the numbers in ascending order. However, I only have 2 Mb of main memory (and no hard disk).
A very simple O(n²) solution (which doesn't make use of the available main memory) would be to repeatedly scan the entire input array and incrementally output the next smallest integer. I've tried googling for better sorting algorithms, but the answers keep leading me to in-place or external sorting algorithms, which would not work because of the read-only storage constraint. Is there a better solution?
With the ratio of sizes you gave, you can use the main memory to reduce the number of scans quite dramatically.
First scan: keep an in-memory store, nearly as large as main memory, holding the smallest numbers found so far. While the store is not yet full, add each number read from the array. Once the store is full, compare each new number with the largest number in the store; if the new one is smaller, remove the largest number and add the new one. When the complete array has been scanned, output the stored numbers in order, and remember the largest number stored and how often it occurred in this chunk.
Subsequent scans: keep an occurrence count for the largest number from the previous chunk. If the scanned number equals that largest number and its occurrence count is still smaller than the remembered count, increment the occurrence count but don't add the number to the store; if the occurrence count is larger than or equal to the remembered count, add the number to the store (removing the largest number from the store if necessary). If the scanned number is larger than the largest number of the previous scan but smaller than the largest number in the store (or the store is not yet full), add it to the store (again removing the largest number if necessary). When the scan is complete, output the stored numbers in order, and remember the largest number output so far and the number of times it has been output in total (the largest number might be the same as the one from the previous scan, so you need to know how often it was output in all chunks treated so far).
I'm not sure what the best data structure for the store would be, but I think a heap would be a good choice (comparison with largest: O(1), replacing: O(log size), final sorting for output: O(size*log size), practically no memory overhead as you would have with a binary search tree).
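A possible Python sketch of the repeated-scan approach with a heap as the store. Everything named here is an assumption for illustration: `read_all` is a hypothetical callable that returns a fresh iterator over the read-only array each time it is called, `capacity` is how many integers fit in memory, and the max-heap is emulated by negating values in `heapq`.

```python
import heapq


def sorted_passes(read_all, capacity):
    """Yield the integers produced by `read_all()` in ascending order,
    keeping at most `capacity` of them in memory per scan."""
    bound = None    # largest number output so far
    emitted = 0     # how many copies of `bound` have been output in total
    while True:
        heap = []   # max-heap of the current chunk (values stored negated)
        seen = 0    # copies of `bound` met during this scan
        for x in read_all():
            if bound is not None:
                if x < bound:
                    continue            # already output in an earlier scan
                if x == bound:
                    seen += 1
                    if seen <= emitted:
                        continue        # this copy was already output
            if len(heap) < capacity:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                heapq.heapreplace(heap, -x)   # drop the largest, add x
        if not heap:
            return
        chunk = sorted(-v for v in heap)
        yield from chunk
        top = chunk[-1]
        copies = chunk.count(top)
        if top == bound:
            emitted += copies
        else:
            bound, emitted = top, copies
```

Each scan outputs up to `capacity` numbers, so the number of scans is roughly the array size divided by the store size, about five for the sizes in the question.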

Find duplicate strings in a large file

A file contains a large number (e.g. 10 billion) of strings and you need to find the duplicate strings. You have N systems available. How will you find the duplicates?
erickson's answer is probably the one expected by whoever set this question.
You could use each of the N machines as a bucket in a hashtable:
for each string, (say string number i in sequence) compute a hash function on it, h.
send the values of i and h to machine number n for storage, where n = h % N.
from each machine, retrieve a list of all hash values h for which more than one index was received, together with the list of indexes.
check the sets of strings with equal hash values, to see whether they're actually equal.
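The steps above can be simulated on one machine with a short Python sketch. The "machines" are just dictionaries in a list, `zlib.crc32` is used as a stand-in 32-bit hash, and the name `find_duplicates` is hypothetical; a real setup would send (i, h) pairs over the network instead.

```python
from collections import defaultdict
import zlib


def find_duplicates(strings, n_machines):
    """Return groups of indexes whose strings are equal (illustrative sketch)."""
    # Each "machine" stores hash value -> list of string indexes.
    machines = [defaultdict(list) for _ in range(n_machines)]
    for i, s in enumerate(strings):
        h = zlib.crc32(s.encode("utf-8"))      # 32-bit hash of the string
        machines[h % n_machines][h].append(i)  # send (i, h) to machine h % N
    duplicates = []
    for table in machines:
        for indexes in table.values():
            if len(indexes) < 2:
                continue
            # Equal hashes are only candidates: compare the actual strings.
            groups = defaultdict(list)
            for i in indexes:
                groups[strings[i]].append(i)
            duplicates.extend(g for g in groups.values() if len(g) > 1)
    return sorted(duplicates)
```

For instance, `find_duplicates(["a", "b", "a", "c", "b", "a"], 3)` groups indexes 0, 2, 5 and indexes 1, 4.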
To be honest, though, for 10 billion strings you could plausibly do this on 1 PC. The hashtable might occupy something like 80-120 GB with a 32-bit hash, depending on the exact hashtable implementation. If you're looking for an efficient solution, you have to be a bit more specific about what you mean by "machine", because it depends on how much storage each one has and on the relative cost of network communication.
Split the file into N pieces. On each machine, load as much of the piece into memory as you can, and sort the strings. Write these chunks to mass storage on that machine. On each machine, merge the chunks into a single stream, and then merge the stream from each machine into a stream that contains all of the strings in sorted order. Compare each string with the previous. If they are the same, it is a duplicate.
