Algorithm to find duplicated byte sequences

Hello everyone reading this post. Please help me solve a task:
As input I have an array of bytes. I need to detect duplicated sequences of bytes in order to compress the duplicates. Can anyone help me find an acceptable algorithm?

Usually this type of problem comes with a CPU/memory tradeoff.
The extreme solutions would be (1) and (2) below; you can improve from there:
(1) High CPU / low memory: iterate over all possible sequence lengths and contents and find duplicates (two nested for loops).
(2) Low CPU / high memory: create a lookup table (hash map) of all required sequences and lengths, traverse the array adding entries to the table, then traverse the table and find your candidates.
To improve from here: if you want the idea of (2) with lower memory, shrink the lookup table so it gets more hits and deal with the resulting false matches later; if you want faster lookup per sequence length, create a separate lookup table for each length.
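A minimal sketch of approach (2) for a single, fixed sequence length (Python; find_duplicates and seq_len are illustrative names, and you would run it once per length of interest):

def find_duplicates(data: bytes, seq_len: int):
    # One lookup table per length: sequence -> list of start positions.
    positions = {}
    for i in range(len(data) - seq_len + 1):
        seq = data[i:i + seq_len]
        positions.setdefault(seq, []).append(i)
    # Candidates for compression are the sequences seen more than once.
    return {seq: pos for seq, pos in positions.items() if len(pos) > 1}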

Build a tree whose branches are all the possible byte values (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of the positions where this sequence is found.
For example, let's say you are at node AC,40,2F. This path in the tree means: "Byte AC was found at position xx (one of the positions stored in that node). The next byte, 40, was at position yy = xx+1 (among others). The byte 2F was at position zz = yy+1."
Now you want to "compress" only sequences of some size (e.g. 5). So traverse the tree and pay attention to depths 5 or more. In a node at depth 5 you have already stored all the positions where such a sequence (or a longer one) is found in the array. Those positions are the ones you are interested in storing in your compressed file.
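A rough sketch of this tree in Python (the node layout and names are my own; depth is capped at a max_depth parameter to bound memory):

def build_tree(data: bytes, max_depth: int):
    # Each node: up to 256 children keyed by byte value, plus the start
    # positions of the sequence spelled out on the path from the root.
    root = {"children": {}, "positions": []}
    for start in range(len(data)):
        node = root
        for depth in range(min(max_depth, len(data) - start)):
            b = data[start + depth]
            node = node["children"].setdefault(b, {"children": {}, "positions": []})
            node["positions"].append(start)
    return root

# To compress sequences of length 5 or more, walk the tree to depth 5 and
# collect the nodes whose position list contains two or more entries.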

Related

Efficient data structure for storing nongrammatical strings

I need a data structure in which I can store objects of variable size and later modify their bytes or remove them, but not change their size (that would be done only by removing the object and reinserting it with the new size). Objects do not need random access, only sequential. I need its memory efficiency to approach 1 as the total memory allocated approaches infinity (assuming all pointers magically require constant space and we won't be questioning that).
I do know about tries and that they are popular for storing strings, but given my needs a trie is just not what I am looking for: my strings will not have "common" morphemes, they will be technical and pseudo-random. I am not storing words.
Another option I came up with is to have a magic constant M and then M vectors, where the k-th vector stores chunks of k bytes plus one pointer to another chunk (the previous block, in the context below). Additionally, the elements of the M-th vector have 2 pointers: one to the previous and one to the next chunk. I would then split the string I am about to insert into chunks of M bytes each and store them in the M-th vector as a linked list. The last chunk, with possibly fewer than M bytes, I would store in the appropriate other vector. When removing a string, I would remove all its chunks from the vectors and then relocate and reconnect the lingering chunks so that the vectors again consist of consecutive chunks, i.e. have no holes.
This idea satisfies my needs except for the converging efficiency. Additionally, there is the cost of M separate vectors, which cannot be ignored in practice.
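For concreteness, here is a rough, insertion-only sketch of the layout just described (Python; the names are mine, and removal with hole compaction is omitted):

M = 16                                    # the "magic constant"

# vectors[k] holds the chunks of exactly k bytes (k = 1..M). A chunk is
# [data, prev] for k < M and [data, prev, next] for k == M, where prev/next
# are (vector index, slot index) references or None.
vectors = {k: [] for k in range(1, M + 1)}

def insert(s: bytes):
    prev = None
    first = None
    for i in range(0, len(s), M):
        piece = s[i:i + M]
        k = len(piece)
        slot = len(vectors[k])
        vectors[k].append([piece, prev, None] if k == M else [piece, prev])
        if prev is not None and prev[0] == M:
            vectors[M][prev[1]][2] = (k, slot)    # link the previous chunk forward
        if first is None:
            first = (k, slot)
        prev = (k, slot)
    return first                                  # handle to the stored string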
Is there an already existing idea that explains how to build such a structure?

Merging two binary strings and separating them

I am preparing for a job interview and have a question.
I have 2 binary arrays, of sizes n and m. I need to create an algorithm for merging them together and then separating them again. The merged array also has to be a binary array. There is no information about the size of the merged array; I assume it could be n+m.
If you know the maximum size of A and B, then you can encode the sizes of A and B in binary and create a new binary array by multiplexing:
size of A
content of A
size of B
content of B
Then demultiplexing (separating A and B) is easy.
It is similar to what is performed in telecommunications.
Edit: I mentioned that the maximum size must be known. This is because, for demultiplexing, we need to know how many bits are used to encode the sizes; the number of bits for this encoding must therefore be fixed.
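A small sketch of this multiplexing scheme (Python; LEN_BITS is an assumed fixed width large enough for the maximum size, and the arrays are lists of 0/1 values):

LEN_BITS = 16                       # enough if max(len(A), len(B)) < 2**16

def merge(a, b):
    # Prefix each array with its length, written in LEN_BITS bits.
    header_a = [(len(a) >> i) & 1 for i in reversed(range(LEN_BITS))]
    header_b = [(len(b) >> i) & 1 for i in reversed(range(LEN_BITS))]
    return header_a + a + header_b + b

def separate(merged):
    def read_len(bits):
        n = 0
        for bit in bits:
            n = (n << 1) | bit
        return n
    la = read_len(merged[:LEN_BITS])
    a = merged[LEN_BITS:LEN_BITS + la]
    rest = merged[LEN_BITS + la:]
    lb = read_len(rest[:LEN_BITS])
    b = rest[LEN_BITS:LEN_BITS + lb]
    return a, b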

Given a continuous stream of words, remove the duplicates

I was asked this question recently.
Given a continuous stream of words, remove the duplicates while reading the input.
Example:
Input: This is next stream of question see it is a question
Output: This next stream of see it is a question
Starting from the end, question as well as is has already appeared once, so the second time it is ignored.
My solution:
Use hashing in this scenario for each word coming through the stream.
If there is a collision, then ignore that word.
It's definitely not a good solution. I was asked to optimize it.
What is the best approach to solve this problem?
Hashing isn't a particularly bad solution.
It gives expected O(wordLength) lookup time, but O(wordLength * wordCount) in the worst case, and uses O(maxWordLength * wordCount) space.
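For reference, the hashing approach amounts to keeping a hash set of the words seen so far (this keeps the first occurrence of each word):

def dedupe(stream):
    seen = set()                 # hash set of words seen so far
    for word in stream:
        if word not in seen:     # a true set lookup, so only real duplicates
            seen.add(word)       # (not mere hash collisions) are dropped
            yield word

# list(dedupe("This is next stream of question see it is a question".split()))
# -> ['This', 'is', 'next', 'stream', 'of', 'question', 'see', 'it', 'a']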
Alternatives:
Trie
A trie is a tree data structure where each edge corresponds to a letter and the path from the root defines the value of the node.
This will give O(wordLength) lookup time and uses O(wordCount * maxWordLength) space, although the actual space usage may be lower, as repeated prefixes (e.g. the shared te in tea and ten) only use space once.
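A minimal trie sketch used as the "seen" set (one dict per node keyed by character, with the key None marking the end of a word; this layout is my own):

def seen_before(root, word):
    # Walk/extend the path for word; report whether it was already present.
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    already = None in node       # None key marks "a word ends here"
    node[None] = True
    return already

# root = {}
# seen_before(root, "tea")  -> False (first time)
# seen_before(root, "tea")  -> True  (duplicate)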
Binary search tree
A binary search tree is a tree data structure where each node in the subtree rooted at the left child is smaller than its parent, and similarly all nodes to the right are greater.
A self-balancing one gives O(wordLength * log wordCount) lookup time and uses O(wordCount * maxWordLength) space.
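A plain, unbalanced BST sketch for the same "seen" set; a production version would use a self-balancing variant as noted above:

class Node:
    def __init__(self, word):
        self.word, self.left, self.right = word, None, None

def insert_if_absent(root, word):
    """Return (new_root, was_present)."""
    if root is None:
        return Node(word), False
    if word == root.word:
        return root, True
    if word < root.word:
        root.left, present = insert_if_absent(root.left, word)
    else:
        root.right, present = insert_if_absent(root.right, word)
    return root, present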
Bloom filter
A Bloom filter is a data structure consisting of some number of bits and a few hash functions, each of which maps a word to a bit; adding a word sets the bit chosen by each hash function, and a query reports the word as absent if any of those bits is not set.
This uses less space than the above solutions, but at the cost of false positives - some words will be marked as duplicates that aren't.
Specifically, it uses 1.44 log2(1/e) bits per key, where e is the false positive rate, giving O(wordCount) space usage, but with an incredibly low constant factor.
This will give O(wordLength) lookup time.
(Figure: an example of a Bloom filter representing the set {x, y, z}, with m = 18 bits and k = 3 hash functions; the element w is not in the set, because it hashes to a bit-array position containing 0.)
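A minimal Bloom filter sketch (Python; m bits, with k hash functions derived from the built-in hash using different salts, so all parameters and names here are illustrative):

class BloomFilter:
    def __init__(self, m=1 << 20, k=3):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, word):
        # k bit positions, one per salted hash function
        for salt in range(self.k):
            yield hash((salt, word)) % self.m

    def add(self, word):
        for p in self._positions(word):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, word):
        # False means definitely new; True means "probably seen" (false
        # positives possible, so some unique words may be dropped).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(word))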

Sorting with limited memory and read-only disk

Imagine the following scenario: I have a 10 Mb array of integers stored on a read-only storage medium. I wish to print out the numbers in ascending order. However, I only have 2 Mb of main memory (and no hard disk).
A very simple O(n^2) solution (which doesn't make use of the available main memory) would be to repeatedly scan the entire input array and incrementally output the next smallest integer. I've tried googling for better sorting algorithms, but the answers keep leading me to in-place or external sorting algorithms, which would not work because of the read-only storage constraint. Is there a better solution?
You can use the main memory to reduce the number of scans; with the ratio of sizes you gave, quite dramatically.
First scan: keep an in-memory store of nearly the main memory size holding the smallest numbers found so far. While the store is not yet full, add the next number read from the array. When the store is full, compare each new number to the largest number in the store; if the new one is smaller, remove the largest number and add the new one. When the complete array has been scanned, output the stored numbers in order, and remember the largest number stored and how often it occurred in this chunk.
Subsequent scans: if the scanned number equals the largest number from the previous chunk and its occurrence count in this scan is still smaller than the remembered count, increment the occurrence count but don't add it to the store; if its occurrence count is already larger than or equal to the remembered count, add the number to the store (removing the largest number from the store if necessary). If the scanned number is larger than the largest number of the previous scan but smaller than the largest number in the store (or the store is not yet full), add it to the store (removing the largest number if necessary). When the scan is complete, output the stored numbers in order, and remember the largest number output so far and how often it has been output in total (the largest number might be the same as in the previous scan, so you need to know how often it was output across all chunks treated so far).
I'm not sure what the best data structure for the store would be, but I think a heap would be a good choice (comparison with the largest element: O(1); replacing it: O(log size); final sorting for output: O(size * log size); and practically none of the memory overhead you would have with a binary search tree).
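A sketch of this approach in Python (read_array is assumed to return a fresh pass over the read-only data each time it is called, and store_size stands in for "nearly the main memory size"):

import heapq

def sort_readonly(read_array, store_size):
    boundary = None          # largest value output so far
    boundary_count = 0       # how many copies of it have already been output
    while True:
        heap = []            # max-heap (values negated) of the store_size smallest candidates
        skipped = 0          # copies of boundary skipped because already output
        for x in read_array():                 # one full scan of the read-only input
            if boundary is not None:
                if x < boundary:
                    continue
                if x == boundary and skipped < boundary_count:
                    skipped += 1
                    continue
            if len(heap) < store_size:
                heapq.heappush(heap, -x)
            elif x < -heap[0]:
                heapq.heapreplace(heap, -x)
        if not heap:
            break
        chunk = sorted(-v for v in heap)
        for x in chunk:
            print(x)                           # output this chunk in order
        new_count = chunk.count(chunk[-1])
        if chunk[-1] == boundary:
            boundary_count += new_count
        else:
            boundary, boundary_count = chunk[-1], new_count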

How to decode a Huffman tree?

Is there a better way than just going left or right based on the input digit, 0 or 1?
There are some papers on efficient decoding algorithms for Huffman Trees. Personally, I only used one of them, for academic reasons, but it was a long time ago. The title of the paper was "A Memory Efficient and Fast Huffman Decoding Algorithm" by Hong-Chung Chen, Yue-Li Wang and Yu-Feng Lan.
The algorithm gives results in O(log n) time. In order to use this algorithm, you have to construct a table with all the symbols of your tree (leaves), and for each symbol you have to specify a weight:
w(i) = 2 ^ (h - l)
where h is the height of the Huffman tree and l is the level of the symbol, and a count:
count(i) = count(i-1) + w(i)
The count of root, count(0), is equal to its weight.
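For concreteness, a tiny sketch of computing these weights and counts (the (symbol, level) input format is my assumption, and count(0) is taken to be the first entry's weight; the paper's actual decoding steps are not reproduced here):

def build_weight_table(symbols_with_levels, h):
    # symbols_with_levels: list of (symbol, level) pairs for the leaves,
    # ordered as the paper requires; h is the height of the Huffman tree.
    table = []
    count = 0
    for symbol, level in symbols_with_levels:
        w = 2 ** (h - level)      # w(i) = 2^(h - l)
        count = count + w         # count(i) = count(i-1) + w(i), count(0) = w(0)
        table.append((symbol, w, count))
    return table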
When you have all that, there are 3 simple steps in the algorithm, that are described in the paper.
I don't know if this is what you were looking for.
Yes, there is, and you can use lookup tables.
Note that you will use quite a bit of memory to store these tables, and you will either have to ship that table with the data (probably negating the effect of the compression altogether) or construct the table before decompressing (which would negate some, if not all, of the speedup you gain from it.)
Here's how the table would work.
Let's say, for a sample portion of compressed data, that the bit sequences for the symbols are as follows:
1 -> A
01 -> B
00 -> C
What you do now is to produce a table, indexed by a byte (since you need to read 1 byte minimum to decode the first symbol), containing entries like this:
key symbol bit-shift
1xxxxxxx A 7
01xxxxxx B 6
00xxxxxx C 6
The x's mean you need to store entries for all possible combinations of those x's, each mapping to that data. For the first row, this means you would construct a table where every byte key with the high bit set maps to A/7.
The table would have entries for all 256 key values: half of them mapped to A/7, and 25% each to B/6 and C/6, according to the rules above.
Of course, if the longest bit sequence for a symbol is 9-16 bits, you need a table keyed by a 16-bit integer, and so on.
Now, when you decode, here's how you would do this:
read first and second byte as key, append them together
loop:
look up the first 8 bits of the current key in the table
emit symbol found in table
shift the key bitwise to the left the number of bits specified in the table
if less than 8 bits left in the key, get next byte and append to key
When at the end, just pad with 0-bytes; as with all Huffman decompression, you need to know how many symbols to emit before you start.
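A sketch of this table-driven decoder in Python (the codes mapping and symbol_count are assumed inputs, as noted above; codes longer than 8 bits would need a wider table):

def build_table(codes):
    # codes: {symbol: bit string}, e.g. {"A": "1", "B": "01", "C": "00"}
    table = [None] * 256                          # indexed by the next 8 bits of input
    for symbol, bits in codes.items():
        prefix = int(bits, 2) << (8 - len(bits))  # code placed in the high bits
        for filler in range(1 << (8 - len(bits))):
            table[prefix | filler] = (symbol, len(bits))
    return table

def decode(data: bytes, codes, symbol_count):
    table = build_table(codes)
    out = []
    buf, buf_len, pos = 0, 0, 0
    while len(out) < symbol_count:
        while buf_len < 8:                        # keep at least 8 bits buffered
            byte = data[pos] if pos < len(data) else 0   # pad with 0-bytes at the end
            pos += 1
            buf = (buf << 8) | byte
            buf_len += 8
        key = (buf >> (buf_len - 8)) & 0xFF       # look up the top 8 bits
        symbol, used = table[key]
        out.append(symbol)
        buf_len -= used                           # consume only this code's bits
        buf &= (1 << buf_len) - 1
    return out

# decode(bytes([0b10100000]), {"A": "1", "B": "01", "C": "00"}, 3) -> ['A', 'B', 'C']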
Sure, instead of binary trees you can use k-ary trees and get O(log_k(n)) lookups. It's not much of a speedup.
If the max key size is small (say < 8 bits) or you've got lots of memory, you can use a straight lookup table & get O(1).
