Merging two binary strings and separating them - algorithm

I am preparing for a job interview and have a question.
I have two binary arrays of sizes n and m. I need to create an algorithm for merging them together and then separating them again. The merged array also has to be a binary array. There is no information about the size of the merged array; I assume it could be n+m.

If you know the maximum possible sizes of A and B, then you can encode the sizes of A and B in binary and create a new binary array by multiplexing:
size of A
content of A
size of B
content of B
Then demultiplexing (separating A and B) is easy.
It is similar to what is performed in telecommunications.
Edit: I mentioned that the maximum size must be known. This is because for demultiplexing we need to know how many bits are used to encode the sizes, so the number of bits for this encoding must be fixed.
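For concreteness, here is a minimal sketch of that layout (the bit-list representation, MAX_SIZE, and the helper names are my own assumptions, not part of the answer):

```python
# Fixed-width size field, then content, for each array:
# [size of A][A content][size of B][B content]
MAX_SIZE = 1024
SIZE_BITS = MAX_SIZE.bit_length()          # fixed number of bits per length field

def encode_length(n):
    """Length as a fixed-width big-endian bit list."""
    return [(n >> i) & 1 for i in range(SIZE_BITS - 1, -1, -1)]

def decode_length(bits):
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def multiplex(a, b):
    return encode_length(len(a)) + a + encode_length(len(b)) + b

def demultiplex(merged):
    pos = 0
    n = decode_length(merged[pos:pos + SIZE_BITS]); pos += SIZE_BITS
    a = merged[pos:pos + n];                        pos += n
    m = decode_length(merged[pos:pos + SIZE_BITS]); pos += SIZE_BITS
    b = merged[pos:pos + m]
    return a, b

A = [1, 0, 1, 1]
B = [0, 0, 1]
assert demultiplex(multiplex(A, B)) == (A, B)
```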

Related

Efficient data structure for storing nongrammatical strings

I need a data structure in which I can store objects of variable size and later modify their bytes or remove them, but not change their size (that would be done only by removing an object and reinserting it with the new size). Objects do not need random access, only sequential. I need its memory efficiency to approach 1 as the total memory allocated approaches infinity (assuming all pointers magically require constant space and we won't be questioning that).
I do know about tries and that they are popular for storing strings, but given my needs a trie is just not what I am looking for. After all, my strings will not have "common" morphemes; they will be technical and pseudo-random. I am not storing words.
Another option I came up with is to have a magic constant M and then M vectors, where the k-th vector stores chunks of k bytes plus one pointer to another chunk (the previous block, in the following context). Additionally, the elements in the M-th vector will have 2 pointers: one to the previous and one to the next chunk. Then I would split the string I am about to insert into chunks of M bytes each and store them in the M-th vector as a linked list. The last chunk, with possibly fewer than M bytes, I would store in the appropriate other vector. When removing a string, I would remove all its chunks from the vectors and then relocate and reconnect the remaining chunks so that the vectors consist of consecutive chunks, i.e. don't have holes.
This idea satisfies my needs except for the converging-efficiency requirement. Additionally, there is the cost of M separate vectors, which can't be ignored in practice.
Is there an already existing idea that explains how to build such a structure?
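To make the chunked-vector idea above concrete, here is a rough sketch (insertion only; removal and the hole-filling compaction are left out, and M, the names, and the use of list indexes as "pointers" are my own choices):

```python
M = 16                                  # the magic constant: full chunk size in bytes

# vectors[k] holds chunks of exactly k bytes; each entry is (chunk, prev) where
# prev is the index of the previous full-size chunk in vectors[M], or None.
vectors = {k: [] for k in range(1, M + 1)}

def insert(data: bytes):
    """Split data into M-byte chunks; full chunks go into vectors[M] as a
    linked list (only prev links here, for brevity), and the last, possibly
    shorter chunk goes into the vector matching its size."""
    prev = None
    full, tail = divmod(len(data), M)
    for i in range(full):
        vectors[M].append((data[i * M:(i + 1) * M], prev))
        prev = len(vectors[M]) - 1      # "pointer" to the chunk just stored
    if tail:
        vectors[tail].append((data[full * M:], prev))

insert(b"a technical, pseudo-random payload that is longer than M bytes")
```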

algorithm to find duplicated byte sequences

Hello everyone reading this post. Please help me solve a task:
As input I have an array of bytes. I need to detect duplicated sequences of bytes so that the duplicates can be compressed. Can anyone help me find an acceptable algorithm?
Usually this type of problem comes with CPU/memory tradeoffs.
The extreme solutions would be (1) and (2) below; you can improve from there:
(1) High CPU / low memory: iterate over all possible combination sizes and letters and find duplicates (two nested for statements).
(2) Low CPU / high memory: create a lookup table (hash map) of all required combinations and lengths, traverse the array and add to the table, then traverse the table and find your candidates.
To improve from here: if you want the idea from (2) with lower memory, decrease the lookup table size to get more hits and handle the remaining problem later; if you want faster lookup per sequence length, create a separate lookup table for each length. A sketch of (2) for a single fixed length follows below.
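A minimal sketch of approach (2) for one fixed sequence length L (L and the names are my choices, not part of the answer above):

```python
# Hash every L-byte window to its start offsets, then keep the windows that
# occur more than once.
from collections import defaultdict

def duplicated_sequences(data: bytes, L: int):
    positions = defaultdict(list)               # window -> list of start offsets
    for i in range(len(data) - L + 1):
        positions[data[i:i + L]].append(i)
    return {seq: offs for seq, offs in positions.items() if len(offs) > 1}

print(duplicated_sequences(b"abcabcXabc", 3))   # {b'abc': [0, 3, 7]}
```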
Build a tree whose branches are all possible byte values (256 branches).
Then traverse the array, building new sub-branches. At each node, store a list of positions where this sequence is found.
For example, let's say you are at node AC,40,2F. This sequence in the tree means: "Byte AC was found at position xx (one of the positions stored in that node). The next byte, 40, was at position yy=xx+1 (among others). The byte 2F was at position zz=yy+1."
Now suppose you want to "compress" only sequences of some size (e.g. 5). Traverse the tree and pay attention to nodes at depth 5 or more. In a subnode at depth 5 you have already stored all the positions where such a sequence (or a longer one) is found in the array. Those are the positions you are interested in storing in your compressed file.
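A rough sketch of this tree in Python, in case it helps (I store the start position of the sequence at every node, and the class layout and names are my own):

```python
# A 256-way trie over byte values; each node records the start positions of the
# sequence spelled by the path leading to it. Nodes at depth 5 holding more
# than one position are the repeated length-5 (or longer) sequences.
class Node:
    __slots__ = ("children", "positions")
    def __init__(self):
        self.children = {}          # byte value -> Node (up to 256 branches)
        self.positions = []         # where the sequence ending here starts

def build(data: bytes, depth: int = 5):
    root = Node()
    for start in range(len(data)):
        node = root
        for b in data[start:start + depth]:
            node = node.children.setdefault(b, Node())
            node.positions.append(start)
    return root

def repeats_at_depth(node, depth, path=()):
    """Yield (sequence, start positions) for depth-`depth` nodes with repeats."""
    if depth == 0:
        if len(node.positions) > 1:
            yield bytes(path), node.positions
        return
    for b, child in node.children.items():
        yield from repeats_at_depth(child, depth - 1, path + (b,))

root = build(b"abcdeXYabcdeZabcde")
print(list(repeats_at_depth(root, 5)))   # [(b'abcde', [0, 7, 13])]
```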

Data structure for iterating over subsets of a collection

This is a general question about algorithms/data structures.
There is no specific programming language.
I'm working with arrays of boolean values.
The size of the arrays is always 50.
I want to have a collection of these arrays.
I will need to iterate over my collection multiple times.
In order to increase performance, I would like to limit each iteration to a subset of the collection rather than the whole collection.
For instance: iterate only over the arrays that have FALSE in the 4th and 13th positions.
I will NOT need to search for TRUE values, only for FALSE values in certain positions of the array.
Note that the possible subsets can share elements without one being included in the other.
Is there any kind of data structure that could help me?
There is a short piece on this in Knuth Volume III "Sorting and Searching" section 6.5. There are a few simple schemes laid out there with somewhat limited effectiveness, which suggests to me that there are no magic answers to this hard problem of searching multi-dimensional spaces.
One approach described there (with parameters adjusted for your problem) is to divide your 50 bits into 5 10-bit chunks, and create 5 hash tables, each mapping a value for one of those 10-bit chunks to a list of all the records which have that value in the given position. Given values for particular bit positions, choose whichever one of the 10-bit chunks contains the most known bits, and use it to check only 512, 256, or 128 of the 1024 lists in the table for that chunk, depending on whether you know 1, 2, or 3 of the bit positions in that chunk.
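A minimal sketch of that scheme, in case it helps make it concrete (names, 0-indexed bit positions, and the query shape are my assumptions):

```python
# Five hash tables, one per 10-bit chunk of the 50-bit array; each maps a chunk
# value to the records having that value. A query picks the chunk with the most
# constrained bits and enumerates only the chunk values that agree with them.
from itertools import product

CHUNKS, CHUNK_BITS = 5, 10
tables = [dict() for _ in range(CHUNKS)]         # chunk value -> list of records

def chunk_value(bits, c):
    """Integer value of chunk c of a 50-element boolean list."""
    v = 0
    for b in bits[c * CHUNK_BITS:(c + 1) * CHUNK_BITS]:
        v = (v << 1) | int(b)
    return v

def insert(record):
    for c in range(CHUNKS):
        tables[c].setdefault(chunk_value(record, c), []).append(record)

def candidates(constraints):
    """constraints maps 0-indexed bit positions to required values."""
    best = max(range(CHUNKS),
               key=lambda c: sum(1 for p in constraints if p // CHUNK_BITS == c))
    fixed = {p % CHUNK_BITS: int(v) for p, v in constraints.items()
             if p // CHUNK_BITS == best}
    found = []
    for free in product([0, 1], repeat=CHUNK_BITS - len(fixed)):
        free = iter(free)
        value = 0
        for i in range(CHUNK_BITS):
            value = (value << 1) | (fixed[i] if i in fixed else next(free))
        found.extend(tables[best].get(value, []))
    return found     # candidates still need checking against the other chunks

insert([False] * 50)
print(len(candidates({3: False, 12: False})))    # FALSE in the 4th and 13th bits
```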

Predict Huffman compression ratio without constructing the tree

I have a binary file and I know the number of occurrences of every symbol in it. I need to predict the length of the compressed file IF I were to compress it using the Huffman algorithm. I am only interested in the hypothetical output length and not in the codes for individual symbols, so constructing the Huffman tree seems redundant.
As an illustration, I need to get something like "A binary string of 38 bits which contains 4 a's, 5 b's and 10 c's can be compressed down to 28 bits", except both the file and the alphabet size are much larger.
The basic question is: can it be done without constructing the tree?
Looking at the greedy algorithm: http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
it seems the tree can be constructed in O(n*log(n)) time, where n is the number of distinct symbols in the file. This is not bad asymptotically, but it requires memory allocation for tree nodes and does a lot of work which in my case goes to waste.
The lower bound on the average number of bits per symbol in the compressed file is nothing but the entropy H = -sum(p(x)*log(p(x))) over all symbols x in the input, where p(x) = freq(x)/filesize. Using this, the lower bound on the compressed length is filesize*H. Unfortunately this entropy bound is not achievable in most cases, because code lengths are integral rather than fractional, so in practice the Huffman tree does need to be constructed to get the exact compressed size. Still, the entropy-based size can be used as an upper bound on the amount of compression possible and to decide whether to use Huffman at all.
You can upper bound the average bit count per symbol in Huffman coding by H(p1, p2, ..., pn) + 1, where H is the entropy and each pi is the probability of symbol i occurring in the input. If you multiply this value by the input size N, it gives you an approximate (upper-bound) length of the encoded output.
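Taken together with the entropy lower bound from the previous answer, this gives a quick two-sided estimate. A small sketch (the function name is mine, and the example frequencies are the ones from the question):

```python
# N*H is a lower bound on the Huffman-coded size and N*(H+1) an upper bound,
# where H is the entropy of the symbol frequencies and N the number of symbols.
from math import log2

def huffman_size_bounds(freqs):
    n = sum(freqs.values())
    h = -sum((f / n) * log2(f / n) for f in freqs.values())
    return n * h, n * (h + 1)                # (lower, upper) bound in bits

lo, hi = huffman_size_bounds({'a': 4, 'b': 5, 'c': 10})
print(lo, hi)    # roughly 27.9 and 46.9 bits; the actual Huffman size is 28
```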
You could easily modify the algorithm to build the binary tree in an array. The root node is at index 0, its left node at index 1 and right node at index 2. In general, a node's children will be at (index*2) + 1 and (index * 2) + 2. Granted, this still requires memory allocation, but if you know how many symbols you have you can compute how many nodes will be in the tree. So it's a single array allocation.
I don't see where the work that's done really goes to waste. You have to keep track of the combining logic somehow, and doing it in a tree as shown is pretty simple. I know that you're just looking for the final answer--the length of each symbol--but you can't get that answer without doing the work.
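For what it's worth, if only the total length is needed, the combining work can also be tracked with a plain heap of frequencies and no per-node bookkeeping at all. This is a sketch of that shortcut, not of the array layout described above:

```python
# Run the greedy Huffman merges on a heap of frequencies and sum the merged
# weights: each merge adds one bit to every symbol underneath it, so the sum
# of all merge weights is exactly the compressed size in bits.
import heapq

def huffman_compressed_bits(freqs):
    heap = list(freqs)                  # frequencies only; no codes, no nodes
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b                  # one extra bit for every symbol below
        heapq.heappush(heap, a + b)
    return total

print(huffman_compressed_bits([4, 5, 10]))   # 28, matching the question's example
```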

Find duplicate strings in a large file

A file contains a large number (e.g. 10 billion) of strings and you need to find the duplicate strings. You have N systems available. How will you find the duplicates?
erickson's answer is probably the one expected by whoever set this question.
You could use each of the N machines as a bucket in a hashtable (a small single-process sketch of these steps follows below):
For each string (say string number i in the sequence), compute a hash function h on it.
Send the values of i and h to machine number n for storage, where n = h % N.
From each machine, retrieve a list of all hash values h for which more than one index was received, together with the list of indexes.
Check the sets of strings with equal hash values to see whether they're actually equal.
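Here is a small single-process simulation of those steps, with each "machine" played by a dict (in reality the send and retrieve steps would go over the network; all names are mine):

```python
# Simulate the hash-bucket-per-machine scheme: route (index, hash) pairs to
# machine h % N, then collect the hash values that received more than one index.
from collections import defaultdict

N = 4                                            # number of machines

def duplicate_candidates(strings):
    machines = [defaultdict(list) for _ in range(N)]     # hash -> indexes
    for i, s in enumerate(strings):
        h = hash(s)
        machines[h % N][h].append(i)             # "send i and h to machine h % N"
    groups = []
    for m in machines:
        for h, indexes in m.items():
            if len(indexes) > 1:                 # more than one index received
                groups.append(indexes)
    return groups                                # still needs the equality check

strings = ["foo", "bar", "foo", "baz", "bar"]
print([[strings[i] for i in g] for g in duplicate_candidates(strings)])
# [['foo', 'foo'], ['bar', 'bar']] in some order
```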
To be honest, though, for 10 billion strings you could plausibly do this on 1 PC. The hashtable might occupy something like 80-120 GB with a 32 bit hash, depending on exact hashtable implementation. If you're looking for an efficient solution, you have to be a bit more specific what you mean by "machine", because it depends how much storage each one has, and the relative cost of network communication.
Split the file into N pieces. On each machine, load as much of the piece into memory as you can, and sort the strings. Write these chunks to mass storage on that machine. On each machine, merge the chunks into a single stream, and then merge the stream from each machine into a stream that contains all of the strings in sorted order. Compare each string with the previous. If they are the same, it is a duplicate.
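A compressed single-machine sketch of this sort-and-merge approach (here the sorted runs live in memory and the per-machine storage and network merge are collapsed into heapq.merge; the chunk size is a toy value of mine):

```python
# Sort fixed-size chunks of the input, merge the sorted runs into one ordered
# stream, and compare each string with the previous one to spot duplicates.
import heapq

def find_duplicates(strings, chunk_size=1_000_000):
    runs = [sorted(strings[i:i + chunk_size])            # sort what fits in memory
            for i in range(0, len(strings), chunk_size)]
    duplicates, previous = set(), None
    for s in heapq.merge(*runs):                         # one globally sorted stream
        if s == previous:                                # same as the previous string
            duplicates.add(s)
        previous = s
    return duplicates

print(find_duplicates(["b", "a", "c", "a", "b"], chunk_size=2))   # {'a', 'b'}
```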

Resources