Determin Depth of Huffman Tree using Input Character pattern (or Frequency)? - algorithm

I'd like to as a variation on this question regarding Huffman tree building. Is there anyway to calculate the depth of a Huffman tree from the input (or frequency), without drawing tree.
if there is no quick way, How the answer of that question was found? Specific Example is : 10-Input Symbol with Frequency 1 to 10 is 5.

If you are looking for an equation to take the frequencies and give you the depth, then no, no such equation exists. The proof is that there exist sets of frequencies on which you will have arbitrary choices to make in applying the Huffman algorithm that result in different depth trees! So there isn't even a unique answer to "What is the depth of the Huffman tree?" for some sets of frequencies.
A simple example is the set of frequencies 1, 1, 2, and 2, which can give a depth of 2 or 3 depending on which minimum frequencies are paired when applying the Huffman algorithm.
The only way to get the answer is to apply the Huffman algorithm. You can take some shortcuts to get just the depth, since you won't be using the tree at the end. But you will be effectively building the tree no matter what.
You might be able to approximate the depth, or at least put bounds on it, with an entropy equation. In some special cases the bounds may be restrictive enough to give you the exact depth. E.g. if all of the frequencies are equal, then you can calculate the depth to be the ceiling of the log base 2 of the number of symbols.
A cool example that shows that a simple entropy bound won't be strong enough to get the exact answer is when you use the Fibonacci sequence for the frequencies. This assures that the depth of the tree is the number of symbols minus one. So the frequencies 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, and 610 will result in a depth of 14 bits even though the entropy of the lowest frequency symbol is 10.64 bits.

Related

Sorting algorithm based on subset inversion

I'm looking for a sorting algorithm based on subset inversion. It's like pancake sort, only instead of taking all the pancakes on top of the spatula, you can just invert any subset you want. Length of the subset doesn't matter.
Like this:
http://www.yourgenome.org/sites/default/files/illustrations/diagram/dna_mutations_inversion_yourgenome.png
So we can't simply swap numbers without inverting everything in between.
We're doing this to determine how one subspecies of fruitfly can mutate into the other. Both have the same genes but in a different order. The second subspecies' genome is 'sorted', i.e. the gene numbers are 1-25. The first subspecies genome is unsorted. Hence, we're looking for a sorting algorithm.
This is the "genome" we're looking at (though we should be able to have this work on all lists of numbers):
[23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18, 12, 13, 14, 15, 16, 17, 21, 3, 4, 9];
We're looking at two separate problems:
1) To sort a list of 25 numbers with the least amount of inversions
2) To sort a list of 25 numbers with the least amount of numbers moved
We also want to establish both upper and lower bounds for both.
We've already found a way to sort like this by just going from left to right, searching for the next lowest value and inverting everything in between, but we're absolutely certain we should be able to do this faster. However, we still haven't found any other methods so I'm asking for your help!
UPDATE: the method we currently use is based on the above method
but instead works both ways. It looks at the next elements needed
for both ends (e.g. 1 and 25 at the beginning) and then calculates
which inversion would be cheapest. All values at the ends can be
ignored for the rest of the algorithm because they get put into the
correct place immediately. Our first method took 18/19 steps and 148
genes, and this one does it in 17 steps and 101 genes. For both
optimalisation tactics (the two mentioned above), this is a better
method. It is however not cheaper in terms of code and processing.
Right now, we're working in Python because we have most experience with that, but I'd be happy with any pseudocode ideas on how we can more efficiently tackle this. If you think another language might be better suited, please let me know. Pseudocode, ideas, thoughts and actual code are all welcome!
Thanks in advance!
Regarding the first question: Do you know (and care about) which of the two strands the genes are on?
If so, you're in luck: This is called the inversion distance between signed permutations problem, and there is a linear-time algorithm for it: http://www.ncbi.nlm.nih.gov/pubmed/11694179. I haven't looked at the details.
If not, then unfortunately (as described on p. 2 of that paper) the problem is NP-hard, so it's very unlikely that any algorithm exists that is efficient (polynomial-time) in the worst case.
Regarding the second question: Assuming you mean that you want to find the minimum number of swaps needed to sort a list of numbers, you should be able to find solutions to this by searching here on SO and elsewhere. I think this is a clear and concise explanation. You can also use the optimal solution to this problem to get an upper bound for your first question: Any swap of positions i and j can be simulated using the two interval reversals (i, j) and (i+1, j-1). (This upper bound might be very bad, though, and in particular could be worse than your existing greedy algorithm.)
I think what you're looking for for the second question is the minimum number of swaps of adjacent elements to sort a sequence, which is equal to the number of inversions in the sequence (where a[i] > a[j] and i < j).
The first question seems quite a bit more complicated to me. One potential heuristic might be to think of the subset inversion as similar to the adjacent swap of more than one element. For example, if you've managed to get a sequence to this position,
5,6,1,2,3,4,7,8
we can "adjacent swap" indexes [0,1] with [2,3] (so inverting [0,1,2,3]),
2,1,6,5,3,4,7,8
and then [2,3] with [4,5] (inverting [2,3,4,5]),
2,1,4,3,5,6,7,8
and arrive at a sequence that now has significantly less element inversions, meaning less single adjacent swaps are needed to now complete the sort.
So maybe attempting to quantify inversions (in the sense of a[i] > a[j] and i < j) of sections rather than single elements could help move in the direction of estimating or building a method for the first question.

Check for duplicate subsequences of length >= N in sequence

I have a sequence of values and I want to know if it contains an repeated subsequences of a certain minimum length. For instance:
1, 2, 3, 4, 5, 100, 99, 101, 3, 4, 5, 100, 44, 99, 101
Contains the subsequence 3, 4, 5, 100 twice. It also contains the subsequence 99, 101 twice, but that subsequence is two short to care about.
Is there an efficient algorithm for checking the existence of such a subsequence? I'm not especially interested in location the sequences (though that would be helpful for verification), I'm primarily just interested in a True/False answer, given a sequence and a minimum subsequence length.
My only approach so far is to brute force search it: for each item in the sequence, find all the other locations where the item occurs (already at O(N^2)), and then walk forward one step at a time from each location and see if the next item matches, and keep going until I find a mismatch or find a matching subsequence of sufficient length.
Another thought I had but haven't been able to develop into an actual approach is to build a tree of all the sequences, so that each number is a node, and a child of its the number that preceded it, whereever that node happens to already be in the tree.
There are O(k) solutions (k - the length of the whole sequence) for any value of N.
Solution #1: Build a suffix tree for the input sequence(using Ukkonen's algorithm). Iterate over the nodes with two or more children and check if at least one of them has depth >= N.
Solution #2: Build a suffix automaton for the input sequence.Iterate over all the states which right context contains at least two different strings and check if at least one of those nodes has distance >= N from the initial state of the automaton.
Solution #3:Suffix array and the longest common prefix technique can also be used(build the suffix array for input sequence , compute the longest common prefix array, check that there is a pair of adjacent suffices with common prefix with length at least N).
These solutions have O(k) time complexity under the assumption that alphabet size is constant(alphabet consists of all elements of the input sequence).
If it is not the case, it is still possible to obtain O(k log k) worst case time complexity(by storing all transitions in a tree or in an automaton in a map) or O(k) on average using hashmap.
P.S I use terms string and sequence interchangeably here.
If you only care about subsequences of length exactly N (for example, if just want to check that there are no duplicates), then there is a quadratic solution: use the KMP algorithm for every subsequence.
Let's assume that there are k elements in the whole sequence.
For every subsequence of length N (O(k) of them):
Build its failure function (takes O(N))
Search for it in the remainder of the sequence (takes O(k))
So, assuming N << k, the whole algorithm is indeed O(k^2).
Since your list is unordered, you're going to have to visit every item at least once.
What I'm thinking is that you first go through your list and create a dictionary where you store the number as a key along with all the indices it appears in your sequence. Like:
Key: Indices
1: 0
2: 1
3: 2, 8
....
Where the number 1 appears at index 0, the number 2 appears at index 1, the number 3 appears at indices 2 and 8, and so on.
With that created you can then go through the dictionary keys and start comparing it against the sequences at the other locations. This should save on some of the brute force since you don't have to revisit each number through the initial sequence each time.

Something like a reversed random number generator

I really don't know what the name of this problem is, but it's something like lossy compression, and I have a bad English, but I will try to describe it as much as I can.
Suppose I have list of unsorted unique numbers from unknown source, the length is usually between 255 to 512 with a range from 0 to 512.
I wonder if there is some kind of an algorithm that reads the data and return something like a seed number that I can use to generate a list somehow close to the original but with some degree of error.
For example
original list
{5, 13, 25, 33, 3, 10}
regenerated list
{4, 10, 30, 30, 5, 5} or {8, 20, 20, 35, 5, 9} //and so on
Does this problem have a name, and is there an algorithm that can do what I just described?
Is it the same as Monte Carlo method because from what I understand it isn't.
Is it possible to use some of the techniques used in lossy compression to get this kind of approximation ?
What I tried to do to solve this problem is to use a simple 16 bit RNG and brute-force all the possible values comparing them to the original list and pick the one with the minimum difference, but I think this way is rather dumb and inefficient.
This is indeed lossy compression.
You don't tell us the range of the values in the list. From the samples you give we can extrapolate that they count at least 6 bits each (0 to 63). In total, you have from 0 to 3072 bits to compress.
If these sequences have no special property and appear to be random, I doubt there is any way to achieve significant compression. Think that the probability of an arbitrary sequence to be matched from a 32 bits seed is 2^32.2^(-3072)=7.10^(-916), i.e. less than infinitesimal. If you allow 10% error on every value, the probability of a match is 2^32.0.1^512=4.10^(-503).
A trivial way to compress with 12.5% accuracy is to get rid of the three LSB of each value, leading to 50% savings (1536 bits), but I doubt this is what you are looking for.
Would be useful to measure the entropy of the sequences http://en.wikipedia.org/wiki/Entropy_(information_theory) and/or possible correlations between the values. This can be done by plotting all (V, Vi+1) pairs, or (Vi, Vi+1, Vi+2) triples and looking for patterns.

Find electric circuit with given resistance

I need some help with the following problem:
Given a set of resistances, need to construct circuit with given resistance (i.e. we choose some resistors and construct circuit). Only parallel and sequential connection are allowed. So, the formal definition of such circuit is the following:
Circuit = Resistance | (Sequential (Circuit) (Circuit a)) |
(Parallel (Circuit) (Circuit))
The total number of circuits with N unlabeled resistors (where all resistors are used) is A000084 (Thanks Axel Kemper). But in my case resistors are labeled and I don't know how to check all circuits efficiently.
Number of resistors is about 15, is it possible to solve this problem?
UPD. Resistors may have different resistance. And of course, some resistances can't be achieved, in such case we just say that there is no solutions.
Integer sequence A000084 lists the Number of series-parallel networks with n unlabeled edges. Also called yoke-chains by Cayley and MacMahon. MacMahon's paper is online.
The first 15 elements of the sequence:
1, 2, 4, 10, 24, 66, 180, 522, 1532, 4624, 14136, 43930, 137908, 437502, 1399068
If the resistors have different resistance values, they are not "unlabeled".
The number of different overall-resistances is less than the number of networks.
Looking at the numbers, brute-force enumeration is probably feasible for moderate values of n.
It is not possible to match every conceivable total resistance exactly. As mentioned in a comment: The number of 15 resistors might be too small to reach the required value. Other example: If all 15 restors have 1 ohm each, the total resistance cannot be smaller than 1/15 ohm.
Look on page 70 of Analytic Combinatorics to find an illustration of the equivalence between a tree, a bracketed expression and a series-parallel graph:
Like mentioned in one of the comments, a search procedure like A* could be used to search the space of possible trees. The tree representation of the series-parallel network is also useful to determine the source-to-sink resistance with a simple recursive function.

Designing a data structure that works in O(logn) time

I'm auditing this algorithms class for work and I'm trying to do some practice problems given in class. This problem has me stumped and I just can't wrap my head around it. None of my solutions come out in O(logn) time. Can anyone help me with this problem??
Question:
Suppose that we are given a sequence of n values x1, x2, ... , xn in an arbitrary order and
seek to quickly answer repeated queries of the form: given an arbitrary pair i and j with
1 ≤ i < j ≤ n, find the smallest value in x1, ... , xj . Design a data structure that uses O(n) space and answers each query in O(log n) time.
For input of a1,a2,a3,...an , construct a node that contains minimum of (a1,..,ak) and minimum of (ak+1,..,an) where k = n/2.
Recursively construct the rest of the tree.
Now, if you want to find the minimum between ai and aj:
Identify the lowest common ancestor of i,j. Let it be k
Start with i and keep moving until you hit k. AT every iteration check if the child node was left node. If yes, then compare the right subtree's min and update current min accordingly.
Similarly, for j, check if it is right node....
At node k compare values returned by each subtree and return the min
People are overthinking this. Suppose that you start with the list:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
Make the following list of lists:
47, 13, 55, 29, 56, 9, 17, 48, 69, 15
13, 29, 9, 17, 15
13, 9, 15
9, 15
9
I leave the construction of these lists, correct usage, and proof that they provide an answer to the original question as exercises for the reader. (It might not be homework for you, but it could easily be for someone, and I don't like giving complete answers to homework questions.)
I think the crucial step is that you'll need to sort the data before hand. Then you can store the data in an array/list. Then you can run through a quick binary search in O(logn), picking out the first value that satisfies the condition (I'm assuming you meant between xi and xj, not x1 and xj).
edit: on second thought, ensuring that the value satisfies the condition may not be as trivial as I thought
The question was asked before in a slightly different way: What data structure using O(n) storage with O(log n) query time should I use for Range Minimum Queries?
Nevertheless, to quickly answer, the problem you're facing it's a well studied one - Range Minimum Query. A Segment Tree is a Data Structure that can solve the problem with O(N) space and O(logN) time requirements. You can see more details in here, where there's an explanation of the structure and the complexities involved.
Trying to explain the suggested data structure:
For every pair of numbers, calculate and keep the value of the smaller one.
For every four consecutive numbers, calculate and keep the value of the smallest of the four. This is done quickly by picking the smaller of the two pair values.
For every eight consecutive numbers, calculate and keep the value of the smallest of the eight.
And so on.
Let's say we want the smallest value of x19 to x65.
We look at the following stored values:
Smallest of x32 to x63.
Smallest of x24 to x31.
Smallest of x20 to x23.
x19.
Smallest of x64 to x65.
Then we pick the smallest of these.

Resources