How is the priority chosen while constructing the Huffman Tree? - algorithm

Suppose my characters and their frequencies are as follows:
Char Freq.
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
When constructing a tree, at step 2 we have this:
    [3]      [3]   [4]   [5]   [6]   [7]   [8]
   /   \      c     d     e     f     g     h
  /     \
[1]     [2]
 a       b
Now, since we have two 3s, how can we determine the priority of them?
In the Huffman Coding this is considered as:
[3]      [3]      [4]   [5]   [6]   [7]   [8]
 c      /   \      d     e     f     g     h
       /     \
     [1]     [2]
      a       b
Why?

What's the difference? Ignoring d through h for the moment, in the first case you'd get
a = 00
b = 01
c = 1
and in the second case,
a = 10
b = 11
c = 0
As long as c is at the same height in the final tree, its code will have the same length.

I would take c with bigger priority (shorter code). This would be in-line with the basic principle of Huffman trees: priority/shorter code for immediate results and lower priority for more parsing.

Your case is not interesting. The assignment of 0's and 1's to the branches is arbitrary, so the choice you outline results in the same code, i.e. the same code lengths, either way.
There are, however, interesting cases where you face a choice among three or more groups with the same total frequency and different shapes. Any choice will result in the same overall optimality, i.e. exactly the same total number of bits to encode the provided symbols at the provided frequencies. However, the choices can result in differently shaped trees with different combinations of bit lengths. Such a choice can then be made to arrive at deeper or shallower trees, depending on what is desired.
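To make the effect of tie-breaking concrete, here is a minimal sketch, not a canonical implementation. It assumes ties on frequency are broken by subtree height; the names `huffman_lengths`, `total_bits` and the `prefer_shallow` flag are hypothetical and chosen only for this illustration. The frequencies 1, 1, 2, 2 are picked because they produce a three-way tie after the first merge.

```python
import heapq
from itertools import count


def huffman_lengths(freqs, prefer_shallow=True):
    """Return {symbol: code length}; weight ties are broken by subtree height."""
    sign = 1 if prefer_shallow else -1
    tick = count()  # unique counter so the heap never has to compare dicts
    # Heap entries: (weight, signed height used as tie-break key, tick, lengths)
    heap = [(f, 0, next(tick), {s: 0}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, k1, _, a = heapq.heappop(heap)
        w2, k2, _, b = heapq.heappop(heap)
        height = max(abs(k1), abs(k2)) + 1
        # Every symbol under the new parent gets one level deeper.
        lengths = {s: n + 1 for s, n in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, sign * height, next(tick), lengths))
    return heap[0][3]


def total_bits(freqs, lengths):
    """Total encoded size for the given frequencies and code lengths."""
    return sum(freqs[s] * lengths[s] for s in freqs)


freqs = {"a": 1, "b": 1, "c": 2, "d": 2}
shallow = huffman_lengths(freqs, prefer_shallow=True)   # all lengths 2
deep = huffman_lengths(freqs, prefer_shallow=False)     # lengths 3, 3, 2, 1
print(shallow, total_bits(freqs, shallow))
print(deep, total_bits(freqs, deep))
```

Both runs need 12 bits in total, but the first tree is balanced while the second is a chain, which is the kind of choice described above.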

Related

How to test a new random sequence in NIST Test Suite?

I have to test a random sequence using the NIST Test Suite. I downloaded it and ran the tests on the files provided in the data directory, which works fine. But when I run it on a new random sequence, I get an igamc: UNDERFLOW error. The random sequence is generated in Matlab using
bs=fix(randi([0 1],1,k));
and then saved as .dat file using
dlmwrite('bs.dat', bs);
I copied the bs.dat into the data folder and executed the test as follows. Can someone tell me what's wrong here?
ash#computer:~/Documents/NIST_Test_Original/sts-2.1.2$ ./assess 1000000
G E N E R A T O R S E L E C T I O N
______________________________________
[0] Input File [1] Linear Congruential
[2] Quadratic Congruential I [3] Quadratic Congruential II
[4] Cubic Congruential [5] XOR
[6] Modular Exponentiation [7] Blum-Blum-Shub
[8] Micali-Schnorr [9] G Using SHA-1
Enter Choice: 0
User Prescribed Input File: data/bs.txt
S T A T I S T I C A L T E S T S
_________________________________
[01] Frequency [02] Block Frequency
[03] Cumulative Sums [04] Runs
[05] Longest Run of Ones [06] Rank
[07] Discrete Fourier Transform [08] Nonperiodic Template Matchings
[09] Overlapping Template Matchings [10] Universal Statistical
[11] Approximate Entropy [12] Random Excursions
[13] Random Excursions Variant [14] Serial
[15] Linear Complexity
INSTRUCTIONS
Enter 0 if you DO NOT want to apply all of the
statistical tests to each sequence and 1 if you DO.
Enter Choice: 0
INSTRUCTIONS
Enter a 0 or 1 to indicate whether or not the numbered statistical
test should be applied to each sequence.
123456789111111
012345
110000000000000
P a r a m e t e r A d j u s t m e n t s
-----------------------------------------
[1] Block Frequency Test - block length(M): 128
Select Test (0 to continue): 0
How many bitstreams? 5
Input File Format:
[0] ASCII - A sequence of ASCII 0's and 1's
[1] Binary - Each byte in data file contains 8 bits of data
Select input mode: 0
Statistical Testing In Progress.........
igamc: UNDERFLOW
igamc: UNDERFLOW
igamc: UNDERFLOW
igamc: UNDERFLOW
igamc: UNDERFLOW
Statistical Testing Complete!!!!!!!!!!!!

For Ternary Huffman problem, can we make a tree (or encoding scheme) for "4" characters?

Say I have 4 characters with these frequencies:
freq(a)=5 freq(b)=3 freq(c)=2 freq(d)=2
How will I encode them in the form of 0,1,2 such that no code word is a prefix of another code word?
The standard algorithm for generating the optimal ternary Huffman code (as alluded to by rici) involves first making sure there are an odd number of symbols -- by adding a dummy symbol (of frequency 0) if necessary.
In this case, we start with an even number of symbols, so we need to add the dummy symbol that I call Z:
freq(a)=5 freq(b)=3 freq(c)=2 freq(d)=2 freq(Z)=0.
Then as Photon described, we repeatedly combine the 3 nodes with the lowest frequencies into 1 combined symbol. Each time we replace 3 nodes with 1 node, we reduce the total number of nodes by 2, and so the total number of nodes remains odd at each step. In the last step (if we've added the correct number of dummy symbols) we will combine 3 final nodes into a single root node.
        abcdZ:12
       /   |   \
     2/   1|   0\
  cdZ:4   b:3   a:5
  /  |  \
2/  1|  0\
Z:0 d:2 c:2
So in this case one optimal (Huffman) ternary coding is:
a: 0
b: 1
c: 20
d: 21
Z: 22 (should never occur).
See
https://en.wikipedia.org/wiki/Huffman_coding#n-ary_Huffman_coding
for more details.
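The padding-and-merging procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: the function name `nary_huffman` is hypothetical, dummy symbols are represented by `None`, and the digits assigned to branches are arbitrary, so the exact codewords may differ from the diagram while the code lengths agree.

```python
import heapq
from itertools import count


def nary_huffman(freqs, n=3):
    """Return {symbol: codeword} over the digits 0..n-1 (assumes n <= 10)."""
    tick = count()  # unique counter so the heap never has to compare lists
    heap = [(f, next(tick), [s]) for s, f in freqs.items()]
    # Pad with zero-frequency dummies until every merge can take exactly n
    # nodes, i.e. until (number of nodes - 1) is a multiple of (n - 1).
    while (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(tick), [None]))
    heapq.heapify(heap)
    codes = {s: "" for s in freqs}
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(n)]
        members = []
        for digit, (_, _, symbols) in enumerate(group):
            for s in symbols:
                if s is not None:  # dummies never receive a codeword
                    codes[s] = str(digit) + codes[s]
            members.extend(symbols)
        heapq.heappush(heap, (sum(g[0] for g in group), next(tick), members))
    return codes


freqs = {"a": 5, "b": 3, "c": 2, "d": 2}
codes = nary_huffman(freqs)
print(codes)
print(sum(freqs[s] * len(codes[s]) for s in freqs))  # total ternary digits
```

For the frequencies in the question this gives code lengths 1, 1, 2, 2 for a, b, c, d, i.e. 16 ternary digits in total.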
For classical (binary) Huffman you keep merging the 2 lowest-frequency nodes at a time to build a tree, then assign 1 to the left (or right) edge and 0 to the other edge; the DFS path from the root to a node is that node's code.
In this case the coding is:
a - 1
b - 01
c - 001
d - 000
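For ternary Huffman you instead join the 3 lowest-frequency nodes at a time (or fewer if not enough nodes remain for the last step). Note that merging fewer nodes in the last step, instead of padding with a zero-frequency dummy symbol first, does not guarantee an optimal code: for these frequencies it costs 19 code symbols in total, versus 16 for the dummy-padded tree.
In this case the coding is: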
a - 2
b - 12
c - 11
d - 10

Find the number of substrings in a string containing equal numbers of a, b, c

I'm trying to solve this problem. Now, I was able to get a recursive solution:
If DP[n] gives the number of beautiful substrings (defined in problem) ending at the nth character of the string, then to find DP[n+1], we scan the input string backward from the (n+1)th character until we find an ith character such that the substring beginning at the ith character and ending at the (n+1)th character is beautiful. If no such i can be found, DP[n+1] = 0.
If such a string is found then, DP[n+1] = 1 + DP[i-1].
The trouble is, this solution gives a timeout on one testcase. I suspect it is the scanning backward part that is problematic. The overall time complexity for my solution seems to be O(N^2). The size of the input data seems to indicate that the problem expects an O(NlogN) solution.
You don't really need dynamic programming for this; you can do it by iterating over the string once and, after each character, storing the state (the relative number of a's, b's and c's that were encountered so far) in a dictionary. This dictionary has maximum size N+1, so the overall time complexity is O(N).
If you find that at a certain point in the string there are e.g. 5 more a's than b's and 7 more c's than b's, and you find the same situation at another point in the string, then you know that the substring between those two points contains an equal number of a's, b's and c's.
Let's walk through an example with the input "dabdacbdcd":
a,b,c
-> 0,0,0
d -> 0,0,0
a -> 1,0,0
b -> 1,1,0
d -> 1,1,0
a -> 2,1,0
c -> 2,1,1 -> 1,0,0
b -> 1,1,0
d -> 1,1,0
c -> 1,1,1 -> 0,0,0
d -> 0,0,0
Because we're only interested in the difference between the number of a's, b's and c's, not the actual number, we reduce a state like 2,1,1 to 1,0,0 by subtracting the lowest number from all three numbers.
We end up with a dictionary of these states, and the number of times they occur:
0,0,0 -> 4
1,0,0 -> 2
1,1,0 -> 4
2,1,0 -> 1
States which occur only once don't indicate an abc-equal substring, so we can discard them; we're then left with these repetitions of states:
4, 2, 4
If a state occurs twice, there is 1 abc-equal substring between those two locations. If a state occurs 4 times, there are 6 abc-equal substrings between them; e.g. the state 1,1,0 occurs at these points:
dab|d|acb|d|cd
Every substring between 2 of those 4 points is abc-equal:
d, dacb, dacbd, acb, acbd, d
In general, if a state occurs n times, it represents 1 + 2 + 3 + ... + (n-1) abc-equal substrings (or, easier to calculate: (n-1) × n / 2). If we calculate this for every count in the dictionary, the total is our solution:
4 -> 3 x 2 = 6
2 -> 1 x 1 = 1
4 -> 3 x 2 = 6
--
13
Let's check the result by finding what those 13 substrings are:
1 d---------
2 dabdacbdc-
3 dabdacbdcd
4 -abdacbdc-
5 -abdacbdcd
6 --bdac----
7 ---d------
8 ---dacb---
9 ---dacbd--
10 ----acb---
11 ----acbd--
12 -------d--
13 ---------d
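The single-pass approach walked through above can be sketched in a few lines. This is a minimal illustration, with the hypothetical function name `count_abc_equal_substrings`; instead of subtracting the lowest count, it stores the equivalent pair (a minus b, c minus b) as the state, which is an implementation choice of this sketch.

```python
from collections import Counter


def count_abc_equal_substrings(text):
    """Count substrings with equal numbers of 'a', 'b' and 'c' in O(N)."""
    a = b = c = 0
    # The state before any character is read counts as one occurrence.
    seen = Counter({(0, 0): 1})
    for ch in text:
        if ch == "a":
            a += 1
        elif ch == "b":
            b += 1
        elif ch == "c":
            c += 1
        # Only the differences matter, so (a - b, c - b) identifies the state.
        seen[(a - b, c - b)] += 1
    # A state seen n times yields n * (n - 1) / 2 pairs of positions.
    return sum(n * (n - 1) // 2 for n in seen.values())


print(count_abc_equal_substrings("dabdacbdcd"))  # 13, as in the walk-through
```

As in the walk-through, characters other than a, b and c leave the state unchanged, so a substring such as "d" counts as abc-equal.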

Segment tree data position to tree position relation

I wonder if there is any relation between data_array data position to tree_array data position.
int data[N];
int tree[M]; // lets M = 2^X-1, where X = nearest ceiling power of 2 to N;
void build_segment_tree();
Can I say that the n-th value of data[] maps to the i-th value of tree[]? Is there a mathematical relation?
You certainly can. A segment tree is used for its capability to store information about segments (ranges) of the array.
If you build a segment tree over N elements, you need ceil(log_2(N))+1 levels, and the last level holds all the ranges of length 1, i.e. the single elements.
These elements sit precisely at positions (1-indexed) 2^ceil(log_2(N)) to 2^ceil(log_2(N))+N-1.
               [1-8]
            /         \
       [1-4]           [5-8]
      /     \         /     \
  [1-2]   [3-4]   [5-6]   [7-8]
   / \     / \     / \     / \
 [1] [2] [3] [4] [5] [6] [7] [8]
                   1-11
               /          \
           1-6              7-11
         /     \          /      \
      1-3      4-6      7-9      10-11
     /   \    /   \    /   \     /   \
   1-2    3  4-5   6  7-8   9   10    11
   / \       / \      / \
  1   2     4   5    7   8
This answer is only valid for a segment tree over a power-of-2 number of elements.
For other sizes the leaves do not all end up on the same level, so they are not laid out contiguously.
The formula above therefore fails when N is not a power of 2.
In that case there is no simple closed-form rule.
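The difference between the two cases can be checked with a short sketch. It assumes the usual recursive layout (root at index 1, children of node i at 2i and 2i+1, ranges split at the midpoint), which matches the two diagrams above; the function name `leaf_positions` is hypothetical.

```python
def leaf_positions(n):
    """Map each 1-indexed data position to its index in the tree array."""
    positions = {}

    def build(node, lo, hi):
        if lo == hi:  # a range of length 1 is a leaf
            positions[lo] = node
            return
        mid = (lo + hi) // 2
        build(2 * node, lo, mid)           # left child covers [lo, mid]
        build(2 * node + 1, mid + 1, hi)   # right child covers [mid+1, hi]

    build(1, 1, n)
    return positions


print(leaf_positions(8))   # contiguous: position i maps to 8 + i - 1
print(leaf_positions(11))  # leaves are spread over two levels
```

For N = 8 the leaves occupy indices 8 to 15 in order, while for N = 11 element 3 lands at index 9 and element 1 at index 16, so no single offset describes the mapping.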

Illustrate the left-most derivation on a token stream

I am trying to understand the left-most derivation in the context of LL parsing algorithm. This link explains it from the generative perspective. i.e. It shows how to follow left-most derivation to generate a specific token sequence from a set of rules.
But I am thinking about the opposite direction. Given a token stream and a set of grammar rules, how to find the proper steps to apply a set of rules by the left-most derivation?
Let's continue to use the following grammar from the aforementioned link:
And the given token sequence is: 1 2 3
One way is this:
1 2 3
-> D D D
-> N D D (rewrite the *left-most* D to N according to the rule N->D.)
-> N D (rewrite the *left-most* N D to N according to the rule N->N D.)
-> N (same as above.)
But there are other ways to apply the grammar rules:
1 2 3 -> D D D -> N D D -> N N D -> N N N
OR
1 2 3 -> D D D -> N D D -> N N D -> N N
But only the first derivation ends up in a single non-terminal.
As the token sequence length increase, there can be many more ways. I think to infer a proper deriving steps, 2 prerequisites are needed:
a starting/root rule
the token sequence
After giving these 2, what's the algorithm to find the deriving steps? Do we have to make the final result a single non-terminal?
The general process of LL parsing consists of repeatedly:
Predict the production for the top grammar symbol on the stack, if that symbol is a non-terminal, and replace that symbol with the right-hand side of the production.
Match the top grammar symbol on the stack with the next input symbol, discarding both of them.
The match action is unproblematic but the prediction might require an oracle. However, for the purposes of this explanation, the mechanism by which the prediction is made is irrelevant, provided that it works. For example, it might be that for some small integer k, every possible sequence of k input symbols is only consistent with at most one possible production, in which case you could use a look-up table. In that case, we say that the grammar is LL(k). But you could use any mechanism, including magic. It is only necessary that the prediction always be accurate.
At any step in this algorithm, the partially-derived string is the consumed input followed by the stack. Initially there is no consumed input and the stack consists solely of the start symbol, so the partially-derived string is just the start symbol (which has had 0 derivations applied). Since the consumed input consists solely of terminals and the algorithm only ever modifies the top (first) element of the stack, it is clear that the series of partially-derived strings constitutes a leftmost derivation.
If the parse is successful, the entire input will be consumed and the stack will be empty, so the parse results in a leftmost derivation of the input from the start symbol.
Here's the complete parse for your example:
Consumed            Unconsumed  Partial     Production
input     Stack     input       derivation  or other action
--------  -----     ----------  ----------  ---------------
          N         1 2 3       N           N → N D
          N D       1 2 3       N D         N → N D
          N D D     1 2 3       N D D       N → D
          D D D     1 2 3       D D D       D → 1
          1 D D     1 2 3       1 D D       -- match --
1         D D       2 3         1 D D       D → 2
1         2 D       2 3         1 2 D       -- match --
1 2       D         3           1 2 D       D → 3
1 2       3         3           1 2 3       -- match --
1 2 3     --        --          1 2 3       -- success --
If you read the last two columns, you can see the derivation process starting from N and ending with 1 2 3. In this example, the prediction can only be made using magic because the rule N → N D is not LL(k) for any k; using the right-recursive rule N → D N instead would allow an LL(2) decision procedure (for example, "use N → D N if there are at least two unconsumed input tokens; otherwise N → D").
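The predict/match loop with the right-recursive variant can be sketched as below. This is an illustration under assumptions: the grammar is taken to be N → D N | D with D deriving a single digit (the original grammar listing is not reproduced here), the prediction rule is the LL(2) one quoted above, and the function name `ll_parse` is hypothetical.

```python
def ll_parse(tokens):
    """Return the leftmost derivation of tokens as a list of sentential forms."""
    stack = ["N"]      # index 0 is the top of the stack
    consumed = []
    pos = 0
    steps = ["N"]
    while stack:
        top = stack.pop(0)
        remaining = len(tokens) - pos
        if top == "N":
            if remaining == 0:
                raise SyntaxError("unexpected end of input")
            # Predict: N -> D N if at least two tokens remain, else N -> D.
            stack[:0] = ["D", "N"] if remaining >= 2 else ["D"]
        elif top == "D":
            tok = tokens[pos] if remaining else ""
            if len(tok) != 1 or not tok.isdigit():
                raise SyntaxError(f"expected a digit at position {pos}")
            stack.insert(0, tok)  # predict D -> <that digit>
        else:
            # Match: a terminal on the stack against the next input token.
            if remaining == 0 or tokens[pos] != top:
                raise SyntaxError(f"expected {top!r} at position {pos}")
            consumed.append(top)
            pos += 1
            continue  # a match does not change the sentential form
        steps.append(" ".join(consumed + stack))
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return steps


for form in ll_parse(["1", "2", "3"]):
    print(form)
```

Each recorded form is the consumed input followed by the stack, so the printed sequence N, D N, 1 N, 1 D N, 1 2 N, 1 2 D, 1 2 3 is a leftmost derivation under the assumed grammar.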
The chart you are trying to produce, which starts with 1 2 3 and ends with N, is a bottom-up parse. Bottom-up parses using the LR algorithm correspond to rightmost derivations, but the derivation needs to be read backwards, since it ends with the start symbol.
