Canonical Huffman encoding algorithm


Hello, I am trying to implement canonical Huffman encoding, but I don't understand the Wikipedia article or the guides I found on Google.
I need a more abstract explanation.
I tried this:
1. Get the list of codes and code lengths from regular Huffman encoding, like this:
A - code: 110, length: 3.
B - code: 111, length: 3.
C - code: 10, length 2.
D - code: 01, length 2.
E - code: 00, length 2.
Then I sort the table by length and then by symbol, like this:
C - code: 10, length 2.
D - code: 01, length 2.
E - code: 00, length 2.
A - code: 110, length: 3.
B - code: 111, length: 3.
Now I don't know how to proceed.
Thanks a lot.

Throw out the codes you get from the Huffman algorithm. You don't need those. Just keep the lengths.
Now assign the codes based on the lengths and the symbols. Sort by length, from shortest to longest, and within each length, sort the symbols in ascending order. (How you do that exactly doesn't matter, so long as every symbol is strictly less than or greater than any other symbol, and the encoder and decoder agree on how to do it.)
So we do the ordering:
C - 2
D - 2
E - 2
A - 3
B - 3
Twos come before threes, and within the 2s, C, D, E are in order, and within the 3s, A, B are in order.
Now we assign the code in integer order within each length, adding a zero bit at the end each time we go up a length:
C - 2 - 00
D - 2 - 01
E - 2 - 10
A - 3 - 110 <- after incrementing to 11, a zero was added to make 110
B - 3 - 111
That is a canonical code.
You could do it other ways if you like and still be canonical, e.g. counting backwards from 11, so long as the encoder and decoder agree on the approach. The whole point is to only have to transmit the lengths for each symbol from the encoder to the decoder, so as to not have to transmit the codes themselves which take more space.
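The assignment described above can be sketched in a few lines of Python. This is only a minimal sketch of the counting-up variant; the function name `canonical_codes` and the dictionary input format are assumptions made for illustration.

```python
def canonical_codes(lengths):
    """Assign canonical codes from a {symbol: code length} mapping."""
    if not lengths:
        return {}
    # Sort by length first, then by symbol, as described above.
    ordered = sorted(lengths.items(), key=lambda item: (item[1], item[0]))
    codes = {}
    code = 0
    prev_len = ordered[0][1]
    for symbol, length in ordered:
        # Going up a length appends zero bits to the running code.
        code <<= length - prev_len
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


print(canonical_codes({"A": 3, "B": 3, "C": 2, "D": 2, "E": 2}))
```

The decoder can run the same function on the transmitted lengths and arrive at the same table, which is the point of the canonical form.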

You should sort the symbols by their frequency, so the most frequent is at the top and the least frequent is at the bottom. (The frequencies sum to 1.):
A (0.5)
B (0.2)
C (0.15)
D (0.15)
Then mark one of the two least frequent symbols with 0 and the other with 1, sum their frequencies, insert the combined entry into its proper position in the list, and again mark the two least frequent entries with 0 and 1:
A (0.5)        A (0.5)
B (0.2)        C&D (0.3) 0
C (0.15) 0     B (0.2)   1
D (0.15) 1
And again...
A (0.5)        A (0.5)         A (0.5)     0
B (0.2)        C&D (0.3) 0     B&C&D (0.5) 1
C (0.15) 0     B (0.2)   1
D (0.15) 1
Until you obtain the last pair.
The path of 0s and 1s from the tail back to a symbol is that symbol's Huffman code:
A 0
B 11
C 100
D 101
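For a canonical code only the lengths from this procedure matter. A minimal sketch of the merging step that returns just the code lengths is shown here; the function name `huffman_lengths` is hypothetical, and the frequencies are scaled to integers (50, 20, 15, 15) as an assumption to avoid floating-point ties.

```python
import heapq
from itertools import count


def huffman_lengths(weights):
    """Return {symbol: code length} for a {symbol: weight} mapping."""
    if len(weights) == 1:
        # A lone symbol still needs one bit.
        return {symbol: 1 for symbol in weights}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        # Merge the two least frequent entries.
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:
            lengths[s] += 1  # every symbol in the merged group gains one bit
        heapq.heappush(heap, (w1 + w2, next(tiebreak), group1 + group2))
    return lengths


print(huffman_lengths({"A": 50, "B": 20, "C": 15, "D": 15}))
```

The lengths match the codes listed above: one bit for A, two for B, three each for C and D.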

Related

Illustrate the left-most derivation on a token stream

I am trying to understand the left-most derivation in the context of the LL parsing algorithm. This link explains it from the generative perspective. That is, it shows how to follow the left-most derivation to generate a specific token sequence from a set of rules.
But I am thinking about the opposite direction. Given a token stream and a set of grammar rules, how do I find the steps of the left-most derivation that apply those rules?
Let's continue to use the following grammar from the aforementioned link:
And the given token sequence is: 1 2 3
One way is this:
1 2 3
-> D D D
-> N D D (rewrite the *left-most* D to N according to the rule N->D.)
-> N D (rewrite the *left-most* N D to N according to the rule N->N D.)
-> N (same as above.)
But there are other ways to apply the grammar rules:
1 2 3 -> D D D -> N D D -> N N D -> N N N
OR
1 2 3 -> D D D -> N D D -> N N D -> N N
But only the first derivation ends up in a single non-terminal.
As the token sequence gets longer, there can be many more ways. I think two prerequisites are needed to infer the proper derivation steps:
a starting/root rule
the token sequence
Given these two, what is the algorithm to find the derivation steps? Do we have to make the final result a single non-terminal?

The general process of LL parsing consists of repeatedly:
Predict the production for the top grammar symbol on the stack, if that symbol is a non-terminal, and replace that symbol with the right-hand side of the production.
Match the top grammar symbol on the stack with the next input symbol, discarding both of them.
The match action is unproblematic but the prediction might require an oracle. However, for the purposes of this explanation, the mechanism by which the prediction is made is irrelevant, provided that it works. For example, it might be that for some small integer k, every possible sequence of k input symbols is only consistent with at most one possible production, in which case you could use a look-up table. In that case, we say that the grammar is LL(k). But you could use any mechanism, including magic. It is only necessary that the prediction always be accurate.
At any step in this algorithm, the partially-derived string is the consumed input followed by the stack. Initially there is no consumed input and the stack consists solely of the start symbol, so the partially-derived string is just the start symbol (which has had 0 derivations applied). Since the consumed input consists solely of terminals and the algorithm only ever modifies the top (first) element of the stack, it is clear that the series of partially-derived strings constitutes a leftmost derivation.
If the parse is successful, the entire input will be consumed and the stack will be empty, so the parse results in a leftmost derivation of the input from the start symbol.
Here's the complete parse for your example:
Consumed          Unconsumed  Partial     Production
input     Stack   input       derivation  or other action
--------  ------  ----------  ----------  ---------------
          N       1 2 3       N           N → N D
          N D     1 2 3       N D         N → N D
          N D D   1 2 3       N D D       N → D
          D D D   1 2 3       D D D       D → 1
          1 D D   1 2 3       1 D D       -- match --
1         D D     2 3         1 D D       D → 2
1         2 D     2 3         1 2 D       -- match --
1 2       D       3           1 2 D       D → 3
1 2       3       3           1 2 3       -- match --
1 2 3     --      --          1 2 3       -- success --
If you read the last two columns, you can see the derivation process starting from N and ending with 1 2 3. In this example, the prediction can only be made using magic because the rule N → N D is not LL(k) for any k; using the right-recursive rule N → D N instead would allow an LL(2) decision procedure (for example, "use N → D N if there are at least two unconsumed input tokens; otherwise N → D").
The chart you are trying to produce, which starts with 1 2 3 and ends with N is a bottom-up parse. Bottom-up parses using the LR algorithm correspond to rightmost derivations, but the derivation needs to be read backwards, since it ends with the start symbol.
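To make the predict/match loop concrete, here is a minimal sketch that uses the right-recursive grammar N → D N | D mentioned above, with the two-token lookahead rule as the predictor. The function name `leftmost_derivation` is hypothetical, and the sketch assumes every input token is a valid D (and is not itself named "N" or "D").

```python
def leftmost_derivation(tokens):
    """Trace an LL parse for N -> D N | D and return each sentential form."""
    stack = ["N"]   # stack[0] is the top of the stack
    consumed = []
    pos = 0
    steps = ["N"]
    while stack:
        top = stack[0]
        remaining = len(tokens) - pos
        if remaining == 0:
            raise ValueError("input ended while the stack was not empty")
        if top == "N":
            # Predict: N -> D N with two or more tokens left, else N -> D.
            stack[0:1] = ["D", "N"] if remaining >= 2 else ["D"]
        elif top == "D":
            # Predict: D -> the next input token.
            stack[0:1] = [tokens[pos]]
        else:
            # Match: the terminal on top must equal the next input token.
            if top != tokens[pos]:
                raise ValueError("unexpected token: " + tokens[pos])
            consumed.append(stack.pop(0))
            pos += 1
            continue  # a match does not add a derivation step
        steps.append(" ".join(consumed + stack))
    return steps


for form in leftmost_derivation(["1", "2", "3"]):
    print(form)
```

Each printed line is the consumed input followed by the stack, so the output reads as a leftmost derivation from N down to 1 2 3.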

Generating permutations for all possible lengths [duplicate]

I'd like to know a possible algorithm to calculate all possible combinations, without repetitions, starting from length=1 until length=N of N elements.
Example:
Elements: 1, 2, 3.
Output:
1
2
3
12
13
23
123

Look at the binary representation of the numbers 0 to 2^n - 1.
n = 3
i  Binary (CBA)  Combination
0  000
1  001           A
2  010           B
3  011           A B
4  100           C
5  101           A C
6  110           B C
7  111           A B C
So you just have to enumerate the numbers 1 to 2^n - 1 and look at the binary representation to know which elements to include. If you want them ordered by the number of elements, sort them afterwards or generate the numbers in that order (there are several examples on SO).
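A minimal sketch of this enumeration, including the sort by size afterwards, might look as follows. The function name `combinations_by_bitmask` is an assumption for illustration.

```python
def combinations_by_bitmask(elements):
    """Return every non-empty combination of elements, shortest first."""
    n = len(elements)
    result = []
    for mask in range(1, 1 << n):
        # Bit i of mask decides whether elements[i] is included.
        result.append([elements[i] for i in range(n) if mask & (1 << i)])
    # A stable sort by size keeps the enumeration order within each size.
    result.sort(key=len)
    return result


for combo in combinations_by_bitmask([1, 2, 3]):
    print("".join(str(x) for x in combo))
```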

Converting a number into a special base system

I want to convert a number in base 10 into a special base form like this:
A*2^2 + B*3^1 + C*2^0
A can take on values of [0,1]
B can take on values of [0,1,2]
C can take on values of [0,1]
For example, the number 8 would be
1*2^2 + 1*3 + 1.
It is guaranteed that the given number can be converted to this specialized base system.
I know how to convert from this base system back to base-10, but I do not know how to convert from base-10 to this specialized base system.

In short, treat every base number (2^2, 3^1, 2^0 in your example) as the weight of an item and the whole number as the capacity of a bag. The problem is then to find a combination of these items that fills the bag exactly.
First of all, this problem is NP-complete. It is identical to the subset sum problem, which can also be seen as a variant of the knapsack problem.
Despite this, the problem can be solved by a pseudo-polynomial time algorithm using dynamic programming in O(nW) time, where n is the number of bases and W is the number to decompose. The details can be found on this Wikipedia page: http://en.wikipedia.org/wiki/Knapsack_problem#Dynamic_programming and this SO page: What's it called when I want to choose items to fill container as full as possible - and what algorithm should I use?.

Simplifying your "special base":
X = A * 4 + B * 3 + C
A ∈ {0,1}
B ∈ {0,1,2}
C ∈ {0,1}
Obviously the largest number that can be represented is 4 + 2 * 3 + 1 = 11
To figure out how to get the values of A, B, C you can do one of two things:
There are only 12 possible inputs: create a lookup table. Ugly, but quick.
Use some algorithm. A bit trickier.
Let's look at (1) first:
A B C X
0 0 0 0
0 0 1 1
0 1 0 3
0 1 1 4
0 2 0 6
0 2 1 7
1 0 0 4
1 0 1 5
1 1 0 7
1 1 1 8
1 2 0 10
1 2 1 11
Notice that 2 and 9 cannot be expressed in this system, while 4 and 7 occur twice. The fact that you have multiple possible solutions for a given input is a hint that there isn't a really robust algorithm (other than a look up table) to achieve what you want. So your table might look like this:
/* Index is X (0..11); -1 marks values with no representation. */
int A[] = {0,0,-1,0,0,1,0,1,1,-1,1,1};
int B[] = {0,0,-1,1,1,0,2,1,1,-1,2,2};
int C[] = {0,1,-1,0,1,1,0,0,1,-1,0,1};
Then look up A, B, C. If A < 0, there is no solution.
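Rather than typing the table by hand, it can be generated by enumerating every (A, B, C). The following is a minimal sketch; the function name `build_table` is hypothetical, and keeping the first representation found for ambiguous values such as 4 and 7 is an arbitrary choice.

```python
from itertools import product


def build_table():
    """Map each representable X to one (A, B, C) with X = 4*A + 3*B + C."""
    table = {}
    for a, b, c in product(range(2), range(3), range(2)):
        # setdefault keeps the first representation for ambiguous values.
        table.setdefault(4 * a + 3 * b + c, (a, b, c))
    return table


table = build_table()
print(table.get(8))  # a representation of 8
print(table.get(9))  # None, because 9 cannot be expressed
```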

Algorithm in hardware to find out if number is divisible by five

I am trying to think of an algorithm to implement this for a given n-bit binary number. I tried many examples but was unable to find any pattern. How should I proceed?

How about this:
Convert the number to base 4 (this is trivial by simply combining pairs of bits). 5 in base 4 is 11. The values base 4 that are divisible by 11 are somewhat familiar: 11, 22, 33, 110, 121, 132, 203, ...
The rule for divisibility by 11 is that you add all the odd digits and all the even digits and subtract one sum from the other. If the result is divisible by 11 (which, remember, is 5), then the original number is divisible by 11 as well.
For example:
123456d = 1 1110 0010 0100 0000b = 132021000_4
The even digits are 1 2 2 0 0 : sum = 5d
The odd digits are 3 0 1 0 : sum = 4d
Difference is 1, which is not divisible by 5
Or another one:
123455d = 1 1110 0010 0011 1111b = 132020333_4
The even digits are 1 2 2 3 3 : sum = 11d
The odd digits are 3 0 0 3 : sum = 6d
Difference is 5, which is divisible by 5
This should have a fairly efficient HW implementation because it's mostly bit-slicing, followed by N/2 adders, where N is the number of bits in the number you're interested in.
Note that after adding the digits and subtracting, the maximum value is 3/4 * N, so if you have 16-bit numbers max, you can get at most 12 as a result, so you only need to check for 0, ±5 and ±10 explicitly. If you're using 32-bit numbers then you can get at most 24 as a result, so you need to also check if the result is ±15 or ±20.
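A software model of this check can help when verifying a hardware design against it. The following is a minimal sketch; the function name `divisible_by_5` is hypothetical, digit positions are counted from the least significant end (which only flips the sign of the difference), and the final `% 5` stands in for the explicit comparison against 0, ±5, ±10 and so on.

```python
def divisible_by_5(n):
    """Check divisibility by 5 using alternating base-4 digit sums."""
    even_sum = 0
    odd_sum = 0
    position = 0
    while n:
        digit = n & 0b11  # one base-4 digit is a pair of bits
        if position % 2 == 0:
            even_sum += digit
        else:
            odd_sum += digit
        n >>= 2
        position += 1
    # Hardware would compare the small difference against fixed constants.
    return (even_sum - odd_sum) % 5 == 0


print(divisible_by_5(123455), divisible_by_5(123456))
```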

Make a Deterministic Finite Automaton (DFA) to implement the divisibility check and implement the DFA in hardware.
Creating a DFA for divisibility by 5 is easy. You just need to track the remainder r and check what 2r (mod 5) and 2r + 1 (mod 5) map to. There are many websites that discuss this. For example this one.
There are well-known examples to convert DFA to a hardware representation as well.

Well, I just figured it out...
number mod 5 = (a0 * 2^0 + a1 * 2^1 + a2 * 2^2 + a3 * 2^3 + a4 * 2^4 + ...) mod 5
= (a0 (1) + a1 (2) + a2 (-1) + a3 (-2) + a4 (1) + ...) mod 5, where the weights 1, 2, -1, -2 repeat
Hence the number is divisible by 5 exactly when (difference of odd digits) + 2 times (difference of even digits) is divisible by 5, where the digits are counted from 1 at the least significant bit and each difference is an alternating sum.
For example, consider 110010:
odd digits difference = 0 - 0 + 1 = 1, or 01
even digits difference = 1 - 0 + 1 = 2, or 10
difference of odd digits + 2 times difference of even digits = 01 + 2 * (10) = 01 + 100 = 101, which is divisible by 5.
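The repeating weights can be checked with a short script. This is a minimal sketch; the names `WEIGHTS` and `weighted_bit_sum` are assumptions, and bit positions are counted from 0 at the least significant bit.

```python
# Residues of 2^0, 2^1, 2^2, 2^3 modulo 5, written as 1, 2, -1, -2.
WEIGHTS = (1, 2, -1, -2)


def weighted_bit_sum(n):
    """Sum the weights of the set bits; the result is congruent to n mod 5."""
    total = 0
    position = 0
    while n:
        if n & 1:
            total += WEIGHTS[position % 4]
        n >>= 1
        position += 1
    return total


print(weighted_bit_sum(0b110010))
```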

The contribution of each bit toward divisibility by five follows a repeating four-bit pattern, 3421 (the values of 8, 4, 2 and 1 modulo 5).
You can shift through any binary number 4 bits at a time, adding the corresponding pattern value for each set bit.
Example:
100011
take 0011
apply the pattern 0021
sum 3
next four bits 0010
apply the pattern 0020
sum = 5

We can design a Deterministic Finite Automaton (DFA) for the same. The DFA, then can be implemented in Hardware. This is similar to this answer.
We will simulate a Deterministic Finite Automaton (DFA) that accepts Binary Representation of Integers which are divisible by 5
Now, by accept, we mean that when we are done scanning the string, we should be in a final state.
Approach to Design DFA : Essentially, we need to divide the Binary Representation of Integer by 5, and track the remainder. If after consuming/scanning [From Left to Right] the entire string, remainder is Zero, then we should end up in Final State, and if remainder isn't zero we should be in Non-Final States.
Now, DFA is defined by Quintuple/5-Tuple (Q,q₀,F,Σ,δ). We will obtain these five components step-by-step.
Q : Finite Set of States
We need to track remainder. On dividing any integer by 5, we can get remainder as 0,1, 2, 3 or 4. Hence, we will have Five States Z, O, T, Th and F for each possible remainder.
Q={Z, O, T, Th, F}
If after scanning certain part of Binary String, we are in state Z, this means that integer defined from Left to this part will give remainder Zero when divided by 5. Similarly, O for remainder One, and so on.
Now, we can write these five states by the Euclidean Division Algorithm as
Z : 5m
O : 5m+1
T : 5m+2
Th : 5m+3
F : 5m+4
where m is Integer.
q₀ : an initial/start state from set Q
Now, start state can be thought in terms of empty string (ɛ). An ɛ directly gets into q₀.
What remainder does ɛ give when divided by 5?
We can prepend as many 0s as we like to the left-hand side of a binary number. In a similar fashion, we can prepend ɛ to the left-hand side of a binary string. Thus, ɛ on the left can be thought of as 0. And 0 when divided by 5 gives remainder 0. Hence, ɛ should end in state Z. But ɛ ends up in q₀.
Thus, q₀=Z
F : a set of accept states
Now we want all strings which are divisible by 5, or which gives remainder 0 when divided by 5, or which after complete scanning should end up in state Z, and gets accepted.
Hence,
F={Z}
Σ : Alphabet (a finite set of input symbols)
Since we are scanning/reading a binary string,
Σ={0,1}
δ : Transition Function (δ : Q × Σ → Q)
Now this δ tells us that if we are in state x (in Q) and next input to be scanned is y (in Σ), then at which state z (in Q) should we go.
If the string upto this point gives remainder 3/Th when divided by 5, and if we append 1 to string, then what remainder will resultant string give.
Now, this can be analyzed by observing how magnitude of a binary string changes on appending 0 and 1.
a.
In Decimal (Base-10), if we add/append 0, then magnitude gets multiplied by 10. For example, 53 becomes 530 on appending 0.
Also, if we append 8 to a decimal number, then the magnitude gets multiplied by 10, and then we add 8 to the multiplied magnitude.
b.
In Binary (Base-2), if we add/append 0, then magnitude gets multiplied by 2 (The Positional Weight of each Bit get multiplied by 2)
Example : (1010)2 [which is (10)10], on appending 0 it becomes (10100)2 [which is (20)10]
Similarly, In Binary, if we append 1, then Magnitude gets multiplied by 2, and then we add 1.
Example : (10)2 [which is (2)10], on appending 1 it becomes (101)2 [which is (5)10]
Thus, we can say that for Binary String x,
x0=2|x|
x1=2|x|+1
We will use these relations to analyze the five states
Any string in Z can be written as 5m
- On 0, it becomes 2(5m), which is 5(2m), nothing but state Z.
- On 1, it becomes 2(5m)+1, which is 5(2m)+1, that is O. [This can be read as if a Binary String is presently divisible by 5, and we append 1, then resultant string will give remainder as 1]
Any string in O can be written as 5m+1
- On 0, it becomes 2(5m+1) = 10m+2, which is 5(2m)+2, state T.
- On 1, it becomes 2(5m+1)+1 = 10m+3, which is 5(2m)+3, that is state Th.
Any string in T can be written as 5m+2
- On 0, it becomes 2(5m+2) = 10m+4, which is 5(2m)+4, state F.
- On 1, it becomes 2(5m+2)+1 = 10m+5, which is 5(2m+1), state Z. [If m is integer, so is (2m+1)]
Any string in Th can be written as 5m+3
- On 0, it becomes 2(5m+3) = 10m+6, which is 5(2m+1)+1, state O.
- On 1, it becomes 2(5m+3)+1 = 10m+7, which is 5(2m+1)+2, that is state T.
Any string in F can be written as 5m+4
- On 0, it becomes 2(5m+4) = 10m+8, which is 5(2m+1)+3, state Th.
- On 1, it becomes 2(5m+4)+1 = 10m+9, which is 5(2m+1)+4, that is state F.
Hence, the final DFA combines all of the transitions above (the original diagram was created using a tool).
We can even write code [in High Level Language] for the same. But it would go beyond main aim of this question. If readers wish to see the same, they can check here.
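For readers who want something to experiment with, the transitions derived above can be written as a lookup table and simulated. This is a minimal sketch rather than the code the answer links to; the names `TRANSITIONS` and `accepts` are assumptions, while the state names follow the answer.

```python
# Transition function: TRANSITIONS[state][bit] gives the next state.
TRANSITIONS = {
    "Z":  {"0": "Z",  "1": "O"},
    "O":  {"0": "T",  "1": "Th"},
    "T":  {"0": "F",  "1": "Z"},
    "Th": {"0": "O",  "1": "T"},
    "F":  {"0": "Th", "1": "F"},
}


def accepts(bits):
    """Return True if the binary string (scanned left to right) is divisible by 5."""
    state = "Z"  # start state q0; the empty string counts as 0
    for bit in bits:
        state = TRANSITIONS[state][bit]
    return state == "Z"  # Z is the only accept state


print(accepts("1010"), accepts("1011"))
```

In hardware the same table becomes a three-bit state register plus combinational next-state logic.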

As any assignment this would have been an answer for is bound to be way overdue a year later:
in the binary representation of a natural divisible by five the parities of bits 4n and 4n+2 equal, as well as those for bits 4n+1 and 4n+3.
(This is entirely equivalent to the answers of JoshG79, notsogeek, or james: 4≡-1(mod 5), 3≡-2(mod 5) (with reduced hand-waving about recursion in argumentation, and no dispensable handling of carries in circuitry))

Calculate unique integer identifier for elements in an hierarchical structure based on its position

We have a hierarchical structure like this :
- 1 ( identifier = 100 )
  - 1 (101)
  - 2 (102)
  - 3 (103)
    - 1 (1031)
    - 2 (1032)
    - 3 (1033)
  - 4 (104)
    - 1 (1041)
    - 2 (1042)
    - 3 (1043)
  - 901 (1001)
- 2 (200)
  - 1 (201)
  - 2 (202)
- 10 (1000)
  - 1 (1001)
Required characteristics:
the identifier of each node should be unique.
the identifiers should increment according to the level of the element.
the identifiers should be of Integer type.
the counter of each element resets at each new level/parent element.
As you can see in the example of elements 1.901 and 10.1, the current implementation doesn't work.
We've tried the following solutions:
multiplying each level by a number.
multiplying only the first level by a number and adding each child.
It becomes much easier if the identifier is a String; in that case we can use the scheme "level1.level2.level3...", so for 1 -> 1 it will be "1.1" and so on. But this is the least wanted option.
So, could you please suggest an algorithm that can be used here to generate the required identifiers?
Update: Fixed the example. P.S. I knew that it's wrong.

You're too hung up on decimal numbers.
Choose a value X which represents the maximum number of child nodes within a node, or any number greater than that that you find convenient to work with.
Drop your adherence to decimal numbers, and represent all identifiers as integers represented in base X.
Encode an identifier as an integer in base X where the first digit represents the top-level of the node tree, the second digit represents the second level, and so forth.
So, if you were lucky and a reasonable value for X turned out to be 16 you could use hexadecimal representations of integers. If 36 is a good value, use any alphanumeric character as a digit.
EDIT
As Rafael has pointed out this approach breaks down if it is not possible to define an upper limit on the number of children a node can have. In my experience this is unlikely to be a serious problem in practice.
If the value of X is large, say 863 then I'd suggest the obvious implementation would be to set X = 1000 and use groups of 3 decimal digits to represent each digit in base 1000. This way the identifier 12.245.1 would be represented as 12245001.
And now we're into territory already covered by alestanis' answer.
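The base-X encoding with X = 1000 can be sketched as follows. The function names `encode_path` and `decode_path` are hypothetical, and the sketch assumes child numbers start at 1 and stay below the base, so no digit is ever zero and decoding is unambiguous.

```python
def encode_path(path, base=1000):
    """Pack a node path such as [12, 245, 1] into one integer in the given base."""
    ident = 0
    for index in path:
        if not 0 < index < base:
            raise ValueError("child number out of range for this base")
        ident = ident * base + index
    return ident


def decode_path(ident, base=1000):
    """Recover the node path from an identifier produced by encode_path."""
    path = []
    while ident:
        ident, digit = divmod(ident, base)
        path.append(digit)
    return path[::-1]


print(encode_path([12, 245, 1]))
print(decode_path(12245001))
```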

First of all, your example is wrong (100 maps to identifier 10000), but fixing that does not make the scheme work. Here's why:
- 1 (100)
  - 1 (101)
  ...
  - 100 (200)
- 2 (200)
What you need to have a working example is indeed to multiply each level by a number N or, equivalently, reserve X digits for each level. In your case, you chose N = 100, so the extra constraint you need to make your example work is that each level cannot have more than N-1 children.
If you enforce this constraint, element 1.100 would be illegal, thus eliminating duplicate identifiers.
Using N = 100 (or X = 2) could yield:
- 1 (01)
  - 1 (0101)
    - 1 (010101)
    ...
    - 3 (010103)
  ...
  - 99 (0199)
  - 100 is ILLEGAL
- 2 (02)
- 42 (42)
  - 42 (4242)
The "problem" then would be to choose X wisely so that you minimize the number of needed digits without constraining your user. For instance, if you know you will never add more than 450 children, choose X=3.

This problem has a simple but memory-wise costly solution:
Each node position can easily be represented as a unique finite sequence of integers i(1), i(2), ..., i(n):
node 1 = [1]
node 1.1 = [1, 1]
node 7.2.42 = [7, 2, 42]
Thus the question can be represented as how to map each finite sequence of integers to a unique integer. That can be done using prime numbers
p(1) = 2
p(2) = 3
p(3) = 5
...
Just multiply the i(n):th powers of primes p(n). The uniqueness of the resulting integer is guaranteed by the uniqueness of prime number factorization. See Wikipedia:
http://en.wikipedia.org/wiki/Integer_factorization
Examples:
node 1 : 2^1 = 2
node 1.1 : 2^1 * 3^1 = 6
node 7.2.42 : 2^7 * 3^2 * 5^42 = (huge number!)
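A minimal sketch of the prime-power mapping is shown here. The function name `prime_power_id` is hypothetical, the primes are found by simple trial division, and the sketch assumes every index is at least 1 so that trailing levels are never lost.

```python
def prime_power_id(path):
    """Map a node path [i1, i2, ...] to 2**i1 * 3**i2 * 5**i3 * ..."""
    ident = 1
    primes = []
    candidate = 2
    for index in path:
        # Advance to the next prime by trial division against earlier primes.
        while any(candidate % p == 0 for p in primes):
            candidate += 1
        primes.append(candidate)
        ident *= candidate ** index
        candidate += 1
    return ident


print(prime_power_id([1]), prime_power_id([1, 1]))
```

Python integers are unbounded, so even node 7.2.42 can be computed, but the result is far too large for a fixed-width integer column.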
Another possible solution
Begin with the string-representation of the node (e.g. "7.2.42"). Use octal numbers for node numbering. Use '8' (the first unused digit for octals) to separate levels instead of '.'. Use the resulting string as a decimal integer.
Examples:
node 1 : 1
node 1.1 : 181
node 1.2 : 182
node 1.3 : 183
node 7.7.7 : 78787
node 8 : 10
node 8.1 : 1081
node 8.2 : 1082
node 8.9 : 10811
node 8.9.1: 1081181
node 8.9.2: 1081182
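The octal scheme is short enough to sketch in one function. The name `octal_id` is an assumption made for illustration.

```python
def octal_id(path):
    """Join the octal form of each level with '8' and read the result as a decimal integer."""
    return int("8".join(format(index, "o") for index in path))


print(octal_id([1, 3]), octal_id([8, 9, 1]))
```

Because octal digits never include 8, every 8 in the identifier is a level separator, so the path can be recovered by splitting on it.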
