Huffman code: variable bit length per character

Is there a way to find just the “number of bits” per character without drawing the Huffman tree?
In other words, is there a way to find the code length of a character from its “frequency” or “probability”?
Note: I want to use a “variable-length code”.
Please use the following sentence for the explanation:
“this is an example of a huffman tree”
For example, the Huffman code for “a” is 3 bits long.
The following site has the Huffman tree, Huffman codes and frequencies for this sentence:
http://en.wikipedia.org/wiki/Huffman_coding

You can calculate a rough zero-order entropy of each symbol in a set of symbols. Given the set of symbols a_i, where the probability of each symbol is p_i (so the sum of the p_i is one), then the entropy of a_i in bits is -log2(p_i), where log2 is a logarithm base two. The average entropy of a symbol in bits is the sum over i of -p_i log2(p_i).
This gives a rough estimate of what you would get from zero-order Huffman or arithmetic coding of those symbols. The estimate is a lower bound: neither Huffman nor arithmetic coding will quite reach it, because the probabilities are only estimates and, in Huffman's case, because code lengths are restricted to whole bits.
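As a quick check, here is a small Python sketch of that calculation on the sentence from the question (the script and variable names are mine; compare the printed values with the code lengths from the tree on the linked page):

    from collections import Counter
    from math import log2

    text = "this is an example of a huffman tree"
    counts = Counter(text)
    total = len(text)

    # Per-symbol information content -log2(p_i), an estimate of the code length.
    for symbol, count in counts.most_common():
        p = count / total
        print(f"{symbol!r}: p = {p:.3f}, -log2(p) = {-log2(p):.2f} bits")

    # Average entropy per symbol: sum over i of -p_i * log2(p_i).
    avg = -sum((c / total) * log2(c / total) for c in counts.values())
    print(f"average: {avg:.3f} bits/symbol")

The average Huffman code length is guaranteed to lie within one bit of this average entropy, but individual code lengths can deviate further, so the tree is still needed for the exact per-character lengths.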

Related

How can I compute the information (or entropy) contained in a string of bits?

Shannon's equation for entropy is not the solution; it just counts the 1s and 0s, so for the string:
010101...
I get maximum entropy, even though the string is highly ordered and hence contains only a small amount of information.
Auto-encoders seem helpful, but you can't tell how much information they "steal", in other words how much they over-fit. For example, if I train one on an N-bit string with an input length of N-1, it would just memorize every single bit, resulting in 0 information as output.
I know that the best a compression algorithm can do is reach maximum entropy, and the resulting information is then the length of the output string, but I don't think this generalizes well. Besides, what would such an algorithm be?
Does it have something to do with Shannon's entropy computed on n-tuples for n = 1...N ?
My best guess is to use the matrix of probabilities that an n-bit sub-string "i" will follow another ("j").
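For concreteness, something along these lines is what I have in mind: a rough sketch of the n-tuple idea, using overlapping blocks (the block sizes and the per-bit normalisation are arbitrary choices):

    import random
    from collections import Counter
    from math import log2

    def block_entropy(bits, n):
        """Shannon entropy per input bit, estimated from overlapping n-bit blocks."""
        blocks = [bits[i:i + n] for i in range(len(bits) - n + 1)]
        counts = Counter(blocks)
        total = len(blocks)
        h = -sum((c / total) * log2(c / total) for c in counts.values())
        return h / n

    s = "01" * 500
    for n in (1, 2, 4, 8):
        print(n, round(block_entropy(s, n), 4))   # drops towards 0 as n grows

    r = "".join(random.choice("01") for _ in range(1000))
    for n in (1, 2, 4, 8):
        print(n, round(block_entropy(r, n), 4))   # stays close to 1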
Can anybody help?

Nearest neighbor searches in non-metric spaces

I would like to know about nearest neighbor search algorithms for non-metric spaces. In particular, is there any variant of the k-d tree algorithm in this setting with provable time complexity, etc.?
Probably more of theoretic interest for you:
The PH-Tree is similar to a quadtree; however, it transforms floating-point coordinates into a non-metric system before storing them. The PH-Tree performs all queries (including kNN queries) on the non-metric data using a non-metric distance function (you can define your own distance functions on top of that).
In terms of kNN, the PH-Tree performs on par with trees like R+Trees and usually outperforms kd-trees.
The non-metric data storage appears to have little negative, possibly even positive, effect on performance, except maybe for the (almost negligible) execution time for the transformation and distance function.
The reason the data is transformed comes from an inherent constraint of the tree: the tree is a bit-wise trie, which means it can only store bit sequences (which can be seen as integer numbers). In order to store floating-point numbers in the tree, we simply take the IEEE bit representation of the floating-point number and interpret it as an integer (this works fine for positive numbers; negative numbers are a bit more complex). Crucially, this preserves the ordering, i.e. if a floating-point value f1 is larger than f2, then int(f1) is also always larger than int(f2). Trivially, this transformation allows storing floating-point numbers as integers without any loss of precision(!).
The transformation is non-metric because the leading bits (after the sign bit) of a floating-point number are the exponent bits, followed by the fraction bits. Clearly, if two numbers differ in their exponent bits, their distance grows exponentially faster (or slower for negative exponents) compared to distances caused by differences in the fraction bits.
Why did we use a bit-wise trie? If we have d dimensions, it allows an easy transformation such that we can map the n-th bit of each of the d values of a coordinate into a bit string with d bits. For example, for d=60, we get a 60-bit string. Assuming a CPU register width of 64 bits, this means we can perform many operations related to queries in constant time, i.e. many operations cost just one CPU operation, independent of whether we have 3 dimensions or 60 dimensions. It's probably hard to understand what's going on from this short text; more details can be found here.
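As an illustration of that transformation, here is a Python sketch of the idea (not PH-Tree code; the handling of negative numbers shown is one common variant):

    import struct

    def float_to_sortable_int(f):
        """Reinterpret an IEEE 754 double as a 64-bit unsigned integer whose
        ordering matches the numeric ordering of the original doubles."""
        bits = struct.unpack(">Q", struct.pack(">d", f))[0]
        if bits & (1 << 63):                     # negative: flip all bits
            return bits ^ 0xFFFFFFFFFFFFFFFF
        return bits | (1 << 63)                  # non-negative: set the top bit

    vals = [-2.5, -0.1, 0.0, 0.1, 2.5, 1e300]
    for v in vals:
        print(v, hex(float_to_sortable_int(v)))
    assert sorted(vals) == sorted(vals, key=float_to_sortable_int)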
NMSLIB provides a library for performing nearest neighbor search in non-metric spaces. That GitHub page lists a dozen papers to read, but not all of them apply to non-metric spaces.
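For completeness, a minimal usage sketch of NMSLIB's Python bindings (the method, space and parameters below are illustrative choices, not a recommendation; note that cosine distance itself already violates the triangle inequality, so even this simple space is non-metric):

    import numpy as np
    import nmslib   # pip install nmslib

    data = np.random.rand(1000, 16).astype(np.float32)

    # Build an HNSW index over cosine distance and run a 5-NN query.
    index = nmslib.init(method="hnsw", space="cosinesimil")
    index.addDataPointBatch(data)
    index.createIndex({"M": 16, "efConstruction": 100})

    ids, dists = index.knnQuery(data[0], k=5)
    print(ids, dists)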
Unfortunately, there are few theoretical results regarding the complexity of nearest neighbor search in non-metric spaces, and there are no comprehensive empirical evaluations.
I can only see some theoretical results in Effective Proximity Retrieval
by Ordering Permutations, but I am not convinced. Still, I suggest you take a look.
There seem to be few people, if any, who use k-d trees for non-metric spaces. VP-trees and similar structures are used instead. Densitrees are also used, as described in Near Neighbor Search in Nonmetric Spaces.
Intuitively, densitrees are a class of decorated trees that hold the points of the dataset in a way similar to the metric tree. The critical difference
lies in the nature of tree decoration; instead of having one or several real values reflecting some bounds on the triangular inequality attached to every tree node, each densitree node is associated to a particular classifier called here a density estimator.

In Huffman compression, I don't know why no code will be longer than 16 bits when all frequencies are scaled to fit within one byte

"Each code is a short integer because it can be proven that when all frequencies are scaled to fit within one byte, no code will be longer than 16 bits"
Does it mean that the depth of the Huffman tree is 16?
If it is true, how do I calculate the depth of the full binary tree?
If it isn't, what does it mean?
Your excerpt is not complete somehow. The depth also depends on the number of symbols you are coding. If, for example, you are coding 100,000 different symbols, each of which occurs just once (where 1 fits quite easily in a byte), then you will need more than 16 bits per symbol. The depth of that tree will be 17.
What they are referring to is a worst-case set of frequencies, one that maximizes the code length produced by the Huffman algorithm. That worst-case set is a variant of the Fibonacci sequence.
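To make both points concrete, here is a small sketch that computes the maximum Huffman code length from a list of frequencies (illustrative code, not the library being quoted):

    import heapq
    import itertools

    def max_code_length(freqs):
        """Depth of the deepest leaf in a Huffman tree built from freqs."""
        tiebreak = itertools.count()
        # Heap items: (weight, tie-breaker, depth of deepest leaf in subtree).
        heap = [(f, next(tiebreak), 0) for f in freqs]
        heapq.heapify(heap)
        while len(heap) > 1:
            w1, _, d1 = heapq.heappop(heap)
            w2, _, d2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tiebreak), max(d1, d2) + 1))
        return heap[0][2]

    # Fibonacci-like frequencies force a degenerate, chain-shaped tree:
    fib = [1, 1]
    while len(fib) < 18:
        fib.append(fib[-1] + fib[-2])
    print(max_code_length(fib))            # 17 for 18 symbols

    # 100,000 symbols, each occurring once: depth ceil(log2(100000)) = 17.
    print(max_code_length([1] * 100000))   # 17

The Fibonacci run shows the worst-case shape mentioned above; the second run reproduces the 100,000-symbol example, which also reaches depth 17.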

Computing entropy/disorder

Given an ordered sequence of a few thousand 32-bit integers, I would like to know how measures of their disorder or entropy are calculated.
What I would like is to be able to calculate a single value of the entropy for each of two such sequences and be able to compare their entropy values to determine which is more (dis)ordered.
I am asking here, as I think I may not be the first with this problem and would like to know of prior work.
Thanks in advance.
UPDATE #1
I have just found this answer that looks great, but would give the same entropy if the integers were sorted. It only gives a measure of the entropy of the individual ints in the list and disregards their (dis)order.
Entropy calculation generally:
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
Furthermore, one way to get the probabilities is to sort your integers and then iterate over the sorted list to count the frequency of each value. Afterwards, you can use the formula.
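A short sketch of that recipe (sorting plus a group-by; any other way of counting frequencies gives the same result):

    from itertools import groupby
    from math import log2

    def zero_order_entropy(values):
        """Shannon entropy in bits per value, from value frequencies alone.
        Note that it is invariant under reordering, which is exactly the
        limitation pointed out in the question's update."""
        data = sorted(values)
        n = len(data)
        freqs = [sum(1 for _ in group) for _, group in groupby(data)]
        return -sum((f / n) * log2(f / n) for f in freqs)

    print(zero_order_entropy([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]))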
I think I'll have to code a Shannon entropy in 2D: arrange the list of 32-bit ints as a series of 8-bit bytes and do a Shannon calculation on that; then, to cover how ordered they may be, take the bytes eight at a time and form a new list of bytes composed of bits 0 of the eight, followed by bits 1 of the eight, ..., bits 7 of the eight; then the next 8 original bytes, and so on.
I'll see how it goes/codes...
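Something like the following is what that plan could look like (the grouping into blocks of eight and the bit order are guesses on my part):

    def transpose_bit_planes(block8):
        """Given 8 bytes, return 8 new bytes: new byte k collects bit k of each
        input byte (the bit of input byte i goes to position i of the new byte)."""
        out = []
        for k in range(8):
            b = 0
            for i, byte in enumerate(block8):
                b |= ((byte >> k) & 1) << i
            out.append(b)
        return out

    # 32-bit ints flattened to bytes, then transposed eight bytes at a time.
    ints = [7, 7, 7, 7, 8, 8, 8, 8]
    raw = b"".join(v.to_bytes(4, "big") for v in ints)
    transposed = []
    for offset in range(0, len(raw) - len(raw) % 8, 8):
        transposed.extend(transpose_bit_planes(raw[offset:offset + 8]))
    print(transposed)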
Entropy is a function on probabilities, not data (arrays of ints, or files). Entropy is a measure of disorder, but when the function is modified to take data as input it loses this meaning.
The only true way to generate a measure of disorder for data is to use Kolmogorov complexity. This has problems too; in particular, it is uncomputable and not strictly well defined, since one must arbitrarily pick a base language. The lack of well-definedness can be resolved if the disorder being measured is relative to something that is going to process the data. So when considering compression on a particular computer, the base language would be the assembly language for that computer.
So you could define the disorder of an array of integers as follows:
The length of the shortest program written in Assembly that outputs the array.
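Since Kolmogorov complexity is uncomputable, a common practical stand-in (not the same thing, and not what the definition above strictly says) is the length of the data after running it through a general-purpose compressor:

    import random
    import zlib

    def compressed_size(ints):
        """Rough, compressor-dependent stand-in for the 'shortest program'
        idea: the zlib-compressed length of the data, in bytes."""
        raw = b"".join(v.to_bytes(4, "big", signed=True) for v in ints)
        return len(zlib.compress(raw, 9))

    ordered = list(range(2000))
    shuffled = ordered[:]
    random.shuffle(shuffled)
    # The ordered sequence compresses much better than the shuffled one,
    # even though both contain exactly the same values.
    print(compressed_size(ordered), compressed_size(shuffled))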

Given a number series, finding the Check Digit Algorithm...?

Suppose I have a series of index numbers, each of which includes a check digit. If I have a large enough sample (say, 250 index numbers), is there a way to extract the algorithm that was used to generate the check digit?
I think there should be a programmatic approach, at least to find a set of possible algorithms.
UPDATE: The length of an index number is 8 digits, including the check digit.
No, not in the general case, since the number of possible algorithms is far larger than you may think. A sample of 250 may not be enough to do proper numerical analysis.
For an extreme example, let's say your samples are all 15 digits long. You would not be able to reliably detect an algorithm that changes its behaviour for inputs longer than 15 digits.
If you wanted to be sure, you should reverse engineer the code that checks the numbers for validity (if available).
If you know that the algorithm is drawn from a smaller subset than "every possible algorithm", then it might be possible. But algorithms may be only half the story - there's also the case where multipliers, exponentiation and wrap-around points change even using the same algorithm.
paxdiablo is correct: you can't guess the algorithm without making further assumptions (or having the whole sample space, in which case you can define the algorithm by a lookup table).
However, if the check digit is calculated by some linear formula over the "data digits" (which is a very common case, as you can see in the Wikipedia article), then given enough samples you can recover the coefficients with Gaussian elimination.
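If you are willing to assume a small family of candidate schemes, checking them against the samples is straightforward. A sketch (the weight vectors, the mod-10 form and the generated samples are all assumptions for illustration):

    import random

    def check_digit(data_digits, weights, modulus=10):
        """Hypothetical scheme: check digit = weighted sum of the data digits mod 10."""
        return sum(w * d for w, d in zip(weights, data_digits)) % modulus

    def consistent(number, weights, modulus=10):
        digits = [int(c) for c in number]
        return check_digit(digits[:-1], weights, modulus) == digits[-1]

    CANDIDATE_WEIGHTS = [
        (1, 2, 1, 2, 1, 2, 1),   # Luhn-like alternation, without digit folding
        (2, 3, 4, 5, 6, 7, 8),   # weights in the style of mod-11 schemes
        (7, 3, 1, 7, 3, 1, 7),   # another weighting seen in real-world schemes
    ]

    # Pretend the 250 samples were generated with the second scheme.
    def make_sample(weights):
        data = [random.randint(0, 9) for _ in range(7)]
        return "".join(map(str, data + [check_digit(data, weights)]))

    samples = [make_sample(CANDIDATE_WEIGHTS[1]) for _ in range(250)]

    # With 250 samples it is overwhelmingly likely that only the true scheme survives.
    for weights in CANDIDATE_WEIGHTS:
        if all(consistent(s, weights) for s in samples):
            print("all samples consistent with weights", weights)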

Resources