How can I interpret the features obtained from Chem.RDKFingerprint(mol) - rdkit

I have done the following to obtain the fingerprint from mol file. By converting fp.ToBitString() gives me a vector of length 2048. When I count the 1's are the same as the number of atoms in a molecule. How can we interpret this vector? Any suggestions of link for the explanation would be great.
mol = Chem.MolFromSmiles(ms)
fp = Chem.RDKFingerprint(mol)
fp.ToBitString()
Here is the vector I obtained
'00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000'

From what I can gather the RDKFingerprint is a "Daylight-like" substructure fingerprint that uses a bit vector where each bit is set by the presence of a particular substructure within a molecule. The default settings (maxPath default=7) consider substructures that are a maximum of 7 bonds long. As there is no predefined substructure set, it is impossible to set a bit for each existing pattern so each key is considered as a seed to a pseudo-random number generator ('hashing'). The output of this is a set of bits (nBitsPerHash, default=2) with numbers between 0 and fpSize default=2048 which is used to set the corresponding bits in the fingerprint.
RDKit has a nice utility for interpreting the bits set:
from rdkit.Chem import Draw
from rdkit import Chem
smiles = 'OC(CN1C=NC=N1)(CN1C=NC=N1)C1=C(F)C=C(F)C=C1'
mol = Chem.MolFromSmiles(smiles)
bit_info = {}
fp = Chem.RDKFingerprint(mol, maxPath=5, bitInfo=bit_info)
print(list(fp.GetOnBits())[:10]) # print the first 10 bits set to 1
# using the bit_info dictionary populated by RDKit prepare a visualisation
Draw.DrawRDKitBit(mol, 60, bit_info)
# draw multiple bits (12)
tpls = [(mol, x, bit_info) for x in bit_info]
Draw.DrawRDKitBits(tpls[:12], molsPerRow=4, legends=[str(x) for x in bit_info][:12])
Some recommended reading:
RDKit Documentation
Daylight Fingerprint Theory
RDKit UGM Fingerprint Slides
RDKit Blogspot (Bit Drawing)

Related

EDIT: ES6 : rolling and testing my own PRNG hash hex key generator?

[EDIT: post was originnaly too long... trying to shorten it !]
for building some VNodes based app, I need a function that can generate pseudo random numbers, used as unique IDs, with a 3 or 4 bytes sizes.
Think a random RGB or RBBA, CSS-like string colors like : 'bada55' or '900dcafe'... User could seed the generator by picking a random pixel in a PNG image, for example.
Due to functionnal aspects of my app, generator must a pure function avpoding side effects like : using Math.random...
I'm not aware of artithmetic theories (my courses are so far in past...) but I decided to roll a custom PRNG, using Multiply-With-Carry (MWC) paradigm, and test it empirically with some prime coeffs, and random colors seeds.
My idea is to test it with 1-byte, 2-bytes, 3-bytes, and then 4-bytes outputs : my feelings are :
identifiying "good" primes and potential 'bad' seeds when number of bytes is lower
and try to test it against a cresing number of bytes
[EDITED FROM HERE]
MCW usually works as follow:
# each turn, compute the i-th byte:
Xi[i] = A * Zi[i] + Ci[i]
Ci[i] = Math.floor((A * Zi[i] + Ci[i]) / M)
Zi[i] = Xi[i] % M
where A is the multiplier, C the increment and M the modulus. For bytes the modulus is 256.
It's possible to determine mathematically A and C (prime numbers) to get a full cycle generator.
The trouble is that, when a byte cell starts to loop to its seed value, ALL cells do that... so the period is 256 for 4 bytes !
I need to build something like an odometer to shift values of othrt cells and garantuee a priod of Math.pow(256, 4).
How to achieve that in a simple way, if possible ?

Algorithm to align numerical sequences

Hi
I have two sequence of numerical data let's say :
S1 : 1,6,4,9,8,7,5 and S2 : 6,9,7,5
And i'd like to find a sequence alignment in both sense left-right and right-left.
So i used 2 techniques before asking i actually used the hungarian algorithm but it's not sequencial so it doesn't give good results And i used a modified version of the Needleman–Wunsch algorithm but i think i'm maybe doing it wrong or something and i've been digging for at least 4 months for anything that could help me but i only find genetic algorithms which may be helpful but i was wondering if there's a algorithm that exists that i may haven't seen yet ?
So to formalise my question : How would you align two positive numerical (integer or double) sequences ?
I believe you can accomplish your objective with the following:
import string
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
seq1 = "1649875"
seq2 = "6975"
numDict = {}
for x in range(0,10):
for y in range(0,10):
numDict[(str(x),str(y))] = -abs(x-y)
#print(numDict)
for a in pairwise2.align.globalds(seq1, seq2, numDict, -3, -1):
print(format_alignment(*a)) #prints alignment with best score
#for a in pairwise2.align.globalms(seq1, seq2, 5, -5, -3, -1):
print(format_alignment(*a))
The globalds alignment allows you to use a custom dictionary (in this case, I created a dictionary containing numbers ranging from 1-9 and found the absolute value of their difference when paired). If you just want a flat yes/no scoring system, you could do something like globalms, where a success is +5 and a failure is -5. Note, I advise using gap penalties when performing alignments. Also familiarize yourself with 'global' and 'local' alignments. More information on the Pairwise2 biopython module can be found here: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

random number generator with x,y coordinates as seed

I'm looking for a efficient, uniformly distributed PRNG, that generates one random integer for any whole number point in the plain with coordinates x and y as input to the function.
int rand(int x, int y)
It has to deliver the same random number each time you input the same coordinate.
Do you know of algorithms, that can be used for this kind of problem and also in higher dimensions?
I already tried to use normal PRNGs like a LFSR and merged the x,y coordinates together to use it as a seed value. Something like this.
int seed = x << 16 | (y & 0xFFFF)
The obvious problem with this method is that the seed is not iterated over multiple times but is initialized again for every x,y-point. This results in very ugly non random patterns if you visualize the results.
I already know of the method which uses shuffled permutation tables of some size like 256 and you get a random integer out of it like this.
int r = P[x + P[y & 255] & 255];
But I don't want to use this method because of the very limited range, restricted period length and high memory consumption.
Thanks for any helpful suggestions!
I found a very simple, fast and sufficient hash function based on the xxhash algorithm.
// cash stands for chaos hash :D
int cash(int x, int y){
int h = seed + x*374761393 + y*668265263; //all constants are prime
h = (h^(h >> 13))*1274126177;
return h^(h >> 16);
}
It is now much faster than the lookup table method I described above and it looks equally random. I don't know if the random properties are good compared to xxhash but as long as it looks random to the eye it's a fair solution for my purpose.
This is what it looks like with the pixel coordinates as input:
My approach
In general i think you want some hash-function (mostly all of these are designed to output randomness; avalanche-effect for RNGs, explicitly needed randomness for CryptoPRNGs). Compare with this thread.
The following code uses this approach:
1) build something hashable from your input
2) hash -> random-bytes (non-cryptographically)
3) somehow convert these random-bytes to your integer range (hard to do correctly/uniformly!)
The last step is done by this approach, which seems to be not that fast, but has strong theoretical guarantees (selected answer was used).
The hash-function i used supports seeds, which will be used in step 3!
import xxhash
import math
import numpy as np
import matplotlib.pyplot as plt
import time
def rng(a, b, maxExclN=100):
# preprocessing
bytes_needed = int(math.ceil(maxExclN / 256.0))
smallest_power_larger = 2
while smallest_power_larger < maxExclN:
smallest_power_larger *= 2
counter = 0
while True:
random_hash = xxhash.xxh32(str((a, b)).encode('utf-8'), seed=counter).digest()
random_integer = int.from_bytes(random_hash[:bytes_needed], byteorder='little')
if random_integer < 0:
counter += 1
continue # inefficient but safe; could be improved
random_integer = random_integer % smallest_power_larger
if random_integer < maxExclN:
return random_integer
else:
counter += 1
test_a = rng(3, 6)
test_b = rng(3, 9)
test_c = rng(3, 6)
print(test_a, test_b, test_c) # OUTPUT: 90 22 90
random_as = np.random.randint(100, size=1000000)
random_bs = np.random.randint(100, size=1000000)
start = time.time()
rands = [rng(*x) for x in zip(random_as, random_bs)]
end = time.time()
plt.hist(rands, bins=100)
plt.show()
print('needed secs: ', end-start)
# OUTPUT: needed secs: 15.056888341903687 -> 0,015056 per sample
# -> possibly heavy-dependence on range of output
Possible improvements
Add additional entropy from some source (urandom; could be put into str)
Make a class and initialize to memorize preprocessing (costly if done for each sampling)
Handle negative integers; maybe just use abs(x)
Assumptions:
the ouput-range is [0, N) -> just shift for others!
the output-range is smaller (bits) than the hash-output (may use xxh64)
Evaluation:
Check randomness/uniformity
Check if deterministic regarding input
You can use various randomness extractors to achieve your goals. There are at least two sources you can look for a solution.
Dodis et al, "Randomness Extraction and Key Derivation
Using the CBC, Cascade and HMAC Modes"
NIST SP800-90 "Recommendation for the Entropy Sources Used for
Random Bit Generation"
All in all, you can preferably use:
AES-CBC-MAC using a random key (may be fixed and reused)
HMAC, preferably with SHA2-512
SHA-family hash functions (SHA1, SHA256 etc); using a random final block (eg use a big random salt at the end)
Thus, you can concatenate your coordinates, get their bytes, add a random key (for AES and HMAC) or a salt for SHA and your output has an adequate entropy.
According to NIST, the output entropy relies on the input entropy:
Assuming you use SHA1; thus n = 160bits. Let's suppose that m = input_entropy (your coordinates' entropy)
if m >= 2n then output_entropy=n=160 bits
if 2n < m <= n then maximum output_entropy=m (but full entropy is not guaranteed).
if m < n then maximum output_entropy=m (this is your case)
see NIST sp800-90c (page 11)

image encryption using henon equation

i want to encrypt pixel value using henon equation :
Xi+2 = 1 - a*(Xi+1)*(Xi+1) + bXi (sorry i can't post image)
where a=1.4, b=0.3, x0=0.01, x1=0.02,
with this code :
k[i+2] =1-a*(Math.pow(k[i+1], 2))+b*k[i]
i can get random value from henon equation
1.00244,
-0.40084033504000005,
1.0757898361270288,
-0.7405053806319072,
0.5550494445953806,
0.3465365454865311,
0.99839222507778,
-0.2915408854881054,
1.1805231444476698,
-1.038551118053691,
-0.15586685140049938,
0.6544223990721852,
. after that i rounded the random value
with this code :
inter[i]= (int) Math.round((k[i]*65536)%256)
i can encrypt the pixel value by XOR with random value (henon).
my question :
there are some negative random value from henon, as we know that there aren't negative pixel value.
so may i skip the negative random value (only save positive random
value) to encrypt original pixel value ?
Thanks
You are using the Hénon sequence as a source for pseudo-random numbers, right?
Then you can of course chose to discard negative numbers (or take the absolute value, or do some other fancy thing) - as long as you do the same in encryption and decryption. If there is a specification, it should better be explicit about this.
Maybe you are using Javascript or some other language where % is not modulus, but remainder. If so, see this answer
Three other things to note:
Double-check that you are claculating the right thing. It seems to me that your calculation should read k[i+1] =1-a*(Math.pow(k[i], 2))+b*k[i], since the Hénon sequence only uses the last value.
`
Do you really need to store past values of k? If not, then just use
k =1-a*(Math.pow(k, 2))+b*k
or even better
k = 1 + k * (b - a *k)
(Spoiler warning: This may be the didactical point of an exercise.) The Hénon sequence is chaotic, and floating point errors will sooner or later influence the random numbers. So your random number generator maybe isn't as deterministic as you think.

How to calculate the entropy of a file?

How to calculate the entropy of a file? (Or let's just say a bunch of bytes)
I have an idea, but I'm not sure that it's mathematically correct.
My idea is the following:
Create an array of 256 integers (all zeros).
Traverse through the file and for each of its bytes,
increment the corresponding position in the array.
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the entry's difference
to "average" to the counter.
Well, now I'm stuck. How to "project" the counter result in such a way
that all results would lie between 0.0 and 1.0? But I'm sure,
the idea is inconsistent anyway...
I hope someone has better and simpler solutions?
Note: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
At the end: Calculate the "average" value for the array.
Initialize a counter with zero,
and for each of the array's entries:
add the entry's difference to "average" to the counter.
With some modifications you can get Shannon's entropy:
rename "average" to "entropy"
(float) entropy = 0
for i in the array[256]:Counts do
(float)p = Counts[i] / filesize
if (p > 0) entropy = entropy - p*lg(p) // lgN is the logarithm with base 2
Edit:
As Wesley mentioned, we must divide entropy by 8 in order to adjust it in the range 0 . . 1 (or alternatively, we can use the logarithmic base 256).
A simpler solution: gzip the file. Use the ratio of file sizes: (size-of-gzipped)/(size-of-original) as measure of randomness (i.e. entropy).
This method doesn't give you the exact absolute value of entropy (because gzip is not an "ideal" compressor), but it's good enough if you need to compare entropy of different sources.
To calculate the information entropy of a collection of bytes, you'll need to do something similar to tydok's answer. (tydok's answer works on a collection of bits.)
The following variables are assumed to already exist:
byte_counts is 256-element list of the number of bytes with each value in your file. For example, byte_counts[2] is the number of bytes that have the value 2.
total is the total number of bytes in your file.
I'll write the following code in Python, but it should be obvious what's going on.
import math
entropy = 0
for count in byte_counts:
# If no bytes of this value were seen in the value, it doesn't affect
# the entropy of the file.
if count == 0:
continue
# p is the probability of seeing this byte in the file, as a floating-
# point number
p = 1.0 * count / total
entropy -= p * math.log(p, 256)
There are several things that are important to note.
The check for count == 0 is not just an optimization. If count == 0, then p == 0, and log(p) will be undefined ("negative infinity"), causing an error.
The 256 in the call to math.log represents the number of discrete values that are possible. A byte composed of eight bits will have 256 possible values.
The resulting value will be between 0 (every single byte in the file is the same) up to 1 (the bytes are evenly divided among every possible value of a byte).
An explanation for the use of log base 256
It is true that this algorithm is usually applied using log base 2. This gives the resulting answer in bits. In such a case, you have a maximum of 8 bits of entropy for any given file. Try it yourself: maximize the entropy of the input by making byte_counts a list of all 1 or 2 or 100. When the bytes of a file are evenly distributed, you'll find that there is an entropy of 8 bits.
It is possible to use other logarithm bases. Using b=2 allows a result in bits, as each bit can have 2 values. Using b=10 puts the result in dits, or decimal bits, as there are 10 possible values for each dit. Using b=256 will give the result in bytes, as each byte can have one of 256 discrete values.
Interestingly, using log identities, you can work out how to convert the resulting entropy between units. Any result obtained in units of bits can be converted to units of bytes by dividing by 8. As an interesting, intentional side-effect, this gives the entropy as a value between 0 and 1.
In summary:
You can use various units to express entropy
Most people express entropy in bits (b=2)
For a collection of bytes, this gives a maximum entropy of 8 bits
Since the asker wants a result between 0 and 1, divide this result by 8 for a meaningful value
The algorithm above calculates entropy in bytes (b=256)
This is equivalent to (entropy in bits) / 8
This already gives a value between 0 and 1
I'm two years late in answering, so please consider this despite only a few up-votes.
Short answer: use my 1st and 3rd bold equations below to get what most people are thinking about when they say "entropy" of a file in bits. Use just 1st equation if you want Shannon's H entropy which is actually entropy/symbol as he stated 13 times in his paper which most people are not aware of. Some online entropy calculators use this one, but Shannon's H is "specific entropy", not "total entropy" which has caused so much confusion. Use 1st and 2nd equation if you want the answer between 0 and 1 which is normalized entropy/symbol (it's not bits/symbol, but a true statistical measure of the "entropic nature" of the data by letting the data choose its own log base instead of arbitrarily assigning 2, e, or 10).
There 4 types of entropy of files (data) of N symbols long with n unique types of symbols. But keep in mind that by knowing the contents of a file, you know the state it is in and therefore S=0. To be precise, if you have a source that generates a lot of data that you have access to, then you can calculate the expected future entropy/character of that source. If you use the following on a file, it is more accurate to say it is estimating the expected entropy of other files from that source.
Shannon (specific) entropy H = -1*sum(count_i / N * log(count_i / N))
where count_i is the number of times symbol i occured in N.
Units are bits/symbol if log is base 2, nats/symbol if natural log.
Normalized specific entropy: H / log(n)
Units are entropy/symbol. Ranges from 0 to 1. 1 means each symbol occurred equally often and near 0 is where all symbols except 1 occurred only once, and the rest of a very long file was the other symbol. The log is in the same base as the H.
Absolute entropy S = N * H
Units are bits if log is base 2, nats if ln()).
Normalized absolute entropy S = N * H / log(n)
Unit is "entropy", varies from 0 to N. The log is in the same base as the H.
Although the last one is the truest "entropy", the first one (Shannon entropy H) is what all books call "entropy" without (the needed IMHO) qualification. Most do not clarify (like Shannon did) that it is bits/symbol or entropy per symbol. Calling H "entropy" is speaking too loosely.
For files with equal frequency of each symbol: S = N * H = N. This is the case for most large files of bits. Entropy does not do any compression on the data and is thereby completely ignorant of any patterns, so 000000111111 has the same H and S as 010111101000 (6 1's and 6 0's in both cases).
Like others have said, using a standard compression routine like gzip and dividing before and after will give a better measure of the amount of pre-existing "order" in the file, but that is biased against data that fits the compression scheme better. There's no general purpose perfectly optimized compressor that we can use to define an absolute "order".
Another thing to consider: H changes if you change how you express the data. H will be different if you select different groupings of bits (bits, nibbles, bytes, or hex). So you divide by log(n) where n is the number of unique symbols in the data (2 for binary, 256 for bytes) and H will range from 0 to 1 (this is normalized intensive Shannon entropy in units of entropy per symbol). But technically if only 100 of the 256 types of bytes occur, then n=100, not 256.
H is an "intensive" entropy, i.e. it is per symbol which is analogous to specific entropy in physics which is entropy per kg or per mole. Regular "extensive" entropy of a file analogous to physics' S is S=N*H where N is the number of symbols in the file. H would be exactly analogous to a portion of an ideal gas volume. Information entropy can't simply be made exactly equal to physical entropy in a deeper sense because physical entropy allows for "ordered" as well disordered arrangements: physical entropy comes out more than a completely random entropy (such as a compressed file). One aspect of the different For an ideal gas there is a additional 5/2 factor to account for this: S = k * N * (H+5/2) where H = possible quantum states per molecule = (xp)^3/hbar * 2 * sigma^2 where x=width of the box, p=total non-directional momentum in the system (calculated from kinetic energy and mass per molecule), and sigma=0.341 in keeping with uncertainty principle giving only the number of possible states within 1 std dev.
A little math gives a shorter form of normalized extensive entropy for a file:
S=N * H / log(n) = sum(count_i*log(N/count_i))/log(n)
Units of this are "entropy" (which is not really a unit). It is normalized to be a better universal measure than the "entropy" units of N * H. But it also should not be called "entropy" without clarification because the normal historical convention is to erringly call H "entropy" (which is contrary to the clarifications made in Shannon's text).
For what it's worth, here's the traditional (bits of entropy) calculation represented in C#:
/// <summary>
/// returns bits of entropy represented in a given string, per
/// http://en.wikipedia.org/wiki/Entropy_(information_theory)
/// </summary>
public static double ShannonEntropy(string s)
{
var map = new Dictionary<char, int>();
foreach (char c in s)
{
if (!map.ContainsKey(c))
map.Add(c, 1);
else
map[c] += 1;
}
double result = 0.0;
int len = s.Length;
foreach (var item in map)
{
var frequency = (double)item.Value / len;
result -= frequency * (Math.Log(frequency) / Math.Log(2));
}
return result;
}
Is this something that ent could handle? (Or perhaps its not available on your platform.)
$ dd if=/dev/urandom of=file bs=1024 count=10
$ ent file
Entropy = 7.983185 bits per byte.
...
As a counter example, here is a file with no entropy.
$ dd if=/dev/zero of=file bs=1024 count=10
$ ent file
Entropy = 0.000000 bits per byte.
...
There's no such thing as the entropy of a file. In information theory, the entropy is a function of a random variable, not of a fixed data set (well, technically a fixed data set does have an entropy, but that entropy would be 0 — we can regard the data as a random distribution that has only one possible outcome with probability 1).
In order to calculate the entropy, you need a random variable with which to model your file. The entropy will then be the entropy of the distribution of that random variable. This entropy will equal the number of bits of information contained in that random variable.
If you use information theory entropy, mind that it might make sense not to use it on bytes. Say, if your data consists of floats you should instead fit a probability distribution to those floats and calculate the entropy of that distribution.
Or, if the contents of the file is unicode characters, you should use those, etc.
Calculates entropy of any string of unsigned chars of size "length". This is basically a refactoring of the code found at http://rosettacode.org/wiki/Entropy. I use this for a 64 bit IV generator that creates a container of 100000000 IV's with no dupes and a average entropy of 3.9. http://www.quantifiedtechnologies.com/Programming.html
#include <string>
#include <map>
#include <algorithm>
#include <cmath>
typedef unsigned char uint8;
double Calculate(uint8 * input, int length)
{
std::map<char, int> frequencies;
for (int i = 0; i < length; ++i)
frequencies[input[i]] ++;
double infocontent = 0;
for (std::pair<char, int> p : frequencies)
{
double freq = static_cast<double>(p.second) / length;
infocontent += freq * log2(freq);
}
infocontent *= -1;
return infocontent;
}
Re: I need the whole thing to make assumptions on the file's contents:
(plaintext, markup, compressed or some binary, ...)
As others have pointed out (or been confused/distracted by), I think you're actually talking about metric entropy (entropy divided by length of message). See more at Entropy (information theory) - Wikipedia.
jitter's comment linking to Scanning data for entropy anomalies is very relevant to your underlying goal. That links eventually to libdisorder (C library for measuring byte entropy). That approach would seem to give you lots more information to work with, since it shows how the metric entropy varies in different parts of the file. See e.g. this graph of how the entropy of a block of 256 consecutive bytes from a 4 MB jpg image (y axis) changes for different offsets (x axis). At the beginning and end the entropy is lower, as it part-way in, but it is about 7 bits per byte for most of the file.
Source: https://github.com/cyphunk/entropy_examples. [Note that this and other graphs are available via the novel http://nonwhiteheterosexualmalelicense.org license....]
More interesting is the analysis and similar graphs at Analysing the byte entropy of a FAT formatted disk | GL.IB.LY
Statistics like the max, min, mode, and standard deviation of the metric entropy for the whole file and/or the first and last blocks of it might be very helpful as a signature.
This book also seems relevant: Detection and Recognition of File Masquerading for E-mail and Data Security - Springer
Here's a Java algo based on this snippet and the invasion that took place during the infinity war
public static double shannon_entropy(File file) throws IOException {
byte[] bytes= Files.readAllBytes(file.toPath());//byte sequence
int max_byte = 255;//max byte value
int no_bytes = bytes.length;//file length
int[] freq = new int[256];//byte frequencies
for (int j = 0; j < no_bytes; j++) {
int value = bytes[j] & 0xFF;//integer value of byte
freq[value]++;
}
double entropy = 0.0;
for (int i = 0; i <= max_byte; i++) {
double p = 1.0 * freq[i] / no_bytes;
if (freq[i] > 0)
entropy -= p * Math.log(p) / Math.log(2);
}
return entropy;
}
usage-example:
File file=new File("C:\\Users\\Somewhere\\In\\The\\Omniverse\\Thanos Invasion.Log");
int file_length=(int)file.length();
double shannon_entropy=shannon_entropy(file);
System.out.println("file length: "+file_length+" bytes");
System.out.println("shannon entropy: "+shannon_entropy+" nats i.e. a minimum of "+shannon_entropy+" bits can be used to encode each byte transfer" +
"\nfrom the file so that in total we transfer atleast "+(file_length*shannon_entropy)+" bits ("+((file_length*shannon_entropy)/8D)+
" bytes instead of "+file_length+" bytes).");
output-example:
file length: 5412 bytes
shannon entropy: 4.537883805240875 nats i.e. a minimum of 4.537883805240875 bits can be used to encode each byte transfer
from the file so that in total we transfer atleast 24559.027153963616 bits (3069.878394245452 bytes instead of 5412 bytes).
Without any additional information entropy of a file is (by definition) equal to its size*8 bits. Entropy of text file is roughly size*6.6 bits, given that:
each character is equally probable
there are 95 printable characters in byte
log(95)/log(2) = 6.6
Entropy of text file in English is estimated to be around 0.6 to 1.3 bits per character (as explained here).
In general you cannot talk about entropy of a given file. Entropy is a property of a set of files.
If you need an entropy (or entropy per byte, to be exact) the best way is to compress it using gzip, bz2, rar or any other strong compression, and then divide compressed size by uncompressed size. It would be a great estimate of entropy.
Calculating entropy byte by byte as Nick Dandoulakis suggested gives a very poor estimate, because it assumes every byte is independent. In text files, for example, it is much more probable to have a small letter after a letter than a whitespace or punctuation after a letter, since words typically are longer than 2 characters. So probability of next character being in a-z range is correlated with value of previous character. Don't use Nick's rough estimate for any real data, use gzip compression ratio instead.

Resources