Hashing a key by folding then dividing by a prime number? - algorithm

I want to hash the following key "LOWELL" using a simple hash function that used 3 steps :
Step 1: transform the key into a number.
LOWELL = | L | O | W | E | L | L | | | | | | |
ASCII code: 76 79 87 69 76 76 32 32 32 32 32 32
my question here why it added more 6 empty positions with fixed ASCII code 32
Step 2: fold and add (chop off pieces of the number and add
them together)
7679|8769|7676|3232|3232|3232|
7679+8769+7676+3232+3232+3232 = 33,820
Step 3: take the mod by a prime number
33,820 mod 19937 = 13,883
Another question here for why dividing by prime numbers i found this
answer : Dividing by a number is good when there are sequences of
consecutive numbers. If there are many dierent sequences of
consecutive numbers, dividing by a number that has many small factors
may result in lots of collisions. A prime number is a better choice. but i didn't get it
Step 4: divide by the size of the address space (preferably a prime number). 13,883 mod 101 = 46
finally why it divided the address space ?!
You can find detailed steps here (Slide 350)
Many Thanks in advance for helping

Since your address space contains only 101 slots, you cannot put your record in a position whose address exceeds this limit.
So you take the reminder of division of the output from the hash function (13,883 in your case) by the address space, to make sure that the location of the record falls in the allowed address space.
So h(s) % address_space will always give allowed position within your address space.
Regarding your first question, why do we use a prime number in hashing, this thread will help you:
Why use a prime number in hashCode?

Related

hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, What are the 'prefixes' exactly/what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (25 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (25 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result? The diagram or "I" being 10, or 00001010.
def f(a, b):
if a < 2**b - 1:
print(a)
else:
c = 2**b - 1
remain = a - c
print(c)
if remain >= 128:
while 1:
e = remain % 128
g = e + 128
remain = remain / 128
if remain >= 128:
continue
else:
print(remain)
c+=int(remain)
print(c)
break
As im trying to figure this out, I wrote a quick python implementation of it, It seems that i am left with a few useless variables, one being g which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
For one, What are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages and often they have leading bits that cannot be used to for the actual integer. Therefore, there will often be a few leading digits that will be unavailable to use for the integer itself. They are represented by the X. For the purposes of this calculation it doesn't make what those Xs are: could be 000, or 111, or 010 or...etc. Also, there will not always be 3 Xs - that is just an example. There could only be one leading X, or two, or four...etc.
For example, to look up a previous HPACK decoded header, we use 6.1. Indexed Header Field Representation which starts with a leading 1, followed by the table index value. Therefore that 1 is the X in the previous example. We have 7-bits (instead of only 5-bits in the original example in your question). If the table index value is 127 or less we can represent it using those 7-bits. If it's >= 127 then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example we have 5-bits to represent the integers and only need 4-bits. So we pad it with 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5-bits so need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough", overflow to next byte. 11111 in binary is 31, so we know it's bigger than that, so we'll not waste that, and instead use it, and subtract it from the 40 to give 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well actually only 7 as we'll soon discover, as the first bit is used to signal a further overflow). This is enough so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fix in the first prefixed bit, AND the left over is bigger than 127 that would fit into the available 7 bits of the second byte, then we can't use this overflow mechanism using two bytes. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There's various ways they could go have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the frist byte is set to 31 as per pervious example, the second byte is set to 26 (which is 11010) plus the leading 1 to show we're using the overflow overflow method (so 100011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 100011010 00001010 (including setting X to whatever those values were).
Using 128 mod and multiplier is quite efficient and means this large number (and in fact any number up to 16,383) can be represented in three bytes which is, not uncoincidentally, also the max integer that can be represented in 7 + 7 = 14 bits). But it does take a bit of getting your head around!
If it's bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simply, and efficiently, coded up. Computers can do this pretty easily and quickly.
It seems that i am left with a few useless variables, one being g
You are not print this value in the if statement. Only the left over value in the else. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.

How do I representation percentage in evolutionary Algorithm?

Considering I have 4 chromosomes (gi, i=1 to 4}) to represent 4 percentages of different things so that the sum of 4 percentages are equal to 100. How Do I represent this efficiently?
I know that it is possible by: g1/(g1+g2+g3+g4). However, This is not efficient. Consider all gi=0.2 or all gi=0.1 will represent 25% in these two cases. It is possible to generate many cases where different genes present same percentage. Is there any other efficient way, where unique set of combination of genes present unique set of percentages.
Thanks in advance.
I think you're confusing genes and chromosomes. A chromosome encodes a candidate solution to your problem. A gene is part of a chromosome.
Under this setting, why would you want that constraint on the chromosomes? it sounds like you want it on the genes of a chromosome.
In order to do this you can do a number of things: have each gene encode an integer in [0, 100]. If the genes do not add to 100 in the end, penalize the fitness of those chromosomes.
Another way, which might make crossover operators more natural to apply, is to have each gene store 100 bits. If x bits are set, that means the gene will encode x%.
Yet another way is to have the entire chromosome encode 100 set bits. Then each gene will hold a value x, which represents an interval. The number of set bits between two split points is the percentage associated to that gene. For example:
1 2 3 4 5 6 7 8 ... 100
1 1 1 1 1 1 1 1 ... 1
| | | | |
g1 g2 g3 g4
This can be done by generating 5 random numbers <= 100, sorting them and taking the differences between them.
One way to assign X units to N possibilities is to store X * (N-1) bits. Every unit is given (N-1) bits and if k of the (N-1) bits are set then the unit is assigned to k.
This is easy to work with as there are no invalid solutions and no penalties/repairs are necessary. This makes fitness evaluation, crossover and mutation easier to implement.
For example, the problem is to assign 5 units (X) to one of 4 (N) possibilities. Each individual is (4-1)x5=15 bits.
The bit string: 010 100 000 011 111 assigns the first 2 units to possibility 1 because both groups have 1 bit set. The third unit which has no bits set is assigned to 0. The fourth unit is assigned to 2 and the fifth to 3.
partition units
0 1
1 2
2 1
3 1

Confusion regarding genetic algorithms

My books(Artificial Intelligence A modern approach) says that Genetic algorithms begin with a set of k randomly generated states, called population. Each state is represented as a string over a finite alphabet- most commonly, a string of 0s and 1s. For eg, an 8-queens state must specify the positions of 8 queens, each in a column of 8 squares, and so requires 8 * log(2)8 = 24 bits. Alternatively the state could be represented as 8 digits, each in range from 1 to 8.
[ http://en.wikipedia.org/wiki/Eight_queens_puzzle ]
I don't understand the expression 8 * log(2)8 = 24 bits , why log2 ^ 8? And what are these 24 bits supposed to be for?
If we take first example on the wikipedia page, the solution can be encoded as [2,4,6,8,3,1,7,5] : the first digit gives the row number for the queen in column A, the second for the queen in column B and so on. Now instead of starting the row numbering at 1, we will start at 0. The solution is then encoded with [1,3,5,7,0,6,4]. Any position can be encoded such way.
We have only digits between 0 and 7, if we write them in binary 3 bit (=log2(8)) are enough :
000 -> 0
001 -> 1
...
110 -> 6
111 -> 7
A position can be encoded using 8 times 3 digits, e.g. from [1,3,5,7,2,0,6,4] we get [001,011,101,111,010,000,110,100] or more briefly 001011101111010000110100 : 24 bits.
In the other way, the bitstring 000010001011100101111110 decodes as 000.010.001.011.100.101.111.110 then [0,2,1,3,4,5,7,6] and gives [1,3,2,4,5,8,7] : queen in column A is on row 1, queen in column B is on row 3, etc.
The number of bits needed to store the possible squares (8 possibilities 0-7) is log(2)8. Note that 111 in binary is 7 in decimal. You have to specify the square for 8 columns, so you need 3 bits 8 times

about number of bits required for Fibonacci number

I am reading a algorithms book by S.DasGupta. Following is text snippet from the text regarding number of bits required for nth Fibonacci number.
It is reasonable to treat addition as
a single computer step if small
numbers are being added, 32-bit
numbers say. But the nth Fibonacci
number is about
0.694n bits long, and this can far exceed 32 as n grows. Arithmetic
operations on arbitrarily large
numbers cannot possibly be performed
in a single, constant-time step.
My question is for eg, for Fibonacci number F1 = 1, F2 =1, F3=2, and so on. then substituting "n" in above formula i.e., 0.694n for F1 is approximately 1, F2 is approximately 2 bits, but for F3 and so on above formula fails. I think i didn't understand propely what author mean here, can any one please help me in understanding this?
Thanks
Well,
n 3 4 5 6 7 8
0.694n 2.08 2.78 3.47 4.16 4.86 5.55
F(n) 2 3 5 8 13 21
bits 2 2 3 4 4 5
log(F(n)) 1 1.58 2.32 3 3.7 4.39
Bits required is the base-2 log rounded up, so this is close enough for me.
The value 0.694 comes from the fact that F(n) is the closest integer to (φn)/√5. So log(F(n)) is n * log(phi) - log(sqrt(5)), and log(phi) is 0.694. As n gets bigger, the log(sqrt(5)) and the rounding rapidly become insignificant.
private static int nobFib(int n) // number of bits Fib(n)
{
return n < 6 ? ++n/2 : (int)(0.69424191363061738 * n - 0.1609640474436813);
}
Checked it for n from 0 to 500.000, n=500.000.000, n=1.000.000.000
It's based on Binet's formula.
Needed it for: Fibonacci Sequence Binary Plot.
See: http://bigintegers.blogspot.com/2012/09/fibonacci-sequence-binary-plot-edd-peg.html
First of all, the word about is very important, as in the nth Fibonacci number is about 0.694n bits long. Second, I think the author means when n->infinity. Try some big number and check :)
you cant have say half a bit... the amount of bits must be rounded
so it means
number of bits = Math.ceil(Math.max(0.694*n,32));
so its rounded up for n>32 and 32 for n<32
for 32bit systems that is
and the number may not be exact
I think he's just using the Fibonacci numbers to illustrate his point that for large numbers (>32 bit) addition cannot be assumed to be constant anymore because it involves more than a singe instruction on the CPU.
Why does the formula fail? For F3=2 the binary representation needs 2bits (3 * 0.694 = 2.082) Take F50=12586269025, which can be represented using 33bits (50 * 0.694 = 35) which is still reasonably close to the true value.
N F(N) 0.694*N
1 0 1
2 1 1
3 1 1
4 2 2
5 3 2
6 5 3
7 8 4
8 13 4
etc. That's my interpretation. But then, that means that you have to get to f(47) = 1,836,311,903 before you exceed 32 bits.
The author is basically describing how large numbers affect the performance of the algorithm. To be overly simple, a processor can add numbers of the register size very quickly, if the numbers exceed the register size, more low level processor instructions need to be executed.

What should I use as a check digit algorithm for a base 31 value?

I'm using the following set of values to create a 9 character long base 31 value:
0123456789ABCDEFGHJKLMNPQRTUWXY
I was looking at modifying the Luhn algorithm to work with my base.
My question is:
In base 10, the Luhn algorithm doubles each value in an even position and then if the result is >10 the individual digits of the result are added together.
Should I still be doubling my even position values, or using a higher multiplier?
I'm trying to protect against transposed characters, missing characters, extra characters and just plain wrong digits.
I looked into the Luhn mod N algorithm, but it is very limited in what it can validate against.
I decided to use a modified version of the Freight Container system.
The shipping container system multiples each value by 2^[position] (position starting from 0) and then performs a modulus 11 on the result to get a base 10 check digit (a result of 10 is not recommended).
In this case, the trick is to find values in the range x^0 to x^[length] which are not evenly divisible by the figure you use on the modulus.
I've decided to use 3^[position] as the multiplier and performing a modulus 31 on the sum to get the check digit.
As an example: 0369CFJMK
Character 0 3 6 9 C F J M K
Value 0 3 6 9 12 15 18 21 19
--------------------------------------------------------------
Multiplier 1 3 9 27 81 243 729 2187
Result 0 9 54 243 972 3645 13122 45927
Total 63972 MOD 31 = 19
It seems that with these sort of algorithms, the main requirement is that the multipler is not evenly divisible by the base and that the pattern of the remainders doesn't repeat within the length of the code you want to validate.
Don't reinvent the wheel - use Luhn mod N instead.

Resources