Converting to and from a number system that doesn't have a zero digit - algorithm

Consider Microsoft Excel's column-numbering system. Columns are "numbered" A, B, C, ... , Y, Z, AA, AB, AC, ... where A is 1.
The column system is similar to the base-10 numbering system that we're familiar with in that when any digit has its maximum value and is incremented, its value is set to the lowest possible digit value and the digit to its left is incremented, or a new digit is added at the minimum value. The difference is that there isn't a digit that represents zero in the letter numbering system. So if the "digit alphabet" contained ABC or 123, we could count like this:
(base 3 with zeros added for comparison)
letters   base 3, no 0   base 3, with 0   base 10
-------   ------------   --------------   -------
  -             -               0             0
  A             1               1             1
  B             2               2             2
  C             3              10             3
 AA            11              11             4
 AB            12              12             5
 AC            13              20             6
 BA            21              21             7
 BB            22              22             8
 BC            23             100             9
 CA            31             101            10
 CB            32             102            11
 CC            33             110            12
AAA           111             111            13
Converting from the zeroless system to our base 10 system is fairly simple; it's still a matter of multiplying the power of that space by the value in that space and adding it to the total. So in the case of AAA with the alphabet ABC, it's equivalent to (1*3^2) + (1*3^1) + (1*3^0) = 9 + 3 + 1 = 13.
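As a quick sketch of that direction (the function name and the alphabet parameter are just illustrative), the left-to-right accumulation looks like this in Python:

def zeroless_to_decimal(s, alphabet="ABC"):
    # alphabet[0] has value 1, alphabet[1] value 2, and so on
    base = len(alphabet)
    total = 0
    for ch in s:
        total = total * base + alphabet.index(ch) + 1
    return total

print(zeroless_to_decimal("AAA"))  # 13
print(zeroless_to_decimal("CA"))   # 10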
I'm having trouble converting inversely, though. With a zero-based system, you can use a greedy algorithm moving from largest to smallest digit and grabbing whatever fits. This will not work for a zeroless system, however. For example, converting the base-10 number 10 to the base-3 zeroless system: Though 9 (the third digit slot: 3^2) would fit into 10, this would leave no possible configuration of the final two digits since their minimum values are 1*3^1 = 3 and 1*3^0 = 1 respectively.
Realistically, my digit alphabet will contain A-Z, so I'm looking for a quick, generalized conversion method that can do this without trial and error or counting up from zero.
Edit
The accepted answer by n.m. is primarily a string-manipulation-based solution.
For a purely mathematical solution see kennytm's links:
What is the algorithm to convert an Excel Column Letter into its Number?
How to convert a column number (eg. 127) into an excel column (eg. AA)

Convert to base-3-with-zeroes first (digits 0AB), and from there, convert to base-3-without-zeroes (ABC), using these string substitutions:
A0 => 0C
B0 => AC
C0 => BC
Each substitution either removes a zero, or pushes one to the left. In the end, discard leading zeroes.
It is also possible, as an optimisation, to process longer strings of zeros at once:
A000...000 = 0BBB...BBC
B000...000 = ABBB...BBC
C000...000 = BBBB...BBC
Generalizable to any base.
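A rough Python sketch of this substitution method (the function name and the repeat-until-stable loop are my own framing, not part of the answer):

def to_zeroless_base3(n):
    # Step 1: ordinary base 3 written with the digits 0, A (=1), B (=2).
    # Note: a zeroless system has no representation for 0 itself.
    digits = ""
    while n:
        digits = "0AB"[n % 3] + digits
        n //= 3
    # Step 2: remove zeros with the substitutions A0->0C, B0->AC, C0->BC.
    subs = (("A0", "0C"), ("B0", "AC"), ("C0", "BC"))
    changed = True
    while changed:
        changed = False
        for pattern, repl in subs:
            if pattern in digits:
                digits = digits.replace(pattern, repl, 1)
                changed = True
    return digits.lstrip("0")

print(to_zeroless_base3(10))  # CA
print(to_zeroless_base3(13))  # AAA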

Related

hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, what are the 'prefixes' exactly / what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result: the diagram, or "I" being 10, i.e. 00001010?
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c += int(remain)
                    print(c)
                    break
As I'm trying to figure this out, I wrote a quick Python implementation of it. It seems that I am left with a few useless variables, one being g, which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
For one, what are the 'prefixes' exactly / what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer. So there will often be a few leading bits that are unavailable to the integer itself; they are represented by the Xs. For the purposes of this calculation it doesn't matter what those Xs are: they could be 000, or 111, or 010, or... etc. Also, there will not always be 3 Xs - that is just an example. There could be only one leading X, or two, or four... etc.
For example, to look up a previously decoded HPACK header, we use 6.1. Indexed Header Field Representation, which starts with a leading 1, followed by the table index value. That leading 1 is the X in the previous example. We have 7 bits (instead of only 5 bits in the original example in your question). If the table index value is less than 127 we can represent it using those 7 bits. If it's 127 or more then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example, we have 5 bits to represent the integers and only need 4 bits, so we pad it with a 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5 bits, so we'd need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
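For what it's worth, you can ask Python to pad to a fixed width, which makes the two forms match up:
>>> format(10, '05b')
'01010'
>>> format(10, '08b')
'00001010'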
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough; overflow to the next byte". 11111 in binary is 31, so we know the value is bigger than that; we don't waste it, but use it and subtract it from the 40, giving 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well, actually only 7, as we'll soon discover, since the first bit is used to signal a further overflow). This is enough, so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the first prefixed byte, AND the leftover is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this two-byte overflow mechanism. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod (remainder) and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 11010) plus the leading 1 to show we're using the overflow-overflow method (so 10011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (with the Xs set to whatever those values need to be).
Using the 128 mod and multiplier is quite efficient and means this large number (and in fact any number up to 16,383) can be represented in three bytes (16,383 is, not coincidentally, also the max integer that can be represented in 7 + 7 = 14 bits). But it does take a bit of getting your head around!
If it's bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simple, and efficient, to code up. Computers can do this pretty easily and quickly.
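For instance, here is a bare-bones Python sketch of the RFC 7541 section 5.1 encoding loop (the function name and the return convention - a list of byte values, prefix byte first, with the X bits left at zero - are my choices, not the spec's):

def encode_int(value, prefix_bits):
    max_prefix = (1 << prefix_bits) - 1   # e.g. 31 for a 5-bit prefix
    if value < max_prefix:
        return [value]                    # fits in the prefix itself
    out = [max_prefix]                    # prefix filled with 1s signals overflow
    value -= max_prefix
    while value >= 128:
        out.append(value % 128 + 128)     # low 7 bits, continuation bit set
        value //= 128
    out.append(value)                     # final byte, continuation bit clear
    return out

print([format(b, "08b") for b in encode_int(1337, 5)])
# ['00011111', '10011010', '00001010']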
It seems that I am left with a few useless variables, one being g
You are not printing this value in the if branch, only the leftover value in the else. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag? and an octet contains 8 bits so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 2*3^3 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So 111011 is your number in binary
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
What you basically have is a base 3 number and you want to convert this to a single number 0-255; luckily, 5 digits in ternary (base 3) give only 243 combinations.
What you'll need to do is:
  (1st x 3^4)
+ (2nd x 3^3)
+ (3rd x 3^2)
+ (4th x 3)
+ (5th)
This will give you a number 0 to 242.
You are considering storing some information in a byte. A byte can hold at most 2^8 = 256 distinct states.
Your value has only 3^5 = 243 states, and 243 < 256, which makes the conversion possible.
Consider your pairs are ABCDE (each character can be 0, 1 or 2)
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. I guarantee the result will be in range 0 -- 255.
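A small Python sketch of that packing (and the reverse), assuming the pairs arrive most-significant first as in the worked example above; the function names are illustrative:

def pack(pairs):
    value = 0
    for p in pairs:          # pairs are base-3 digits, each 0, 1 or 2
        value = value * 3 + p
    return value             # always 0..242, so it fits in one byte

def unpack(value):
    pairs = []
    for _ in range(5):
        pairs.append(value % 3)
        value //= 3
    return pairs[::-1]

print(pack([0, 2, 0, 1, 2]))   # 59, matching the example above
print(unpack(59))              # [0, 2, 0, 1, 2]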

Algorithm for sequence calculation

I'm looking for a hint to an algorithm or pseudo code which helps me calculate sequences.
It's kind of like permutations, but not exactly, as it's not fixed length.
The output sequence should look something like this:
A
B
C
D
AA
BA
CA
DA
AB
BB
CB
DB
AC
BC
CC
DC
AD
BD
CD
DD
AAA
BAA
CAA
DAA
...
Every character above represents actually an integer, which gets incremented from a minimum to a maximum.
I do not know the depth when I start, so just using multiple nested for loops won't work.
It's late here in Germany and I just can't wrap my head around this. Pretty sure that it can be done with for loops and recursion, but I have currently no clue on how to get started.
Any ideas?
EDIT: B-typo corrected.
It looks like you're taking all combinations of four distinct digits of length 1, 2, 3, etc., allowing repeats.
So start with length 1: { A, B, C, D }
To get length 2, prepend A, B, C, D in turn to every member of length 1. (16 elements)
To get length 3, prepend A, B, C, D in turn to every member of length 2. (64 elements)
To get length 4, prepend A, B, C, D in turn to every member of length 3. (256 elements)
And so on.
If you have more or fewer digits, the same method will work. It gets a little trickier if you allow, say, A to equal B, but that doesn't look like what you're doing now.
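In Python this length-by-length enumeration can be sketched with itertools, reversing each tuple so the leftmost position varies fastest, matching the order in the question (function name is mine):

from itertools import count, islice, product

def sequences(alphabet):
    for length in count(1):                       # 1, 2, 3, ... no fixed depth
        for combo in product(alphabet, repeat=length):
            yield "".join(reversed(combo))        # leftmost digit cycles fastest

print(list(islice(sequences("ABCD"), 12)))
# ['A', 'B', 'C', 'D', 'AA', 'BA', 'CA', 'DA', 'AB', 'BB', 'CB', 'DB']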
Based on the comments from the OP, here's a way to do the sequence without storing the list.
Use an odometer analogy. This only requires keeping track of indices. Each time the first member of the sequence cycles around, increment the one to the right. If this is the first time that that member of the sequence has cycled around, then add a member to the sequence.
The increments will need to be cascaded. This is the equivalent of going from 99,999 to 100,000 miles (the comma is the thousands marker).
If you have a thousand integers that you need to cycle through, then pretend you're looking at an odometer in base 1000 rather than base 10 as above.
Your sequence looks more like A^(n-1) x A^T, where A is a matrix and A^T is its transpose.
A = [A, B, C, D]
A^T x A^(n-1) for n = 1:
sequence = A, B, C, D
A^T x A^(n-1) for n = 2:
sequence = AA, BA, CA, DA, AB, BB, CB, DB, AC, BC, CC, DC, AD, BD, CD, DD
You can go for any matrix multiplication code like this and implement what you wish.
You have 4 elements; you are simply counting in reversed base-4 notation. Say A=0, B=1, C=2, D=3:
first loop from 0 to 3 on 1 digit
second loop from 00 to 33 on 2 digits
and so on
i    reversed i    output using A,B,C,D digits
loop on 1 digit
0 0 A
1 1 B
2 2 C
3 3 D
loop on 2 digits
00 00 AA
01 10 BA
02 20 CA
03 30 DA
10 01 AB
11 11 BB
12 21 CB
13 31 DB
20 02 AC
21 12 BC
22 22 CC
...
The algorithm is pretty obvious. You could take a look at Algorithm L (lexicographic t-combination generation) in TAOCP fascicle 3a by D. Knuth.
How about:
Private Sub DoIt(minVal As Integer, maxVal As Integer, maxDepth As Integer)
    If maxVal < minVal OrElse maxDepth <= 0 Then
        Debug.WriteLine("no results!")
        Return
    End If
    Debug.WriteLine("results:")
    Dim resultList As New List(Of Integer)(maxDepth)
    ' initialize with the 1st result: this makes processing the remainder easy to write.
    resultList.Add(minVal)
    Dim depthIndex As Integer = 0
    Debug.WriteLine(CStr(minVal))
    Do
        ' find the term to be increased
        Dim indexOfTermToIncrease As Integer = 0
        While resultList(indexOfTermToIncrease) = maxVal
            resultList(indexOfTermToIncrease) = minVal
            indexOfTermToIncrease += 1
            If indexOfTermToIncrease > depthIndex Then
                depthIndex += 1
                If depthIndex = maxDepth Then
                    Return
                End If
                resultList.Add(minVal - 1)
                Exit While
            End If
        End While
        ' increase the term that was identified
        resultList(indexOfTermToIncrease) += 1
        ' output
        For d As Integer = 0 To depthIndex
            Debug.Write(CStr(resultList(d)) + " ")
        Next
        Debug.WriteLine("")
    Loop
End Sub
Would that be adequate? It doesn't take much memory and is relatively fast (apart from the writing to output...).

Confusion regarding genetic algorithms

My book (Artificial Intelligence: A Modern Approach) says that genetic algorithms begin with a set of k randomly generated states, called the population. Each state is represented as a string over a finite alphabet - most commonly, a string of 0s and 1s. For example, an 8-queens state must specify the positions of 8 queens, each in a column of 8 squares, and so requires 8 * log2(8) = 24 bits. Alternatively the state could be represented as 8 digits, each in the range from 1 to 8.
[ http://en.wikipedia.org/wiki/Eight_queens_puzzle ]
I don't understand the expression 8 * log2(8) = 24 bits. Why log2(8)? And what are these 24 bits supposed to be for?
If we take the first example on the wikipedia page, the solution can be encoded as [2,4,6,8,3,1,7,5]: the first digit gives the row number for the queen in column A, the second for the queen in column B, and so on. Now instead of starting the row numbering at 1, we will start at 0. The solution is then encoded as [1,3,5,7,2,0,6,4]. Any position can be encoded this way.
We have only digits between 0 and 7, so if we write them in binary, 3 bits (= log2(8)) are enough:
000 -> 0
001 -> 1
...
110 -> 6
111 -> 7
A position can be encoded using 8 times 3 digits, e.g. from [1,3,5,7,2,0,6,4] we get [001,011,101,111,010,000,110,100] or more briefly 001011101111010000110100 : 24 bits.
In the other direction, the bitstring 000010001011100101111110 decodes as 000.010.001.011.100.101.111.110, then [0,2,1,3,4,5,7,6], and gives [1,3,2,4,5,6,8,7]: the queen in column A is on row 1, the queen in column B is on row 3, etc.
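A short Python sketch of this encode/decode round trip (the function names are mine):

def encode(state):                         # state: eight row numbers, 1-8
    return "".join(format(row - 1, "03b") for row in state)

def decode(bits):                          # bits: a 24-character bit string
    return [int(bits[i:i + 3], 2) + 1 for i in range(0, 24, 3)]

print(encode([2, 4, 6, 8, 3, 1, 7, 5]))    # 001011101111010000110100
print(decode("000010001011100101111110"))  # [1, 3, 2, 4, 5, 6, 8, 7]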
The number of bits needed to store the possible squares (8 possibilities, 0-7) is log2(8) = 3. Note that 111 in binary is 7 in decimal. You have to specify the square for 8 columns, so you need 3 bits 8 times.

What should I use as a check digit algorithm for a base 31 value?

I'm using the following set of values to create a 9 character long base 31 value:
0123456789ABCDEFGHJKLMNPQRTUWXY
I was looking at modifying the Luhn algorithm to work with my base.
My question is:
In base 10, the Luhn algorithm doubles the value in each even position, and then if the result is 10 or more, the individual digits of the result are added together.
Should I still be doubling my even position values, or using a higher multiplier?
I'm trying to protect against transposed characters, missing characters, extra characters and just plain wrong digits.
I looked into the Luhn mod N algorithm, but it is very limited in what it can validate against.
I decided to use a modified version of the Freight Container system.
The shipping container system multiplies each value by 2^[position] (position starting from 0) and then performs a modulus 11 on the result to get a base 10 check digit (a result of 10 is not recommended).
In this case, the trick is to find values in the range x^0 to x^[length] which are not evenly divisible by the figure you use on the modulus.
I've decided to use 3^[position] as the multiplier and performing a modulus 31 on the sum to get the check digit.
As an example: 0369CFJMK
Character      0    3    6     9     C      F      J      M     K
Value          0    3    6     9    12     15     18     21    19
Multiplier     1    3    9    27    81    243    729   2187
Result         0    9   54   243   972   3645  13122  45927

Total: 63972 MOD 31 = 19
It seems that with these sorts of algorithms, the main requirement is that the multiplier is not evenly divisible by the base and that the pattern of the remainders doesn't repeat within the length of the code you want to validate.
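Here is a small Python sketch of that check-character calculation, assuming an 8-character code plus one check character (the names are illustrative):

ALPHABET = "0123456789ABCDEFGHJKLMNPQRTUWXY"   # base 31, from the question

def check_char(code):
    # weight each character's value by 3^position, then take the sum mod 31
    total = sum(ALPHABET.index(ch) * 3 ** pos for pos, ch in enumerate(code))
    return ALPHABET[total % 31]

print(check_char("0369CFJM"))   # K, matching the worked example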
Don't reinvent the wheel - use Luhn mod N instead.

Resources