Convert a two-letter String to a 3-digit number - algorithm

I am working on a software problem and I found myself needing to convert a 2-letter string to a 3-digit number. We're talking about English alphabet only (26 letters).
So essentially I need to convert something like AA, AR, ZF, ZZ etc. to a number in the range 0-999.
We have 676 combinations of letters and 1000 numbers, so the range is covered.
Now, I could just write up a map manually, saying that AA = 1, AB = 2 etc., but I was wondering if maybe there is a better, more "mathematical" or "logical" solution to this.
The order of numbers is of course not relevant, as long as the conversion from letters to numbers is unique and always yields the same results.
The conversion should work both ways (from letters to numbers and from numbers to letters).
Does anyone have an idea?
Thanks a lot

Treat A-Z as 1-26 in base 27, with 0 reserved for blanks.
E.g. 'CD' -> 3 * 27 + 4 = 85
85 -> 85 / 27 = 3, 85 % 27 = 4 -> 'C', 'D'
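For illustration, here's a minimal Python sketch of that base-27 scheme (the function names letters_to_num and num_to_letters are mine, not from the answer):

def letters_to_num(s):
    """Map a two-letter string (A-Z) to a unique number, treating A-Z as 1-26 in base 27."""
    a, b = (ord(c) - ord('A') + 1 for c in s.upper())   # A=1 ... Z=26
    return a * 27 + b                                    # 'AA' -> 28, 'ZZ' -> 728

def num_to_letters(n):
    """Invert letters_to_num."""
    a, b = divmod(n, 27)
    return chr(a + ord('A') - 1) + chr(b + ord('A') - 1)

print(letters_to_num('CD'))   # 85
print(num_to_letters(85))     # 'CD'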

If you don't have to use consecutive numbers, you can view a two-letter string as a base-36 number, so you can just use Python's int function to convert it into an integer.
int('AA', 36) # 370
int('AB', 36) # 371
#...
int('ZY', 36) # 1294
int('ZZ', 36) # 1295
As for how to convert the number back to a string, you can refer to the method on How to convert an integer to a string in any base?
@furry12 Because the difference between the first number and the last one is 1295 - 370 = 925 < 999, which is quite lucky, you can subtract a constant (say 300) from every number and the results will be in the range 0-999:
def str2num(s):
    return int(s, 36) - 300

print(str2num('AA'))  # 70
print(str2num('ZZ'))  # 995
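For the reverse direction (number back to two letters), here is a rough sketch assuming the same base-36 mapping with the -300 offset used above (the DIGITS table and num2str name are mine):

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def num2str(n):
    n += 300                    # undo the offset
    hi, lo = divmod(n, 36)
    return (DIGITS[hi] + DIGITS[lo]).upper()

print(num2str(70))   # 'AA'
print(num2str(995))  # 'ZZ'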

Related

hpack encoding integer significance

After reading this, https://httpwg.org/specs/rfc7541.html#integer.representation
I am confused about quite a few things, although I seem to have the overall gist of the idea.
For one, What are the 'prefixes' exactly/what is their purpose?
For two:
C.1.1. Example 1: Encoding 10 Using a 5-Bit Prefix
The value 10 is to be encoded with a 5-bit prefix.
10 is less than 31 (2^5 - 1) and is represented using the 5-bit prefix.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 0 | 1 | 0 | 1 | 0 | 10 stored on 5 bits
+---+---+---+---+---+---+---+---+
What are the leading Xs? What is the starting 0 for?
>>> bin(10)
'0b1010'
>>>
Typing this in the python IDE, you see almost the same output... Why does it differ?
This is when the number fits within the number of prefix bits though, making it seemingly simple.
C.1.2. Example 2: Encoding 1337 Using a 5-Bit Prefix
The value I=1337 is to be encoded with a 5-bit prefix.
1337 is greater than 31 (2^5 - 1).
The 5-bit prefix is filled with its max value (31).
I = 1337 - (2^5 - 1) = 1306.
I (1306) is greater than or equal to 128, so the while loop body executes:
I % 128 == 26
26 + 128 == 154
154 is encoded in 8 bits as: 10011010
I is set to 10 (1306 / 128 == 10)
I is no longer greater than or equal to 128, so the while loop terminates.
I, now 10, is encoded in 8 bits as: 00001010.
The process ends.
0 1 2 3 4 5 6 7
+---+---+---+---+---+---+---+---+
| X | X | X | 1 | 1 | 1 | 1 | 1 | Prefix = 31, I = 1306
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1306>=128, encode(154), I=1306/128
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 10<128, encode(10), done
+---+---+---+---+---+---+---+---+
The octet-like diagram shows three different numbers being produced... Since the numbers are produced throughout the loop, how do you replicate this octet-like diagram within an integer? What is the actual final result? The diagram, or "I" being 10, i.e. 00001010?
def f(a, b):
    if a < 2**b - 1:
        print(a)
    else:
        c = 2**b - 1
        remain = a - c
        print(c)
        if remain >= 128:
            while 1:
                e = remain % 128
                g = e + 128
                remain = remain / 128
                if remain >= 128:
                    continue
                else:
                    print(remain)
                    c += int(remain)
                    print(c)
                    break
As I'm trying to figure this out, I wrote a quick Python implementation of it. It seems that I am left with a few useless variables, one being g, which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact that 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag, and an octet contains 8 bits, so 8 - 1 = 7?
For one, What are the 'prefixes' exactly/what is their purpose?
Integers are used in a few places in HPACK messages, and often they have leading bits that cannot be used for the actual integer itself. Those unavailable leading bits are represented by the Xs. For the purposes of this calculation it doesn't matter what those Xs are: they could be 000, or 111, or 010 or...etc. Also, there will not always be 3 Xs - that is just an example. There could be only one leading X, or two, or four...etc.
For example, to look up a previously HPACK-decoded header, we use 6.1. Indexed Header Field Representation, which starts with a leading 1, followed by the table index value. Therefore that 1 is the X in the previous example. We have 7 bits (instead of only 5 bits in the original example in your question). If the table index value is less than 127 we can represent it using those 7 bits. If it's 127 or more then we need to do some extra work (we'll come back to this).
If it's a new value we want to add to the table (to reuse in future requests), but we already have that header name in the table (so it's just a new value for that name we want as a new entry) then we use 6.2.1. Literal Header Field with Incremental Indexing. This has 2 bits at the beginning (01 - which are the Xs), and we only have 6-bits this time to represent the index of the name we want to reuse. So in this case there are two Xs.
So don't worry about there being 3 Xs - that's just an example. In the above examples there was one X (as first bit had to be 1), and two Xs (as first two bits had to be 01) respectively. The Integer Representation section is telling you how to handle any prefixed integer, whether prefixed by 1, 2, 3... etc unusable "X" bits.
What are the leading Xs? What is the starting 0 for?
The leading Xs are discussed above. The starting 0 is just because, in this example, we have 5 bits to represent the integer and only need 4 bits, so we pad it with a 0. If the value to encode was 20 it would be 10100. If the value was 40, we couldn't fit it in 5 bits, so we need to do something else.
Typing this in the python IDE, you see almost the same output... Why does it differ?
Python uses 0b to show it's a binary number. It doesn't bother showing any leading zeros. So 0b1010 is the same as 0b01010 and also the same as 0b00001010.
This is when the number fits within the number of prefix bits though, making it seemingly simple.
Exactly. If you need more than the number of bits you have, you don't have space for it. You can't just use more bits as HPACK will not know whether you are intending to use more bits (so should look at next byte) or if it's just a straight number (so only look at this one byte). It needs a signal to know that. That signal is using all 1s.
So to encode 40 in 5 bits, we need to use 11111 to say "it's not big enough", overflow to next byte. 11111 in binary is 31, so we know it's bigger than that, so we'll not waste that, and instead use it, and subtract it from the 40 to give 9 left to encode in the next byte. A new additional byte gives us 8 new bits to play with (well actually only 7 as we'll soon discover, as the first bit is used to signal a further overflow). This is enough so we can use 00001001 to encode our 9. So our complex number is represented in two bytes: XXX11111 and 00001001.
If we want to encode a value bigger than can fit in the first prefixed byte, AND the left-over is bigger than the 127 that would fit into the available 7 bits of the second byte, then we can't use this overflow mechanism with just two bytes. Instead we use another "overflow, overflow" mechanism using three bytes:
For this "overflow, overflow" mechanism, we set the first byte bits to 1s as usual for an overflow (XXX11111) and then set the first bit of the second byte to 1. This leaves 7 bits available to encode the value, plus the next 8 bits in the third byte we're going to have to use (actually only 7 bits of the third byte, because again it uses the first bit to indicate another overflow).
There are various ways they could have gone about this using the second and third bytes. What they decided to do was encode this as two numbers: the 128 mod, and the 128 multiplier.
1337 = 31 + (128 * 10) + 26
So that means the first byte is set to 31 as per the previous example, the second byte is set to 26 (which is 11010) plus the leading 1 to show we're using the overflow-overflow method (so 10011010), and the third byte is set to 10 (or 00001010).
So 1337 is encoded in three bytes: XXX11111 10011010 00001010 (with X set to whatever those values actually were).
Using the 128 mod and multiplier is quite efficient and means this large number (and in fact any left-over value up to 16,383) can be represented in three bytes, which is, not uncoincidentally, also the max integer that can be represented in 7 + 7 = 14 bits. But it does take a bit of getting your head around!
If it's bigger than 16,383 then we need to do another round of overflow in a similar manner.
All this seems horrendously complex but is actually relatively simple, and efficient, to code up. Computers can do this pretty easily and quickly.
It seems that I am left with a few useless variables, one being g
You are not printing this value (g) in the if branch of the loop, only the left-over value in the else branch. You need to print both.
which in the documentation is the 26 + 128 == 154.
Lastly, where does 128 come from? I can't find any relation between the numbers besides the fact that 2 raised to the 7th power is 128, but why is that significant? Is this because the first bit is reserved as a continuation flag, and an octet contains 8 bits, so 8 - 1 = 7?
Exactly, it's because the first bit (value 128) needs to be set as per explanation above, to show we are continuing/overflowing into needing a third byte.
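To tie it together, here is a rough Python sketch of the prefixed-integer encoding loop, following the pseudocode in RFC 7541 section 5.1 (the function name encode_int and the flags parameter are my own; the X bits would normally come from the surrounding HPACK representation):

def encode_int(value, prefix_bits, first_byte_flags=0):
    """Encode `value` with an N-bit prefix; `first_byte_flags` holds the X bits."""
    max_prefix = (1 << prefix_bits) - 1          # e.g. 31 for a 5-bit prefix
    if value < max_prefix:
        return bytes([first_byte_flags | value])
    out = [first_byte_flags | max_prefix]        # fill the prefix with all 1s
    value -= max_prefix
    while value >= 128:
        out.append((value % 128) + 128)          # low 7 bits, continuation bit set
        value //= 128
    out.append(value)                            # final byte, continuation bit clear
    return bytes(out)

print(encode_int(10, 5).hex())    # 0a      -> 00001010
print(encode_int(1337, 5).hex())  # 1f9a0a  -> 00011111 10011010 00001010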

Why does `"0a".to_i(16)` return `10`?

I'm confused about the optional argument for to_i.
Specifically, what "base" means, and how it impacts the method in this example:
"0a".to_i(16) #=> 10
I have trouble with the optional argument in regards to the string the method is called on. I thought that the return value would just be an integer value of 0.
Simple answer: It's because 0a or a in Hexadecimal is equal to 10 in Decimal.
And base, in other words radix, means the number of unique digits in a numeral system.
In decimal, we have 10 digits, 0 to 9, to represent numbers.
In hexadecimal, there are 16 digits instead: apart from 0 to 9, we use a to f to represent the values 10 to 15.
You can test it like this:
"a".to_i(16)
#=> 10
"b".to_i(16)
#=> 11
"f".to_i(16)
#=> 15
"g".to_i(16)
#=> 0 # Because it's not a correct hexadecimal digit/number.
'2c'.to_i(16)
#=> 44
'2CH2'.to_i(16)
#=> 44 # Extraneous characters past the end of a valid number are ignored, and it's case insensitive.
9.to_s.to_i(16)
#=> 9
10.to_s.to_i(16)
#=> 16
In other words, 10 in Decimal is equal to a in Hexadecimal.
And 10 in Hexadecimal is equal to 16 in Decimal. (Doc for to_i)
Note that usually we use the 0x prefix for hexadecimal numbers:
"0xa".to_i(16)
#=> 10
"0x100".to_i(16)
#=> 256
Btw, you can just use these representations in Ruby:
num_hex = 0x100
#=> 256
num_bin = 0b100
#=> 4
num_oct = 0o100
#=> 64
num_dec = 0d100
#=> 100
Hexadecimal, binary, octal, decimal (the 0d prefix is superfluous of course; it's just used in some cases for clarification).

Converting to and from a number system that doesn't have a zero digit

Consider Microsoft Excel's column-numbering system. Columns are "numbered" A, B, C, ... , Y, Z, AA, AB, AC, ... where A is 1.
The column system is similar to the base-10 numbering system that we're familiar with in that when any digit has its maximum value and is incremented, its value is set to the lowest possible digit value and the digit to its left is incremented, or a new digit is added at the minimum value. The difference is that there isn't a digit that represents zero in the letter numbering system. So if the "digit alphabet" contained ABC or 123, we could count like this:
(base 3 with zeros added for comparison)
        base 3 no 0   base 3 with 0   base 10 with 0
        -----------   -------------   --------------
  -          -              0               0
  A          1              1               1
  B          2              2               2
  C          3             10               3
 AA         11             11               4
 AB         12             12               5
 AC         13             20               6
 BA         21             21               7
 BB         22             22               8
 BC         23            100               9
 CA         31            101              10
 CB         32            102              11
 CC         33            110              12
AAA        111            111              13
Converting from the zeroless system to our base 10 system is fairly simple; it's still a matter of multiplying the power of that space by the value in that space and adding it to the total. So in the case of AAA with the alphabet ABC, it's equivalent to (1*3^2) + (1*3^1) + (1*3^0) = 9 + 3 + 1 = 13.
I'm having trouble converting inversely, though. With a zero-based system, you can use a greedy algorithm moving from largest to smallest digit and grabbing whatever fits. This will not work for a zeroless system, however. For example, converting the base-10 number 10 to the base-3 zeroless system: Though 9 (the third digit slot: 3^2) would fit into 10, this would leave no possible configuration of the final two digits since their minimum values are 1*3^1 = 3 and 1*3^0 = 1 respectively.
Realistically, my digit alphabet will contain A-Z, so I'm looking for a quick, generalized conversion method that can do this without trial and error or counting up from zero.
Edit
The accepted answer by n.m. is primarily a string-manipulation-based solution.
For a purely mathematical solution see kennytm's links:
What is the algorithm to convert an Excel Column Letter into its Number?
How to convert a column number (eg. 127) into an excel column (eg. AA)
Convert to base-3-with-zeroes first (digits 0AB), and from there, convert to base-3-without-zeroes (ABC), using these string substitutions:
A0 => 0C
B0 => AC
C0 => BC
Each substitution either removes a zero, or pushes one to the left. In the end, discard leading zeroes.
It is also possible, as an optimisation, to process longer strings of zeros at once:
A000...000 = 0BBB...BBC
B000...000 = ABBB...BBC
C000...000 = BBBB...BBC
Generalizable to any base.
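For completeness, here is a small Python sketch of the purely mathematical approach referred to in the question's edit (rather than the string-substitution method above): subtract 1 before each division so the digits run 1..base instead of 0..base-1. The function names are mine:

def to_zeroless(n, alphabet="ABC"):
    """Convert a positive integer to the zeroless (bijective) representation."""
    base = len(alphabet)
    digits = []
    while n > 0:
        n, rem = divmod(n - 1, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))

def from_zeroless(s, alphabet="ABC"):
    base = len(alphabet)
    n = 0
    for ch in s:
        n = n * base + alphabet.index(ch) + 1
    return n

print(to_zeroless(10))                                 # 'CA'
print(to_zeroless(13))                                 # 'AAA'
print(from_zeroless("AAA"))                            # 13
print(to_zeroless(28, "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))   # 'AB' (Excel column 28)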

Find the nearest nice number

Given a base currency of GBP £, and a table of other currencies accepted in a shop:
Currency       Symbol   Subunits    LastToGBPRate
------------------------------------------------------
US Dollars     $              100      0.592662000
Euros          €              100      0.810237000
Japanese Yen   ¥                1      0.005834610
Bitcoin        ฿        100000000    301.200000000
We have a working method that converts a given amount in GBP Pence (AKA cents) into Currency X cents. Given a price of 999 (£9.99), for the above currencies it would return:
Currency       Amount (cents)
------------------------------
US Dollars               1686
Euros                    1233
Japanese Yen             1755
Bitcoin               3482570
This is all working absolutely fine. We then have a Format Currency method which converts them all into nice looking numbers:
Currency       Formatted
---------------------------
US Dollars         $16.86
Euros              €12.33
Japanese Yen        ¥1755
Bitcoin       ฿0.03482570
Now the problem we want to solve, is to round these amounts to the nearest meaningful pretty number in a general purpose algorithm given the information above.
This serves two important benefits:
Prices for most currencies should appear static for visitors over short-medium term time frames
Presents the visitor with a culturally meaningful price point which encourages sales
A meaningful number is one where the smallest unit displayed isn't smaller than the value of say £0.10, and a pretty number is one which ends in 49 or 99. Example outputs:
Currency       Formatted      Meaningful and Pretty
-----------------------------------------------------
US Dollars        $16.86             $16.99
Euros             €12.33             €12.49
Japanese Yen       ¥1755              ¥1749
Bitcoin      ฿0.03482570            ฿0.0349
I know it is possible to do this with a single algorithm with all the information given, but I'm struggling to work out even where to start. Can anyone show me how to achieve this, or give pointers?
Please note, storing a general formatting rule for each currency is not adequate because, assume for example the price of Bitcoin 10x's, the formatting rule will need updating. I'm looking for a solution that doesn't need any manual maintenance/checking.
For a given decimal value X, you want to find the integer Y such that Y*A + B is as close as possible to X, for some given A and B. E.g. in the case of the dollar, you have A = .5 and B = .49.
In general, for your problem, A and B can be computed via the formula:
V = value of £0.10 in target currency
K = smallest power of ten (10^k) such that 9*10^k >= V and k <= -2
    (this condition I added based on your examples, but contrary to your definition)
  = 10^min(-2, ceil(log10(V / 9)))
A = 50 * K
B = 49 * K
Note that without the extra condition, since 0.09 dollars is less than 0.10 pounds, we would get 14.9 as the result for 16.86 dollars.
With some transformation we get
Y ~ (X - B) / A
And since Y is an integer, we have
Y = round((X - B) / A)
The result is then YA + B.
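Here's a rough Python sketch of that formula, using the rates from the question (the function name pretty_price and the use of plain floats are my own simplifications; it reproduces the dollar, euro and Bitcoin examples):

import math

def pretty_price(amount, rate_to_gbp):
    """amount: price in the target currency's main units (e.g. 16.86 dollars)."""
    V = 0.10 / rate_to_gbp                       # value of £0.10 in the target currency
    k = min(-2, math.ceil(math.log10(V / 9)))    # capped exponent, as described above
    K = 10 ** k
    A, B = 50 * K, 49 * K
    Y = round((amount - B) / A)
    return Y * A + B

print(round(pretty_price(16.86, 0.592662), 2))     # 16.99
print(round(pretty_price(12.33, 0.810237), 2))     # 12.49
print(round(pretty_price(0.03482570, 301.2), 4))   # 0.0349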
Convert £0.10 to the current currency to determine the smallest displayable digit (SDD)
(bounded by the number of available digits in that currency).
Now we basically have 3 choices of numbers:
... (3rdSDD-1) 9 9 (if 3rdSDD is 0, it will obviously carry from 4thSDD and so on, as subtraction normally works)
We'll pick this when 10*2ndSDD + 1stSDD < 24
... 3rdSDD 4 9
We'll pick this when 24 <= 10*2ndSDD + 1stSDD < 74
... 3rdSDD 9 9
We'll pick this when 74 <= 10*2ndSDD + 1stSDD
It should be trivial to figure it out from here.
Some multiplication and modulus to get you 2ndSDD and 1stSDD.
Basic subtraction to get you ... (3rdSDD-1).
A few if-statements to pick one of the above cases.
Example:
For $16.86, our 3 choices are $15.99, $16.49 and $16.99.
We pick $16.99 since 74 < 86.
For €12.33, our 3 choices are €11.99, €12.49 and €12.99.
We pick €12.49 since 24 <= 33 < 74.
For ¥1755, our 3 choices are ¥1699, ¥1749 and ¥1799.
We pick ¥1749 since 24 <= 55 < 74.
For ฿0.03482570, our 3 choices are ฿0.0299, ฿0.0349 and ฿0.0399.
We pick ฿0.0349 since 24 <= 48 < 74.
And, just to show the carry:
For $100000.23, our 3 choices are $99999.99, $100000.49 and $100000.99.
We pick $99999.99 since 23 < 24.
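A quick Python sketch of the three-choice selection above, assuming the amount is in integer cents and the SDD is the cents position, as in the dollar and euro examples (scaling for currencies like Yen or Bitcoin is left out; the function name is mine):

def round_to_pretty(cents):
    last_two = cents % 100             # 2ndSDD and 1stSDD
    if last_two < 24:
        return cents - last_two - 1    # ...(3rdSDD-1)99, borrow handled by subtraction
    elif last_two < 74:
        return cents - last_two + 49   # ...3rdSDD 49
    else:
        return cents - last_two + 99   # ...3rdSDD 99

print(round_to_pretty(1686))       # 1699    -> $16.99
print(round_to_pretty(1233))       # 1249    -> €12.49
print(round_to_pretty(10000023))   # 9999999 -> $99999.99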
Here's an ugly answer:
def retail_round(number):
    """takes a decimal.Decimal and retail rounds it"""
    ending_digits = str(number)[-2:]
    if not ending_digits in ("49", "99"):
        rounding_adjust = (99 - int(ending_digits)) % 50
        if rounding_adjust <= 25:
            number = str(number)[:-2] + str(int(ending_digits) + int(rounding_adjust))
        else:
            if str(number)[-3] == '.':
                number = str(int(number) - .01)
            else:
                number = str(int(str(number)[:-2] + "00") - 1)
    return decimal.Decimal(number)
>>> import decimal
>>> retail_round(decimal.Decimal("15.50"))
Decimal('14.99')
>>> retail_round(decimal.Decimal("15.51"))
Decimal('14.99')
>>> retail_round(decimal.Decimal("15.75"))
Decimal('15.99')
>>> retail_round(decimal.Decimal("1575"))
Decimal('1599')
>>> retail_round(decimal.Decimal("1550"))
Decimal('1499')
EDIT: this is a bit better solution, using decimal.Decimal
import collections

Currency = collections.namedtuple("Currency", ["name", "symbol", "subunits"])

def retail_round(currency, amount):
    """returns a decimal.Decimal amount of the currency, rounded to
    49 or 99."""
    adjusted = (amount / currency.subunits) % 100  # last two digits
    print(adjusted)
    if adjusted < 24:
        amount -= (adjusted + 1) * currency.subunits   # down to 99
    elif 24 <= adjusted < 74:
        amount -= (adjusted - 49) * currency.subunits  # to 49
    else:
        amount -= (adjusted - 99) * currency.subunits  # up to 99
    return amount
Calculate the maximum length of the price; assume it's something like 0.00001. (You can do that by converting £0.10 to the currency, taking the base-10 log of it, taking its ceiling, and using that power of 10.)
Eg: £0.10 = 17.1421309¥
log(17.1421309) = 1.234
ceil(1.234) = 2
10^2 = 100
so
¥174055 will be ¥174900
Adjust the number for the digit, add 1, round to 50, subtract 1:
174055 -> (round((174055/100+1)/50)*50-1)*100 = 174900
Plain and simple.
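As a sketch, the same formula in Python, only reproducing the ¥174055 example from the post (the function name scale_round and its arguments are my own naming):

import math

def scale_round(amount_in_smallest_units, gbp_010_in_currency):
    """Round using the digit-scale + round-to-50 formula described above."""
    scale = 10 ** math.ceil(math.log10(gbp_010_in_currency))
    return (round((amount_in_smallest_units / scale + 1) / 50) * 50 - 1) * scale

print(scale_round(174055, 17.1421309))  # 174900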

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 2*3^3 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So 111011 is your number in binary
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
What you basically have is a base 3 number, and you want to convert this to a single number 0 - 255; luckily, 5 digits in ternary (base 3) give only 243 combinations.
What you'll need to do is:
Digit Action
( 1st x 3^4)
+ (2nd x 3^3)
+ (3rd x 3^2)
+ (4th x 3)
+ (5th)
This will give you a number 0 to 242.
You are considering storing some information in a byte. A byte can represent at most 2^8 = 256 states.
You have a total of 3^5 = 243 < 256 states, which makes the conversion possible.
Consider your pairs as ABCDE (each character can be 0, 1 or 2).
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. I guarantee the result will be in the range 0 - 255.
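A small Python sketch of that packing and its inverse (the function names pack_pairs and unpack_pairs are mine):

def pack_pairs(pairs):
    """pairs: sequence of 5 values in {0, 1, 2}, most significant first."""
    value = 0
    for p in pairs:
        value = value * 3 + p
    return value                      # guaranteed 0..242, so it fits in one byte

def unpack_pairs(value):
    pairs = []
    for _ in range(5):
        value, p = divmod(value, 3)
        pairs.append(p)
    return list(reversed(pairs))

print(pack_pairs([0, 2, 0, 1, 2]))   # 59  (the |00|10|00|01|10| example)
print(unpack_pairs(59))              # [0, 2, 0, 1, 2]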
