Algorithm for compacting hexadecimal GUID in URL-safe way? - algorithm

I've got a database with rows that are identified by 32-character hexadecimal GUIDs (stored as binary). I'm wondering how to compress these strings dynamically into a shorter, but still user-friendly representation... ideally for use in shared URLs. Since they're 32 characters hexadecimally (and case-insensitive currently) ... I tried hitting the binary representation with base64 encoding. That got them from 32 characters to 22 characters, but I wasn't sure if there was anything better that was common yet straightforward.
I'm also thinking about getting creative, given that even emoji now is technically URL-safe. Not sure if that's a good idea, though.
Has anyone considered cross-platform solutions for this problem before? Is it better just to generate new IDs entirely with a smaller subset?

You are allowed to use 0-9, a-z, A-Z, and !$'()*+,-._~ in a URI (which does not include the characters with special syntax interpretations). That's 74 characters. That is a little better than 64. You can use a simple scheme to pull 6 or 7 bits from your stream of bits and use that to select one of the allowed URI characters.
To encode, pull six bits from your stream. If it is less than 54, then emit the corresponding character in the set of 74. If it is 54 or more, pull one more bit on the bottom of that. You now have a seven bit number in the range of 108..127. Subtract 108 and add 54 to get the range 54..73. Emit that character from the set.
You now have an average number of bits per character of 6*54/74 + 7*20/74 = 6.27. Or 1.276 characters per byte. Your 16-byte ID would then be encoded, on average, in 20.4 characters. Actually a bit more since you will have to stuff a few zero bits at the end to get the last character out. The real-world average is 21.1303, with a minimum of 19 and a maximum of 22.
This is faster and simpler than trying to do a base conversion with large integers, and gives essentially the same performance, 21 characters.
Do your 16-byte IDs tend to have leading or trailing zeros, or other patterns amendable to compression? If so, then you can arrange the encoding scheme to use fewer characters for those cases.

See this Javascript implementation:
function toDigits(n, b){
var digits = []
while(n.isPositive()){
digits.push(n.remainder(b).valueOf())
n = n.quotient(b);
}
return digits
}
function fromDigits(digits, b){
n = BigInteger(0);
for(var i=0;i<digits.length;i++){
var d=parseInt(digits[i],b);
n = n.multiply(b).add(d);
}
return n;
}
function changebase(n,from_base,to_base){
var temp=fromDigits(n,from_base);
return toDigits(temp,to_base);
}
var unreserved_characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~";
var number_of_unreserved_characters=unreserved_characters.length;
var guid="9ec54806c242982ca059661b6db74ab9";
var newbase=changebase(guid,16,number_of_unreserved_characters);
var newurl="";
for(var i=0;i<newbase.length;i++){
newurl+=unreserved_characters[newbase[i]];
}
I used a BigInteger library http://silentmatt.com/biginteger/.
This implementation converts the hex into a new base that is the number of unreserved characters allowed in URI. This might be a little better than base64 since it hase 2 additional characters for a total of 66 characters compared to 64 in base64. That might not make much difference though. So depending on whether you dont mind browser compatibility, you can add other ascii characters to the list.
for instance using:
var unreserved_characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø׃áíóúñѪº¿®¬½¼¡«»░▒▓│┤ÁÂÀ©╣║╗╝¢¥┐└┴┬├─┼ãÃ╚╔╩╦╠═╬¤ðÐÊËÈıÍÎÏ┘┌█▄¦Ì▀ÓßÔÒõÕµþÞÚÛÙýݯ´≡±‗¾¶§÷¸°¨·¹³²■";
has much more characters and reduces the size even more, and might work on your target browsers.

Related

Tiny URL system design

I have read and watched many youtube videos & links which all provide same solution which is:
Use a distributed counter like zookeeper
Counter max limit can be 3.5 trillion
Convert the Counter value to Base62
which is all fine when the counter value is small.
e.g.
generated counter value: 120001 => base62 value FMJQmhBR
but when the counter provides large counter value like below the base62 value length also increases.
generated counter value: 120003658=> base62 value HRGZF8RiHC6y
So how can this be a solution for exact tiny url with exact 8 length.
https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html
https://www.youtube.com/watch?v=eCLqmPBIEYs
https://www.youtube.com/watch?v=JQDHz72OA3c&t=1862s
First: there absolutely is a compression limit. If your chosen representation has a maximum length, that imposes a hard limit on your key space.
Let's unpack that a little. Let's say you've got 80 guests at a party, and you want to give each guest a unique label (for their drink cup or something). If you've decided that each label will be a single letter from the English alphabet, then you only have enough unique labels for 26 guests.
Second: FMJQmhBR is not the most efficient way to represent the number 120001. It takes 17 bits in binary: 11101010011000001 (not sure which endianness that is). 16 bits is just two ASCII characters, and three ASCII characters can accommodate nearly 17 million unique values. And that's without any kind of special, ZIP-like compression.
--
I think most URL shorteners work essentially by assigning a counting number to each URL that someone shortens. So, the very first URL that gets submitted will be given ID=1: they save the whole URL in the database and associate it with that number. The second URL gets ID=2, etc.
That's pretty crude, though. For a variety of reasons, they don't want to hand those IDs out in order. But if they know how long they want the identifiers to be, it's not hard hand those IDs out in random order:
When someone submits a URL, the system picks a random number between 0 and the highest-possible ID. If the URL identifiers are all supposed to be 8 ASCII characters, that means they pick a random number between 0 and 2^(8*8) = 1.844674407e19.
Then they check their DB to see if they've handed out that ID. If they have, they pick a different random number. They repeat this until they pick an ID that hasn't been handed out. (I think there are more efficient algorithms for this, but the effect is the same and this is easiest to understand.)
Given that you are not hashing every url, but a vaguely-predictable number, you could hash the result and take the first N bits
However, there are many solutions for what to do for collisions
ignore them - they will be rare (ideally)
choose the next value
hash the result again (with your input)
increment the size of the returned string
...
Here's a great video on cuckoo hashing (which is a structure of hashes relevant here):
https://www.youtube.com/watch?v=HRzg0SzFLQQ
Here's an example in Python which finds an 8-character string from the hash which should be fairly unique (this could then be collected into a sorted data structure mapping it into a URL)
This works by first hashing the value with an avalanching hash (SHA-265) and then loops to find the minimum amount of it (slice from the front of the hex string) to form an 8-char base62 string
This could be made much more efficient (even, for example by bisecting), but may be clearer as-is and depends hugely on unspecified algorithm requirements
import hashlib
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
m = hashlib.sha256()
m.update("https://stackoverflow.com/questions/65714033/tiny-url-system-design".encode())
digest = m.digest() # hash as bytes b',\xdb3\x8c\x98g\xd6\x8b\x99\xb6\x98#.\\\xd1\x07\xa0\x8f\x1e\xb4\xab\x1eg\xdd\xda\xd6\xa3\x1d\xb0\xb2`9'
hex_str = digest.hex() # string of hex chars 2cdb338c9867d68b99b698232e5cd107a08f1eb4ab1e67dddad6a31db0b26039
for hashlen in range(100, 1, -1):
number = int(hex_str[:hashlen], 16) # first_n_chars(str(hex)) -> decimal
val = ""
while number != 0:
val = "{}{}".format(BASE62[number % 62], val) # append new chars to front
number = number // 62 # integer division
if len(val) <= 8:
break
print(val) # E0IxW0zn
base62 logic from How to fix the code for base62 encoding with Python3?

hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash

I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table, where the primary key is auto increment (max size is 2^32) and I have to show an invoice no to the client which has to be of decent characters length (I'm thinking 12) and so since the client shouldn't get id as 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that many to one mapping should never take place, and should not be slow.
Would it be okay if I use a common hashing mechanism and then drop the extra characters. (md5 for example in php generates 32 character string)?
The way I understand, there is no need to be secure cryptographically, and so I can generate a custom hash if possible.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function
Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique cache. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..232] need 32 bit (duh!). You have 12 printable characters, which give you 72 bits if you stay within Base-64 encoding range of characters. You don't even need that many characters - you can use three bits per character for the initial eight characters, and two bits per character for the last four digits. This way your 12 characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits - you could use letters for some groups of digits, to give it a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with least significant bits when you generate these representations, then proceed to least significant, or make an arbitrary (but fixed) map of which bits go to what digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and let one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go on from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply Have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this too an extreme level
Think about it for a second. Let's say you've got a number of length 10. And you want to represent that number with 5 characters, so a 50% rate compression scheme. First, we work out how many possible numbers you can represent with 10 digits.. which is..
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 digits:
2^5 = 32
So, you can only display 32 different numbers with 5 bits, whereas you can display 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen..
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle and in this example our value for n is greater than our value for m, hence we need to map values from our compressed values to more than one normal value, which makes things incredibly complex. (thankyou Oli for reminding me).
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 108 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

encoding most efficient way 64 character sequence for lesser writing time to memory

The problem is as follows: Given a 64 charater sequences which is built from the english alphabet having 26 charcaters therefore just case characters, the occurrence distribution is such that any character has an equal chance of occurring at a given time.
Due to the fact that I have some computation which needs to be done with regards to the sequences, which requires writing to a text files, since the amount of sequences goes beyond a given ram. I thought of encoding a sequence such that I would be able to have lesser amount of bytes to write to a text file per given sequence.
With such reasoning I thought of the L-Z which would allow me to go down to 40 bytes. Is there any way by which i can go lower to encode a 64 character sequence?
With a large(-ish) lookup table you could encode each of the possible 26^64 character sequences in 301 (actually 300.8281==log2(26^64)) bits. This is slightly less than the 320 bits your straightforward compression would use. It is also the theoretical minimum given that any of the 26 characters occurs with equal probability.
Since you could derive the lookup table at any time you don't even need to store it. I suppose the bits used to represent the functions to encode a character string into a 301-bit integer and vice-versa ought to be counted into your compression ratio.
This is, of course, a long-winded restatement of #lhf's comment.

How to encode a number as a string such that the lexicographic order of the generated string is in the same order as the numeric order

For eg. if we have two strings 2 and 10, 10 will come first if we order lexicographically.
The very trivial sol will be to repeat a character n number of time.
eg. 2 can be encoded as aa
10 as aaaaaaaaaa
This way the lex order is same as the numeric one.
But, is there a more elegant way to do this?
When converting the numbers to strings make sure that all the strings have the same length, by appending 0s in the front if necessary. So 2 and 10 would be encoded as "02" and "10".
While kjampani's solution is probably the best and easiest in normal applications, another way which is more space-efficient is to prepend every string with its own length. Of course, you need to encode the length in a way which is also consistently sorted.
If you know all the strings are fairly short, you can just encode their length as a fixed-length base-X sequence, where X is the number of character codes you're willing to use (popular values are 64, 96, 255 and 256.) Note that you have to use the character codes in lexicographical order, so normal base64 won't work.
One variable-length order-preserving encoding is the one used by UTF-8. (Not UTF-8 directly, which has a couple of corner cases which will get in the way, but the same encoding technique. The order-preserving property of UTF-8 is occasionally really useful.) The full range of such compressed codes can encode values up to 42 bits long, with an average of five payload bits per byte. That's sufficient for pretty long strings; four terabyte long strings are pretty rare in the wild; but if you need longer, it's possible, too, by extending the size prefix over more than one byte.
Break the string into successive sub strings of letters and numbers and then sort by comparing each substring as an integer if it's an numeric string
"aaa2" ---> aaa + 2
"aaa1000" ---> aaa + 1000
aaa == aaa
Since they're equal, we continue:
1000 > 2
Hence, aaa1000 > aaa2.

Resources