Tiny URL system design - algorithm

I have read and watched many youtube videos & links which all provide same solution which is:
Use a distributed counter like zookeeper
Counter max limit can be 3.5 trillion
Convert the Counter value to Base62
which is all fine when the counter value is small.
e.g.
generated counter value: 120001 => base62 value FMJQmhBR
but when the counter provides large counter value like below the base62 value length also increases.
generated counter value: 120003658=> base62 value HRGZF8RiHC6y
So how can this be a solution for exact tiny url with exact 8 length.
https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html
https://www.youtube.com/watch?v=eCLqmPBIEYs
https://www.youtube.com/watch?v=JQDHz72OA3c&t=1862s

First: there absolutely is a compression limit. If your chosen representation has a maximum length, that imposes a hard limit on your key space.
Let's unpack that a little. Let's say you've got 80 guests at a party, and you want to give each guest a unique label (for their drink cup or something). If you've decided that each label will be a single letter from the English alphabet, then you only have enough unique labels for 26 guests.
Second: FMJQmhBR is not the most efficient way to represent the number 120001. It takes 17 bits in binary: 11101010011000001 (not sure which endianness that is). 16 bits is just two ASCII characters, and three ASCII characters can accommodate nearly 17 million unique values. And that's without any kind of special, ZIP-like compression.
--
I think most URL shorteners work essentially by assigning a counting number to each URL that someone shortens. So, the very first URL that gets submitted will be given ID=1: they save the whole URL in the database and associate it with that number. The second URL gets ID=2, etc.
That's pretty crude, though. For a variety of reasons, they don't want to hand those IDs out in order. But if they know how long they want the identifiers to be, it's not hard hand those IDs out in random order:
When someone submits a URL, the system picks a random number between 0 and the highest-possible ID. If the URL identifiers are all supposed to be 8 ASCII characters, that means they pick a random number between 0 and 2^(8*8) = 1.844674407e19.
Then they check their DB to see if they've handed out that ID. If they have, they pick a different random number. They repeat this until they pick an ID that hasn't been handed out. (I think there are more efficient algorithms for this, but the effect is the same and this is easiest to understand.)

Given that you are not hashing every url, but a vaguely-predictable number, you could hash the result and take the first N bits
However, there are many solutions for what to do for collisions
ignore them - they will be rare (ideally)
choose the next value
hash the result again (with your input)
increment the size of the returned string
...
Here's a great video on cuckoo hashing (which is a structure of hashes relevant here):
https://www.youtube.com/watch?v=HRzg0SzFLQQ
Here's an example in Python which finds an 8-character string from the hash which should be fairly unique (this could then be collected into a sorted data structure mapping it into a URL)
This works by first hashing the value with an avalanching hash (SHA-265) and then loops to find the minimum amount of it (slice from the front of the hex string) to form an 8-char base62 string
This could be made much more efficient (even, for example by bisecting), but may be clearer as-is and depends hugely on unspecified algorithm requirements
import hashlib
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
m = hashlib.sha256()
m.update("https://stackoverflow.com/questions/65714033/tiny-url-system-design".encode())
digest = m.digest() # hash as bytes b',\xdb3\x8c\x98g\xd6\x8b\x99\xb6\x98#.\\\xd1\x07\xa0\x8f\x1e\xb4\xab\x1eg\xdd\xda\xd6\xa3\x1d\xb0\xb2`9'
hex_str = digest.hex() # string of hex chars 2cdb338c9867d68b99b698232e5cd107a08f1eb4ab1e67dddad6a31db0b26039
for hashlen in range(100, 1, -1):
number = int(hex_str[:hashlen], 16) # first_n_chars(str(hex)) -> decimal
val = ""
while number != 0:
val = "{}{}".format(BASE62[number % 62], val) # append new chars to front
number = number // 62 # integer division
if len(val) <= 8:
break
print(val) # E0IxW0zn
base62 logic from How to fix the code for base62 encoding with Python3?

Related

Need a function to create a smaller hash from a larger hash

Basically I have 24 digit hexadecimal id values. I need a function that can take one of these ids and turn it into a 10 digit decimal value. It does not need to be cryptographically secure. I just need to ensure that the same input will always get the same output, and that the resulting value has as low of a chance as possible to be the same with different inputs.
Something like this should work as you are asking -
shorterHash := MD5(hexId) # 16 byte
return shortedHash.substring(0, 10) # this is still non-colliding enough
I hope there are many standard implementations of MD5 on your language on internet.

hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash

I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table, where the primary key is auto increment (max size is 2^32) and I have to show an invoice no to the client which has to be of decent characters length (I'm thinking 12) and so since the client shouldn't get id as 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that many to one mapping should never take place, and should not be slow.
Would it be okay if I use a common hashing mechanism and then drop the extra characters. (md5 for example in php generates 32 character string)?
The way I understand, there is no need to be secure cryptographically, and so I can generate a custom hash if possible.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function
Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique cache. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..232] need 32 bit (duh!). You have 12 printable characters, which give you 72 bits if you stay within Base-64 encoding range of characters. You don't even need that many characters - you can use three bits per character for the initial eight characters, and two bits per character for the last four digits. This way your 12 characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits - you could use letters for some groups of digits, to give it a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with least significant bits when you generate these representations, then proceed to least significant, or make an arbitrary (but fixed) map of which bits go to what digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.

Encode an array of integers to a short string

Problem:
I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400), containing mostly 0's, some 1's, a few 2's. Although unlikely, it is also possible to have bigger numbers.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this, into a shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", and use zeroes to denote any value higher than 1. For example: {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right, and write down "2 zeroes, one, 3 zeroes, one, 0 zeroes, one (this would add to the previous one, and thus have a two), 1 zero, one".
To eliminate the need of delimiters (commas) and thus saves more space, I tried to use only one alphanumerical character to denote delta values of 0 to 35 (using 0 to y), while leaving letter z as "35 PLUS the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".
That's as far as I got... the resultant string is still very long (it would be about 20 characters long in the above example). I would ideally want something like, 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-lenth encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (a zero-run-length of 0 will indicate successive non-zero values...)
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 3s, 4s and 5s (0.28%)
339 0s (94.17%)
Assuming that your numbers are not independent of each other and you do not have any other information, the total entropy of your data is 0.407 bits per number, that is 146.4212 bits overall (18.3 bytes). So it is impossible to encode in 8 bytes.

Creating a large number of random serial keys

I am developing a application where user is only authenticated if he enters a serial key correctly. This serial key is matched with the serial key already present in database. This serial key should consists of 6 digit and 3 alphabets. Users get these key in written form from me.
How do I generate a large number (100,000,000 approx) of serial keys?
Simplest way in my opinion is:
find a desired number in range [1,10000000), let it be x and it
will represent your 6 digits.
Find 3 characters (let's assume in range a-z). Let them be a,b,c.
Your number is (a-'a' + b-'a'*26 + c-'a'*26*26)*10000000 + x
(Where 'a' is the ascii representation of the character 'a', and a-'a' means the numeric ascii subtraction)
Note that the generated number can even fit in a long - assuming it is 64 bits.
Generation of several numbers can be simply done with iteration, start with a=b=c='a', and x=0, and start advancing (increase by 1). Once you reached x=10000000, increase a, if a passed z - increase b, .... (similar to what you would do in normal integer addition arithmetic).
You can use a secure random number generator like SecureRandom to generate the keys.
But as already mentioned in a comment, if you generate that many keys, the space of all possible keys is not much larger. So it will be easy for an attacker to try out keys until one is found that works.

Algorithms: random unique string

I need to generate string that meets the following requirements:
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digital characters) should be upper case.
I will store them in a data base after generation (they will be assigned to other entities).
My intention is to do something like this:
Generate 2 random values from 0 to 9—they will be used for digits in the string;
generate 6 random values from 0 to 25 and add them to 64—they will be used as 6 symbols;
concatenate everything into one string;
check if the string already exists in the data base; if not—repeat.
My concern with regard to that algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the data base).
Question: could you please give advice on how to improve this algorithm to be more deterministic?
Thanks.
it should be unique string;
string length should be 8 characters;
it should contains 2 digits;
all symbols (non-digital characters) - should be upper case.
Assuming:
requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
the "symbols" in requirement #4 are the 26 capital letters A through Z
you would like an evenly-distributed random string
Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:
Generate two different integers between 0 and 7, and sort them.
Generate 2 random numbers from 0 to 9.
Generate 6 random letters from A to Z.
Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
Put the 6 random letters in the remaining positions.
There are 28 possibilities for the two different integers ((8*8 - 8 duplicates) / 2 orderings), 266 possibilities for the letters, and 100 possibilities for the numbers, the total # of valid combinations being Ncomb = 864964172800 = 8.64 x 1011.
edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and Nmax <= Ncomb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)
This is possible with Feistel networks, which are commonly used in hash functions and symmetric cryptography (including AES). You'd probably want to choose Nmax = 239 which is the largest power of 2 <= Ncomb, and use a 39-bit Feistel network, using a constant key you keep secret. You then plug in your counter to the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:
Repeat the following step 6 times:
Take X mod 26, generate a capital letter, and set X = X / 26.
Take X mod 100 to generate your two digits, and set X = X / 100.
X will now be between 0 and 17 inclusive (239 / 266 / 100 = 17.796...). Map this number to two unique digit positions (probably easiest using a lookup table, since we're only talking 28 possibilities. If you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number).
Follow the random approach above, but use the numbers generated by this algorithm instead.
Alternatively, use 40-bit numbers, and if the output of your Feistel network is > Ncomb, then increment the counter and try again. This covers the entire string space at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
But this isn't something to get into unless you know what you're doing.
Are these user passwords? If so, there are a couple of things you need to take into account:
You must avoid 0/O and I/1, which can easily be mistaken for each other.
You must avoid too many consecutive letters, which might spell out a rude word.
As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).
If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following pseudocode:
for letters in ( 'AAAAAA' .. 'ZZZZZZ' ) {
for numbers in ( 00 .. 99 ) {
string = letters + numbers
}
}
This will create unique strings eight characters long, with two digits and six upper-case letters.
If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.
I think you're safe well into your tens of thousands of such ID's, and even after that you're most likely alright.
Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.
I'm willing to bet money though that you'll never see the extra functionality kick in during your life time :)
Do it the other way around: generate one big random number that you will split up to obtain the individual characters:
long bigrandom = ...;
int firstDigit = bigRandom % 10;
int secondDigit = ( bigrandom / 10 ) % 10;
and so on.
Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.
However, when you try to insert a new value, and it's already in the databse, you can easily find the smallest unallocated number graeter than the originally generated number, and use that instead of the one you generated.
What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
For one thing, your list of requirements doesn't state that string has to be necessary random, so you might consider something like database index.
If 'random' is a requirement, you can do a few improvements.
Store string as a number in database. Not sure how much this improves perfromance.
Do not store used strings at all. You can employ 'index' approach above, but convert integer number to a string in a seemingly random fashion (e.g., employing bit shift). Without much research, nobody will notice pattern.
E.g., if we have sequence 1, 2, 3, 4, ... and use cyclic binary shift right by 1 bit, it'll be turned into 4, 1, 5, 2, ... (assuming we have 3 bits only)
It doesn't have to be a shift too, it can be a permutation or any other 'randomization'.
The problem with your approach is clearly that while you have few records, you are very unlikely to get collisions but as your number of records grows the chance will increase until it becomes more likely than not that you'll get a collision. Eventually you will be hitting multiple collisions before you get a 'valid' result. Every time will require a table scan to determine if the code is valid, and the whole thing turns into a mess.
The simplest solution is to precalculate your codes.
Start with the first code 00AAAA, and increment to generate 00AAAB, 00AAAC ... 99ZZZZ. Insert them into a table in random order. When you need a new code, retrieve to top record unused record from the table (then mark it as used). It's not a huge table, as pointed out above - only a few million records.
You don't need to calculate any random numbers and generate strings for each user (already done)
You don't need to check whether anything has already been used, just get the next available
No chance of getting multiple collisions before finding something usable.
If you ever need more 'codes', just generate some more 'random' strings and append them to the table.

Resources