Need a function to create a smaller hash from a larger hash

Basically I have 24 digit hexadecimal id values. I need a function that can take one of these ids and turn it into a 10 digit decimal value. It does not need to be cryptographically secure. I just need to ensure that the same input will always get the same output, and that the resulting value has as low of a chance as possible to be the same with different inputs.

Something like this should work for what you are asking:
shorterHash := MD5(hexId) # 16 bytes
return shorterHash.substring(0, 10) # this is still non-colliding enough
Standard MD5 implementations should be readily available for your language.
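Since the question asks for a 10-digit decimal value rather than a hex prefix, here is a minimal Python sketch of one way to get there: hash the id with MD5, read the digest as an integer, and reduce it modulo 10^10. The function name `short_decimal_id` and the sample id are made up for illustration, and lower-casing the input is an assumption about how the ids should be normalised. With only 10^10 possible outputs, collisions become likely once you have on the order of 100,000 ids, so treat this as a sketch rather than a guarantee of uniqueness.

```python
import hashlib


def short_decimal_id(hex_id: str, digits: int = 10) -> str:
    """Map a hex id to a fixed-width decimal string (deterministic, not unique)."""
    # Normalise case so "ABC" and "abc" hash to the same value (an assumption).
    digest = hashlib.md5(hex_id.lower().encode("ascii")).digest()
    # Interpret the 16-byte digest as a big integer and keep the low decimal digits.
    number = int.from_bytes(digest, "big") % (10 ** digits)
    # Left-pad with zeros so the result always has exactly `digits` characters.
    return str(number).zfill(digits)


if __name__ == "__main__":
    print(short_decimal_id("507f1f77bcf86cd799439011"))
```

The same input always yields the same output; if true uniqueness matters, the results still need to be checked against the values already issued.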

Related

Tiny URL system design

I have read many links and watched many YouTube videos, which all provide the same solution:
Use a distributed counter like zookeeper
Counter max limit can be 3.5 trillion
Convert the Counter value to Base62
This is all fine when the counter value is small.
e.g.
generated counter value: 120001 => base62 value FMJQmhBR
But when the counter provides a large value like the one below, the base62 value length also increases.
generated counter value: 120003658 => base62 value HRGZF8RiHC6y
So how can this be a solution for a tiny URL with a length of exactly 8?
https://www.linqz.io/2018/10/how-to-build-a-tiny-url-service-that-scales-to-billions.html
https://www.youtube.com/watch?v=eCLqmPBIEYs
https://www.youtube.com/watch?v=JQDHz72OA3c&t=1862s

First: there absolutely is a compression limit. If your chosen representation has a maximum length, that imposes a hard limit on your key space.
Let's unpack that a little. Let's say you've got 80 guests at a party, and you want to give each guest a unique label (for their drink cup or something). If you've decided that each label will be a single letter from the English alphabet, then you only have enough unique labels for 26 guests.
Second: FMJQmhBR is not the most efficient way to represent the number 120001. It takes 17 bits in binary: 11101010011000001 (not sure which endianness that is). 16 bits is just two ASCII characters, and three ASCII characters can accommodate nearly 17 million unique values. And that's without any kind of special, ZIP-like compression.
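To make the size argument concrete, here is a minimal Python sketch of a plain base62 encoder, assuming the digit order 0-9, A-Z, a-z (other orderings change the characters but not the length). The helper names `base62_encode` and `short_code` are hypothetical. Under this encoding 120001 needs only three characters, and any counter below 62^7 = 3,521,614,606,208 (about 3.5 trillion) fits in seven, so padding to a fixed width of 8 is always possible for the counter range mentioned in the question.

```python
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(number: int) -> str:
    """Encode a non-negative integer in base62, most significant digit first."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return BASE62[0]
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(BASE62[remainder])
    return "".join(reversed(chars))


def short_code(counter: int, width: int = 8) -> str:
    """Left-pad the base62 form to a fixed width; fail if it cannot fit."""
    code = base62_encode(counter)
    if len(code) > width:
        raise OverflowError("counter does not fit in the requested width")
    return code.rjust(width, BASE62[0])


if __name__ == "__main__":
    print(base62_encode(120001))   # VDV
    print(short_code(120001))      # 00000VDV
```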
--
I think most URL shorteners work essentially by assigning a counting number to each URL that someone shortens. So, the very first URL that gets submitted will be given ID=1: they save the whole URL in the database and associate it with that number. The second URL gets ID=2, etc.
That's pretty crude, though. For a variety of reasons, they don't want to hand those IDs out in order. But if they know how long they want the identifiers to be, it's not hard to hand those IDs out in random order:
When someone submits a URL, the system picks a random number between 0 and the highest-possible ID. If the URL identifiers are all supposed to be 8 ASCII characters, that means they pick a random number between 0 and 2^(8*8) = 1.844674407e19.
Then they check their DB to see if they've handed out that ID. If they have, they pick a different random number. They repeat this until they pick an ID that hasn't been handed out. (I think there are more efficient algorithms for this, but the effect is the same and this is easiest to understand.)
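The pick-and-retry loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the in-memory `used` set stands in for the database lookup, and `allocate_id` is a hypothetical name. A real service would need the check and the insert to happen atomically in the database.

```python
import secrets

ID_BITS = 64  # 8 characters of 8 bits each, as in the text above


def allocate_id(used: set, bits: int = ID_BITS) -> int:
    """Pick random ids until one is found that has not been handed out yet."""
    while True:
        candidate = secrets.randbits(bits)  # uniform in [0, 2**bits)
        if candidate not in used:           # stand-in for the DB lookup
            used.add(candidate)
            return candidate


if __name__ == "__main__":
    handed_out = set()
    print(allocate_id(handed_out))
    print(allocate_id(handed_out))
```

With a 64-bit space the loop almost never retries; retries only become frequent once a large fraction of the key space is in use.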

Given that you are not hashing every url, but a vaguely-predictable number, you could hash the result and take the first N bits
However, there are many solutions for what to do for collisions
ignore them - they will be rare (ideally)
choose the next value
hash the result again (with your input)
increment the size of the returned string
...
Here's a great video on cuckoo hashing (which is a structure of hashes relevant here):
https://www.youtube.com/watch?v=HRzg0SzFLQQ
Here's an example in Python which finds an 8-character string from the hash which should be fairly unique (this could then be collected into a sorted data structure mapping it into a URL)
This works by first hashing the value with an avalanching hash (SHA-256) and then looping to find how much of it (a slice from the front of the hex string) fits into an 8-char base62 string
This could be made much more efficient (even, for example by bisecting), but may be clearer as-is and depends hugely on unspecified algorithm requirements
import hashlib
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
m = hashlib.sha256()
m.update("https://stackoverflow.com/questions/65714033/tiny-url-system-design".encode())
digest = m.digest()  # hash as bytes b',\xdb3\x8c\x98g\xd6\x8b\x99\xb6\x98#.\\\xd1\x07\xa0\x8f\x1e\xb4\xab\x1eg\xdd\xda\xd6\xa3\x1d\xb0\xb2`9'
hex_str = digest.hex()  # string of hex chars 2cdb338c9867d68b99b698232e5cd107a08f1eb4ab1e67dddad6a31db0b26039
for hashlen in range(100, 1, -1):
    number = int(hex_str[:hashlen], 16)  # first_n_chars(str(hex)) -> decimal
    val = ""
    while number != 0:
        val = "{}{}".format(BASE62[number % 62], val)  # prepend new chars to front
        number = number // 62  # integer division
    if len(val) <= 8:
        break
print(val)  # E0IxW0zn
base62 logic from How to fix the code for base62 encoding with Python3?

I need to generate a 32 digit random number using lua, please suggest some way

I want to generate a 32 digit key containing alphanumeric characters and they should always be UNIQUE. Please suggest a way of doing that.
I've used the math.random function, but I get the same random number over and over again. I want it to be unique.

If you want a unique number you have to have a list of used numbers. Otherwise you will always have the chance of getting a used number again. Although it is quite unlikely with 32 digits.
You obviously did not read the documentation on math.random:
http://www.lua.org/manual/5.3/manual.html#pdf-math.random
Otherwise you would know that math.random will always give you the same pseudo random numbers unless you change the seed value using math.randomseed...
Please make sure to read the documentation on functions before you use them.
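The two points above (a fixed seed reproduces the same sequence, and uniqueness needs a record of used values) can be illustrated with a short sketch. It is written in Python rather than Lua, so it shows the idea and not the Lua API; `seeded_digits` and `unique_key` are hypothetical names, and the `used` set stands in for whatever storage holds the keys already issued.

```python
import random
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 alphanumeric characters


def seeded_digits(seed: int, count: int = 5) -> list:
    """A generator seeded with the same value always yields the same sequence."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(count)]


def unique_key(used: set, length: int = 32) -> str:
    """Draw random alphanumeric keys until one is not in the used set."""
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in used:
            used.add(key)
            return key


if __name__ == "__main__":
    print(seeded_digits(1) == seeded_digits(1))  # True: same seed, same numbers
    issued = set()
    print(unique_key(issued))
```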

The trivial program below generates 1000 random 32-bit keys without repetition and with no wasted effort:
M=2^32-1
R={}
N=1000
n=0
for i=1,N do
  local x=math.random(M)
  if R[x]==nil then n=n+1 R[x]=n print(n,i,n==i,x) end
end
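The point is that M is so large that repetitions are very unlikely for N=1000. If we increase N to one million, the birthday estimate N^2/(2M) predicts roughly a hundred repeats, which the R[x]==nil check simply skips.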

hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash

I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table where the primary key is auto increment (max size is 2^32), and I have to show the client an invoice number of a decent length (I'm thinking 12 characters). Since the client shouldn't get an id like 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that a many-to-one mapping should never take place, and that it should not be slow.
Would it be okay if I use a common hashing mechanism and then drop the extra characters (md5 in PHP, for example, generates a 32-character string)?
The way I understand it, there is no need to be cryptographically secure, and so I can generate a custom hash if possible.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function

Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique hash. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..2^32 - 1] need 32 bits (duh!). You have 12 printable characters, which give you 72 bits if you stay within the Base-64 range of characters. You don't even need that many bits per character: you can use three bits per character for the initial eight characters, and two bits per character for the last four. This way the first eight characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits: you could use letters for some groups of digits, to give it a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with the least significant bits when you generate these representations, then proceed to the most significant, or make an arbitrary (but fixed) map of which bits go to what digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.
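A minimal Python sketch of the fixed bit map described above: the 32 bits are shuffled by a fixed permutation, then emitted as eight 3-bit digits followed by four 2-bit digits. The particular permutation `PERM` and the names `encode_invoice` / `decode_invoice` are illustrative assumptions; any fixed permutation of the 32 bit positions would do. Note that this only scrambles the appearance; it is not a secret and is easy to reverse.

```python
# Hypothetical fixed bit permutation: 13 is coprime to 32, so this visits
# every bit position exactly once.
PERM = [(i * 13 + 5) % 32 for i in range(32)]


def encode_invoice(n: int) -> str:
    """Map a 32-bit id to 12 digits: eight in 0..7, then four in 0..3."""
    if not 0 <= n < 2 ** 32:
        raise ValueError("id must fit in 32 bits")
    mixed = 0
    for out_bit, in_bit in enumerate(PERM):
        mixed |= ((n >> in_bit) & 1) << out_bit
    digits = []
    for _ in range(8):              # 8 digits x 3 bits = 24 bits
        digits.append(str(mixed & 0b111))
        mixed >>= 3
    for _ in range(4):              # 4 digits x 2 bits = 8 bits
        digits.append(str(mixed & 0b11))
        mixed >>= 2
    return "".join(digits)


def decode_invoice(code: str) -> int:
    """Invert encode_invoice by reassembling the bits and undoing the map."""
    mixed, shift = 0, 0
    for ch in code[:8]:
        mixed |= int(ch) << shift
        shift += 3
    for ch in code[8:12]:
        mixed |= int(ch) << shift
        shift += 2
    n = 0
    for out_bit, in_bit in enumerate(PERM):
        n |= ((mixed >> out_bit) & 1) << in_bit
    return n


if __name__ == "__main__":
    for invoice_id in (1, 2, 3):
        code = encode_invoice(invoice_id)
        print(invoice_id, code, decode_invoice(code))
```

Because every input bit lands in exactly one output digit, two different ids can never produce the same 12-character string.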

Chicken/Egg problem: Hash of file (including hash) inside file! Possible?

Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.
I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.
I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.
I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.
This is not what I want.
I want to know whether there is an algorithm where it's possible to get the resulting hash from data where the very result of the hash itself is included.
It doesn't need to be cryptographic or fulfill a lot of criteria. It can also be based on some heuristics that deliver the desired result after a realistic amount of time.
I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynomial cyclic back-reference division stuff that makes this possible?
And if not, what's the proof against it (if there is one)?
The reason why I need this is because I want (ultimately) to store a hash along with MP4 files. It's complicated, but other solutions are not easy to implement as the file walks through a badly designed production pipeline...

It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.
It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.
(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2^(-8*byte_offset) in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)
Note: as per @KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)

Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.
For example, let
hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.
Let
Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)
Then
hash(Z) == hash(X) == (last 32 bits of Z) / 65521
The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less than 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. And so we can encode the hash into a 32-bit integer that will not affect the hash. You could actually use any number less than 2^16; 65521 just happens to be the largest such prime number.
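The construction above can be checked with a few lines of Python. This is a sketch under assumptions the answer leaves open: blocks are read as big-endian unsigned integers, the data length is a multiple of 4 bytes, and the function names are made up for illustration.

```python
import struct

MOD = 65521  # largest prime below 2**16


def block_sum_hash(data: bytes) -> int:
    """Sum of all non-overlapping 32-bit blocks, modulo 65521."""
    if len(data) % 4:
        raise ValueError("data length must be a multiple of 4 bytes")
    total = 0
    for (block,) in struct.iter_unpack(">I", data):
        total += block
    return total % MOD


def embed_hash(data: bytes) -> bytes:
    """Append hash(data) * 65521; the trailer is 0 mod 65521, so the hash is unchanged."""
    return data + struct.pack(">I", block_sum_hash(data) * MOD)


def verify(blob: bytes) -> bool:
    """Recover the embedded hash from the trailer and compare with the hash of the whole blob."""
    (trailer,) = struct.unpack(">I", blob[-4:])
    return trailer % MOD == 0 and trailer // MOD == block_sum_hash(blob)


if __name__ == "__main__":
    z = embed_hash(b"abcdefgh")
    print(block_sum_hash(z) == block_sum_hash(b"abcdefgh"), verify(z))
```

As the answer says, this only catches accidental changes; anyone can recompute the trailer after modifying the data.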

I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Although in theory you could create such a file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. Even more, he would have a choice: to use exactly your approach to obtain such a file, or just to get rid of the check.
It means that now you have two problems instead of one, and both should be implemented with the same complexity. It's up to you to decide if it is worth it.
EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.

No, not possible. You either use a separate file for hashes, à la md5sum, or the embedded hash is only for the "data" portion of the file.

The way the Nix package manager does this is as follows: when calculating the hash, you pretend that the contents of the hash field in the file are some fixed value, like 20 x's, rather than the hash of the file. Then you write the hash over those 20 x's. When you check the hash, you read it and again ignore it, pretending the field holds the fixed value of 20 x's while hashing.
They do this because the paths at which a package is installed depend on the hash of the whole package. Since the hash is of fixed length, they set it to some fixed value and then replace it with the real hash; when verifying, they ignore the value they placed and pretend it is that fixed value.
But if you don't use such a method, it is impossible.
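A minimal Python sketch of the placeholder idea, not of Nix's actual implementation: it uses SHA-256 and a 64-character run of x's (so the hex digest fits exactly), where the answer mentions 20 x's. The names `seal` and `check` and the way the slot is located are assumptions made for illustration.

```python
import hashlib

SLOT_LEN = 64                    # length of a SHA-256 hex digest
PLACEHOLDER = b"x" * SLOT_LEN    # fixed value hashed in place of the real hash


def seal(data: bytes) -> bytes:
    """Hash the data with the placeholder in place, then overwrite the placeholder."""
    offset = data.find(PLACEHOLDER)
    if offset < 0:
        raise ValueError("no placeholder slot found")
    digest = hashlib.sha256(data).hexdigest().encode("ascii")
    return data[:offset] + digest + data[offset + SLOT_LEN:]


def check(sealed: bytes, offset: int) -> bool:
    """Put the placeholder back at the known offset, rehash, and compare."""
    stored = sealed[offset:offset + SLOT_LEN]
    restored = sealed[:offset] + PLACEHOLDER + sealed[offset + SLOT_LEN:]
    return hashlib.sha256(restored).hexdigest().encode("ascii") == stored


if __name__ == "__main__":
    original = b"header:" + PLACEHOLDER + b":payload"
    sealed = seal(original)
    print(check(sealed, 7))                  # True
    print(check(sealed + b"tampered", 7))    # False
```

The verifier has to know where the slot is, which is the same "ignore these bytes" convention the question hoped to avoid.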

It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).
Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.

How to generate a verification code/number?

I'm working on an application where users have to make a call and type a verification number with the keypad of their phone.
I would like to be able to detect if the number they type is correct or not. The phone system does not have access to a list of valid numbers, but instead, it will validate the number against an algorithm (like a credit card number).
Here are some of the requirements :
It must be difficult to type a valid random code
It must be difficult to have a valid code if I make a typo (transposition of digits, wrong digit)
I must have a reasonable number of possible combinations (let's say 1M)
The code must be as short as possible, to avoid errors from the user
Given these requirements, how would you generate such a number?
EDIT :
@Haaked: The code has to be numerical because the user types it with their phone.
@matt b: On the first step, the code is displayed on a Web page, the second step is to call and type in the code. I don't know the user's phone number.
Followup : I've found several algorithms to check the validity of numbers (See this interesting Google Code project : checkDigits).

After some research, I think I'll go with the ISO 7064 Mod 97,10 formula. It seems pretty solid as it is used to validate IBAN (International Bank Account Number).
The formula is very simple:
Take a number : 123456
Apply the following formula to obtain the 2 digits checksum : mod(98 - mod(number * 100, 97), 97) => 76
Concat number and checksum to obtain the code => 12345676
To validate a code, verify that mod(code, 97) == 1
Test :
mod(12345676, 97) = 1 => GOOD
mod(21345676, 97) = 50 => BAD !
mod(12345678, 97) = 3 => BAD !
Apparently, this algorithm catches most of the errors.
Another interesting option was the Verhoeff algorithm. It has only one verification digit and is more difficult to implement (compared to the simple formula above).
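The Mod 97 steps above translate directly into a few lines of Python. This sketch follows the formula exactly as written in this answer (including the outer mod 97); the function names are hypothetical.

```python
def make_mod97_code(number: int) -> str:
    """Append the two-digit checksum from the formula above to the number."""
    checksum = (98 - (number * 100) % 97) % 97
    return f"{number}{checksum:02d}"   # zero-pad so the checksum is always 2 digits


def is_valid_mod97(code: str) -> bool:
    """A code is valid when the whole number leaves remainder 1 modulo 97."""
    return code.isdigit() and int(code) % 97 == 1


if __name__ == "__main__":
    print(make_mod97_code(123456))       # 12345676
    print(is_valid_mod97("12345676"))    # True
    print(is_valid_mod97("21345676"))    # False (transposed digits)
    print(is_valid_mod97("12345678"))    # False (wrong digit)
```

Because 97 is prime and larger than any single-digit difference, changing exactly one digit always changes the remainder, so every single-digit typo is caught.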

For 1M combinations you'll need 6 digits. To make sure that there aren't any accidentally valid codes, I suggest 9 digits with a 1/1000 chance that a random code works. I'd also suggest using another digit (10 total) to perform an integrity check. As far as distribution patterns, random will suffice and the check digit will ensure that a single error will not result in a correct code.
Edit: Apparently I didn't fully read your request. Using a credit card number, you could perform a hash on it (MD5 or SHA1 or something similar). You then truncate at an appropriate spot (for example 9 characters) and convert to base 10. Then you add the check digit(s) and this should more or less work for your purposes.

You want to segment your code. Part of it should be a 16-bit CRC of the rest of the code.
If all you want is a verification number then just use a sequence number (assuming you have a single point of generation). That way you know you are not getting duplicates.
Then you prefix the sequence with a CRC-16 of that sequence number AND some private key. You can use anything for the private key, as long as you keep it private. Make it something big, at least a GUID, but it could be the text to War and Peace from Project Gutenberg. It just needs to be secret and constant. Having a private key prevents people from being able to forge a key, but using a 16-bit CRC makes it easier to break.
To validate you just split the number into its two parts, and then take a CRC-16 of the sequence number and the private key.
If you want to obscure the sequential portion more, then split the CRC in two parts. Put 3 digits at the front and 2 at the back of the sequence (zero pad so the length of the CRC is consistent).
This method allows you to start with smaller keys too. The first 10 keys will be 6 digits.
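A minimal Python sketch of the prefix scheme described above, using `binascii.crc_hqx` (a CRC-CCITT variant from the standard library) as the 16-bit CRC. The key value, the function names, and the choice of a five-digit zero-padded prefix are assumptions for illustration; the optional split of the CRC into front and back parts is left out.

```python
import binascii

SECRET_KEY = b"hypothetical-secret-key"  # assumption: any long, constant secret


def _crc_digits(sequence: str) -> str:
    """CRC-16 over the sequence digits plus the secret, as 5 zero-padded digits."""
    crc = binascii.crc_hqx(sequence.encode("ascii") + SECRET_KEY, 0)
    return f"{crc:05d}"  # crc is in 0..65535, so 5 digits always suffice


def make_crc_code(sequence_number: int) -> str:
    """Prefix the sequence number with its keyed CRC digits."""
    sequence = str(sequence_number)
    return _crc_digits(sequence) + sequence


def is_valid_crc_code(code: str) -> bool:
    """Split off the 5-digit prefix and recompute it from the remainder."""
    if not code.isdigit() or len(code) < 6:
        return False
    prefix, sequence = code[:5], code[5:]
    return prefix == _crc_digits(sequence)


if __name__ == "__main__":
    code = make_crc_code(7)
    print(code, len(code), is_valid_crc_code(code))
```

With only 65,536 possible prefixes, a random guess validates about once in 65,536 tries, which is the weakness the answer points out.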

Does it have to be only numbers? You could create a random number between 1 and 1M (I'd suggest even higher though) and then Base32 encode it. The next thing you need to do is Hash that value (using a secret salt value) and base32 encode the hash. Then append the two strings together, perhaps separated by the dash.
That way, you can verify the incoming code algorithmically. You just take the left side of the code, hash it using your secret salt, and compare that value to the right side of the code.
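A rough Python sketch of this two-part code. It swaps the "hash with a secret salt" for HMAC-SHA256, which is the standard-library way to key a hash, and truncates the result to 5 bytes so the right-hand side stays short; both choices, the salt value, and the function names are assumptions rather than anything the answer specifies.

```python
import base64
import hashlib
import hmac
import secrets

SECRET_SALT = b"hypothetical-secret-salt"  # assumption: kept private and constant


def _b32(data: bytes) -> str:
    """Base32 text without the trailing '=' padding."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def _tag(left: str) -> str:
    """Keyed hash of the left part, truncated to 5 bytes (8 base32 characters)."""
    mac = hmac.new(SECRET_SALT, left.encode("utf-8"), hashlib.sha256).digest()
    return _b32(mac[:5])


def make_salted_code(number=None) -> str:
    """Build 'LEFT-RIGHT' where RIGHT is the keyed hash of LEFT."""
    if number is None:
        number = secrets.randbelow(1_000_000) + 1   # random value in 1..1M
    left = _b32(number.to_bytes(3, "big"))          # 3 bytes -> 5 base32 chars
    return f"{left}-{_tag(left)}"


def is_valid_salted_code(code: str) -> bool:
    """Recompute the right side from the left side and compare."""
    left, sep, right = code.partition("-")
    if not sep:
        return False
    return hmac.compare_digest(_tag(left).encode("utf-8"), right.encode("utf-8"))


if __name__ == "__main__":
    code = make_salted_code()
    print(code, is_valid_salted_code(code))
```

Base32 output contains letters, so this form would not suit a numeric phone keypad without a further conversion to decimal digits.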

I must have a reasonable number of possible combinations (let's say 1M)
The code must be as short as possible, to avoid errors from the user
Well, if you want it to have at least one million combinations, then you need at least six digits. Is that short enough?

When you are creating the verification code, do you have access to the caller's phone number?
If so I would use the caller's phone number and run it through some sort of hashing function so that you can guarantee that the verification code you gave to the caller in step 1 is the same one that they are entering in step 2 (to make sure they aren't using a friend's validation code or they simply made a very lucky guess).
About the hashing, I'm not sure if it's possible to take a 10 digit number and come out with a hash result that would be < 10 digits (I guess you'd have to live with a certain amount of collision) but I think this would help ensure the user is who they say they are.
Of course this won't work if the phone number used in step 1 is different than the one they are calling from in step 2.

Assuming you already know how to detect which key the user hit, this should be doable reasonably easily. In the security world, there is the notion of a "one time" password. This is sometimes referred to as a "disposable password." Normally these are restricted to the (easily typable) ASCII values: [a-zA-Z0-9] and a bunch of easily typable symbols, like comma, period, semicolon, and parenthesis. In your case, though, you'd probably want to limit the range to [0-9] and possibly include * and #.
I am unable to explain all the technical details of how these one-time codes are generated (or work) adequately. There is some intermediate math behind it, which I'd butcher without first reviewing it myself. Suffice it to say that you use an algorithm to generate a stream of one-time passwords. No matter how many previous codes you know, the subsequent one should be impossible to guess! In your case, you'll simply use each password on the list as the user's random code.
Rather than fail at explaining the details of the implementation myself, I'll direct you to a 9 page article where you can read up on it yourself: https://www.grc.com/ppp.htm

It sounds like you have the unspoken requirement that it must be quickly determined, via algorithm, that the code is valid. This would rule out you simply handing out a list of one time pad numbers.
There are several ways people have done this in the past.
Make a public key and private key. Encode the numbers 0-999,999 using the private key, and hand out the results. You'll need to throw in some random numbers to make the result come out to the longer version, and you'll have to convert the result from base 64 to base 10. When you get a number entered, convert it back to base64, apply the private key, and see if the interesting numbers are under 1,000,000 (discard the random numbers).
Use a reversible hash function
Use the first million numbers from a PRNG seeded at a specific value. The "checking" function can get the seed, and know that the next million values are good. It can either generate them each time and check one by one when a code is received, or on program startup store them all in a table, sorted, and then use binary search (a maximum of about 20 compares), since one million integers is not a whole lot of space.
There are a bunch of other options, but these are common and easy to implement.
-Adam

You linked to the check digits project, and using the "encode" function seems like a good solution. It says:
encode may throw an exception if 'bad' data (e.g. non-numeric) is passed to it, while verify only returns true or false. The idea here is that encode normally gets its data from 'trusted' internal sources (a database key for instance), so it should be pretty unusual, in fact, exceptional that bad data is being passed in.
So it sounds like you could pass the encode function a database key (5 digits, for instance) and you could get a number out that would meet your requirements.
