hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash - algorithm

I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table whose primary key is an auto-increment column (maximum value 2^32), and I have to show the client an invoice number of a decent length (I'm thinking 12 characters). Since the client shouldn't see an id like 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that the mapping must never be many-to-one (two ids must never produce the same output), and it should not be slow.
Would it be okay if I used a common hashing mechanism and then dropped the extra characters (md5 in PHP, for example, generates a 32-character string)?
As I understand it, this does not need to be cryptographically secure, so I could generate a custom hash if necessary.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function

Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique hash. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..2^32 - 1] need 32 bits (duh!). You have 12 printable characters, which give you 72 bits if you stay within the Base-64 range of characters. You don't even need that many characters - you can use three bits per character for the first eight characters, and two bits per character for the last four. This way the first eight characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits - you could use letters for some groups of digits, to give the result a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with the least significant bits when you generate these representations, then proceed to the most significant ones, or make an arbitrary (but fixed) map of which bits go to which digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.
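Here is a minimal Python sketch of that scheme, under the assumptions above (3 bits per character for the first eight characters, 2 bits for the last four). The bit permutation PERM is an arbitrary fixed map chosen purely for illustration; any fixed permutation of the 32 bit positions works, as long as encoding and decoding share it.

    # Arbitrary (but fixed) assignment of id bits to output bits; illustrative only.
    PERM = [17, 3, 28, 9, 22, 0, 31, 14, 5, 26, 11, 19, 7, 30, 2, 24,
            13, 21, 8, 29, 1, 16, 27, 6, 23, 10, 18, 4, 25, 12, 20, 15]
    WIDTHS = [3] * 8 + [2] * 4          # eight '0'..'7' characters, four '0'..'3'

    def encode(n):
        bits = [(n >> PERM[i]) & 1 for i in range(32)]
        chars, pos = [], 0
        for width in WIDTHS:
            value = 0
            for _ in range(width):
                value = (value << 1) | bits[pos]
                pos += 1
            chars.append(str(value))
        return ''.join(chars)

    def decode(s):
        bits = []
        for ch, width in zip(s, WIDTHS):
            value = int(ch)
            bits.extend((value >> (width - 1 - k)) & 1 for k in range(width))
        n = 0
        for i, b in enumerate(bits):
            n |= b << PERM[i]
        return n

    # round trip: every 32-bit id maps to a unique 12-character string and back
    assert all(decode(encode(x)) == x for x in (0, 1, 12345, 2**32 - 1))

Unlike a truncated md5, this mapping is one-to-one by construction.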

Related

How to build unique file names based on their content?

I plan to build a unique file name based on its content. For example, by its SHA256 hash. Files with the same content must have the same name.
The easiest way is to convert the hash to a hex string. The file name would then be 32 bytes * 2 = 64 characters long. That is a pretty long name to work with. How can I make it shorter?
I implemented a sort of "Base32" coding - an alphabet string that includes digits and 22 letters. I use only five bits of every byte to build a file name of 32 characters. Much better.
I am looking for a balance between file name length and low collision probability. If the number of files is expected to be less than 500K, how long should the filename be? 8? 16? 24? 32?
Is there any recommended method to build short unique filenames at all?
If you use an N-bit cryptographic hash on M files, you can estimate the probability of at least one collision to be M^2 / 2^(N+1).
For 500K files, that's about 1 / 2^(N-37).
Using base32, 16 chars gives a collision probability of 1 / 2^43 -- a few trillion to 1 odds.
If that won't do, then 24 chars gives 1 / 2^83.
If you're willing to check them all and re-generate on collision, then 8 chars is fine.
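As a rough illustration of those numbers, here is a Python sketch using only the standard library (the 16-character length is just the example above, not a recommendation):

    import base64
    import hashlib

    def short_name(data, chars=16):
        # SHA-256, base32-encoded, lowercased for case-insensitive filesystems
        digest = hashlib.sha256(data).digest()
        return base64.b32encode(digest).decode('ascii').lower()[:chars]

    def collision_probability(num_files, chars=16):
        bits = 5 * chars                      # base32 carries 5 bits per character
        return num_files ** 2 / 2 ** (bits + 1)

    print(short_name(b"some file content"))   # 16 lowercase base32 characters
    print(collision_probability(500_000))     # ~1e-13, i.e. roughly 1 in 2^43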
The number of collisions depends on the content of the files, the hash algorithm, and the length of the hash.
In general, the longer the hash value, the less likely collisions are (provided your content does not especially provoke them).
You cannot avoid the possibility of collisions unless you use the content as file-name (or a lossless compression of it).
To shorten the file names you could allow more distinct characters in them. (But be aware of which characters your OS allows and which you are willing to use.)
I would go for a kind of base32 encoding to avoid problems with filesystems that do not distinguish between upper- and lower-case characters.

Encode an array of integers to a short string

Problem:
I want to compress an array of non-negative integers of non-fixed length (but typically 300 to 400 elements), containing mostly 0's, some 1's, and a few 2's. Although unlikely, bigger numbers are also possible.
For example, here is an array of 360 elements:
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.
Goal:
The goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y
What I've tried:
I tried to use "delta encoding", using zero-length runs to denote any value higher than 1. For example, {0,0,1,0,0,0,2,0,1} would be encoded as: 2,3,0,1. To decode it, one would read from left to right and write down "2 zeroes, a one, 3 zeroes, a one, 0 zeroes, a one (this adds to the previous one, making it a two), 1 zero, a one".
To eliminate the need for delimiters (commas) and thus save more space, I tried to use a single alphanumeric character to denote run values of 0 to 34 (using 0 to y), leaving the letter z to mean "35 plus the value of the next character". I think this is called "variable bit" or something like that. For example, if there are 40 zeroes in a row, I'd encode it as "z5".
That's as far as I got... the resulting string is still very long (about 20 characters for the above example). I would ideally want something like 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes, then alternating between that and the non-zero values. (A zero run-length of 0 indicates successive non-zero values.)
Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using a smaller number of bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.
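To make those two steps concrete, here is a small Python sketch: run-length encode the zero runs, then pack the resulting integers with Elias gamma codes, one of the simplest universal codes (values are shifted by 1 because gamma codes start at 1):

    def rle_zero_runs(values):
        # alternating (zero-run length, non-zero value) pairs, flattened
        out, run = [], 0
        for v in values:
            if v == 0:
                run += 1
            else:
                out.extend([run, v])
                run = 0
        out.append(run)                       # trailing run of zeroes
        return out

    def elias_gamma(n):
        # Elias gamma code of n >= 1 as a string of '0'/'1' bits
        b = bin(n)[2:]
        return '0' * (len(b) - 1) + b

    def compress(values):
        return ''.join(elias_gamma(n + 1) for n in rle_zero_runs(values))

    bits = compress([0, 0, 1, 0, 0, 0, 2, 0, 1])
    print(bits, len(bits), 'bits')            # 21 bits for the 9-element example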
You may also want to look into how JPEG-style encoding works. After DCT and quantization, the JPEG entropy encoding problem seems similar to yours.
Finally, if you want to go for maximum compression, you might want to look up arithmetic encoding, which can compress your data arbitrarily close to the statistical minimum entropy.
The above links explain how to compress to a stream of raw bits. In order to convert them to a string of letters and numbers, you will need to add another encoding step, which converts the raw bits to such a string. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".
Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best.
Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike the kind you expect, the result may be an expansion, rather than a compression. You can limit this expansion by providing an alternative, uncompressed format, to be used in such cases...
In your data you have:
14 1s (3.89% of data)
4 2s (1.11%)
1 each of 3, 4 and 5 (0.28% each)
339 0s (94.17%)
Assuming that your numbers are independent of each other and you do not have any other information, the total entropy of your data is 0.407 bits per number, that is 146.42 bits overall (about 18.3 bytes). So it is impossible to encode it in 8 bytes, let alone 8 alphanumeric characters.
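For reference, that entropy figure can be reproduced in a couple of lines (a sketch; the counts are the ones listed above):

    from math import log2

    counts = {0: 339, 1: 14, 2: 4, 3: 1, 4: 1, 5: 1}
    total = sum(counts.values())                                   # 360 numbers
    entropy = -sum(c / total * log2(c / total) for c in counts.values())
    print(entropy, entropy * total)   # ~0.407 bits per number, ~146.4 bits total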

Need a maths algorithm to encode a big-integer number as a smaller integer

I want to convert a 100-digit number into fewer than 10 digits and vice versa.
I would pass the encoded number to a mobile user, and on getting it back, reconstruct the 100-digit number again.
I want to use it in PHP, .NET or JS.
But before that I need an algorithm for it.
I have some ideas in mind using simple divide/subtract and add/multiply operations, but I need something more secure than that.
What you're asking for is impossible. You are trying to pigeonhole 10^100 items into 10^10 boxes. Some box will get more than one item and so it's impossible to invert back to "the" original item.
You could encode the 100-digit base-10 numbers as a 56-digit base-62 number (use uppercase and lowercase Roman alphabet and digits 0-9). The math here is 100 * log(10) / log(62).
To encode using less than ten characters from some alphabet, you need an alphabet with ~2^34 symbols. The math here is 100 * log(10) / log(number of symbols). Good luck with that.
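A sketch of the base-62 idea in Python (arbitrary-precision integers make this easy; the alphabet below is one common choice):

    ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    def to_base62(n):
        if n == 0:
            return ALPHABET[0]
        digits = []
        while n:
            n, r = divmod(n, 62)
            digits.append(ALPHABET[r])
        return ''.join(reversed(digits))

    def from_base62(s):
        n = 0
        for ch in s:
            n = n * 62 + ALPHABET.index(ch)
        return n

    n = 10 ** 100 - 1                         # the largest 100-digit number
    print(len(to_base62(n)))                  # 56 characters, as computed above
    assert from_base62(to_base62(n)) == n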
If you have more than 10 000 000 000 different possible values in the 100 digit number you can not possibly map that to a 10 digit number and reliably map back to the original number.
A 100-digit number - I assume this is a base-ten number. When talking about numbers on computers, talk of 'digits' is almost meaningless.
If you actually mean a 100-bit integer, then it won't fit into a single 64-bit integer (range roughly +/- 9,223,372,036,854,775,808), and you have not phrased your question all that well. And no amount of compression or encoding will let you represent 100 bits using no more than 10 bits.
If you mean 100 figures in base ten, then you are dealing with bignums, so you should probably just treat them as bytes and use a bignum library.
100 base-ten figures still fit in fewer than 512 bits (about 333, since 100 * log2(10) is roughly 332).
Assuming that the 100-digit number is base 10, then if my math is not wrong you'll need 10 base 100 digits to represent the same number. So instead of using just characters from 0-9, you'll need to expand the characters to include other glyphs, including upper-case and lower-case letters, etc., to complete a 100 character alphabet. OK, my math is wrong, so disregard this, but consider the next paragraph.
Another thought is to use a hashing algorithm to derive a 10-byte hash from your 100-digit number and use that as a key in a server-side database (hash table). No encoding/decoding: just send the key to the mobile client, and the mobile client uses the key to fetch the 100-digit number from the server.
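A rough sketch of that lookup idea in Python (the 10-byte key length follows the suggestion above; the "database" here is just a dict for illustration):

    import hashlib

    store = {}

    def make_key(big_number):
        digest = hashlib.sha256(str(big_number).encode()).digest()[:10]
        key = digest.hex()                    # 20 hex characters sent to the client
        store[key] = big_number               # server-side mapping: key -> full number
        return key

    def lookup(key):
        return store[key]

    key = make_key(10 ** 99 + 12345)
    assert lookup(key) == 10 ** 99 + 12345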

Algorithms: random unique string

I need to generate string that meets the following requirements:
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digital characters) should be upper case.
I will store them in a database after generation (they will be assigned to other entities).
My intention is to do something like this:
Generate 2 random values from 0 to 9—they will be used for digits in the string;
generate 6 random values from 0 to 25 and add them to 64—they will be used as 6 symbols;
concatenate everything into one string;
check if the string already exists in the database; if it does, repeat.
My concern with regard to that algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the database).
Question: could you please give advice on how to improve this algorithm to be more deterministic?
Thanks.
it should be unique string;
string length should be 8 characters;
it should contains 2 digits;
all symbols (non-digital characters) - should be upper case.
Assuming:
requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
the "symbols" in requirement #4 are the 26 capital letters A through Z
you would like an evenly-distributed random string
Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:
Generate two different integers between 0 and 7, and sort them.
Generate 2 random numbers from 0 to 9.
Generate 6 random letters from A to Z.
Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
Put the 6 random letters in the remaining positions.
There are 28 possibilities for the two different integers ((8*8 - 8 duplicates) / 2 orderings), 26^6 possibilities for the letters, and 100 possibilities for the numbers, the total number of valid combinations being Ncomb = 864,964,172,800, roughly 8.65 x 10^11.
edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and Nmax <= Ncomb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)
This is possible with Feistel networks, which are commonly used in hash functions and symmetric cryptography (DES, for example, is built on one). You'd probably want to choose Nmax = 2^39, which is the largest power of 2 <= Ncomb, and use a 39-bit Feistel network with a constant key you keep secret. You then plug your counter into the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:
Repeat the following step 6 times:
Take X mod 26, generate a capital letter, and set X = X / 26.
Take X mod 100 to generate your two digits, and set X = X / 100.
X will now be between 0 and 17 inclusive (2^39 / 26^6 / 100 = 17.796...). Map this number to two unique digit positions (probably easiest using a lookup table, since we're only talking 28 possibilities. If you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number).
Follow the random approach above, but use the numbers generated by this algorithm instead.
Alternatively, use 40-bit numbers, and if the output of your Feistel network is > Ncomb, then increment the counter and try again. This covers the entire string space at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
But this isn't something to get into unless you know what you're doing.
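For what it's worth, here is a toy Python sketch of the idea. It is not the exact construction described above: it uses a 40-bit balanced Feistel network (two 20-bit halves) with a throwaway SHA-256-based round function, and instead of incrementing the counter when a value falls outside the valid range it re-applies the permutation until it lands inside ("cycle walking"). The key and round count are placeholders, so treat this as an illustration only.

    import hashlib

    KEY = b"keep-this-secret"                 # placeholder key
    HALF_BITS = 20
    MASK = (1 << HALF_BITS) - 1
    NCOMB = 28 * 26 ** 6 * 100                # 864,964,172,800 valid strings

    def round_function(half, round_no):
        data = KEY + bytes([round_no]) + half.to_bytes(3, "big")
        return int.from_bytes(hashlib.sha256(data).digest()[:3], "big") & MASK

    def feistel(x, rounds=4):
        # a permutation of the 40-bit integers
        left, right = x >> HALF_BITS, x & MASK
        for r in range(rounds):
            left, right = right, left ^ round_function(right, r)
        return (left << HALF_BITS) | right

    def to_string(x):
        # unpack x < NCOMB into 6 letters, 2 digits and one of 28 position pairs
        letters = []
        for _ in range(6):
            letters.append(chr(ord('A') + x % 26))
            x //= 26
        digits = f"{x % 100:02d}"
        x //= 100
        pos = [(a, b) for a in range(8) for b in range(a + 1, 8)][x]   # 28 pairs
        out, letter = [], iter(letters)
        for i in range(8):
            out.append(digits[0] if i == pos[0]
                       else digits[1] if i == pos[1]
                       else next(letter))
        return ''.join(out)

    def code_for(counter):
        # bijection from counters in [0, NCOMB) to 8-character codes
        x = feistel(counter)
        while x >= NCOMB:                     # cycle walking back into range
            x = feistel(x)
        return to_string(x)

    print(code_for(1), code_for(2), code_for(3))   # three distinct codes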
Are these user passwords? If so, there are a couple of things you need to take into account:
You must avoid 0/O and I/1, which can easily be mistaken for each other.
You must avoid too many consecutive letters, which might spell out a rude word.
As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).
If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following (here sketched in Python):

    import itertools
    from string import ascii_uppercase

    # walk through 'AAAAAA'..'ZZZZZZ' and append '00'..'99' to each
    for letters in itertools.product(ascii_uppercase, repeat=6):
        for number in range(100):
            code = ''.join(letters) + f"{number:02d}"
This will create unique strings eight characters long, with two digits and six upper-case letters.
If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.
I think you're safe well into your tens of thousands of such ID's, and even after that you're most likely alright.
Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.
I'm willing to bet money though that you'll never see the extra functionality kick in during your life time :)
Do it the other way around: generate one big random number that you will split up to obtain the individual characters:
    long bigRandom = ...;
    int firstDigit = bigRandom % 10;
    int secondDigit = ( bigRandom / 10 ) % 10;
and so on.
Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.
However, when you try to insert a new value and it's already in the database, you can easily find the smallest unallocated number greater than the originally generated number, and use that instead of the one you generated.
What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
For one thing, your list of requirements doesn't state that the string has to be random, so you might consider something like a database index.
If 'random' is a requirement, you can do a few improvements.
Store the string as a number in the database. Not sure how much this improves performance.
Do not store used strings at all. You can employ the 'index' approach above, but convert the integer to a string in a seemingly random fashion (e.g., employing a bit shift). Without much research, nobody will notice the pattern.
E.g., if we have sequence 1, 2, 3, 4, ... and use cyclic binary shift right by 1 bit, it'll be turned into 4, 1, 5, 2, ... (assuming we have 3 bits only)
It doesn't have to be a shift too, it can be a permutation or any other 'randomization'.
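A tiny Python sketch of that cyclic-shift example (3-bit width, matching the sequence above):

    BITS = 3

    def rotate_right(n, bits=BITS):
        # cyclic binary shift right by 1 bit within a fixed width
        return ((n >> 1) | ((n & 1) << (bits - 1))) & ((1 << bits) - 1)

    print([rotate_right(i) for i in [1, 2, 3, 4]])   # [4, 1, 5, 2]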
The problem with your approach is that while you have few records you are very unlikely to get collisions, but as your number of records grows the chance will increase until it becomes more likely than not that you'll get a collision. Eventually you will be hitting multiple collisions before you get a 'valid' result. Every attempt will require a table scan to determine whether the code is valid, and the whole thing turns into a mess.
The simplest solution is to precalculate your codes.
Start with the first code 00AAAA, and increment to generate 00AAAB, 00AAAC ... 99ZZZZ. Insert them into a table in random order. When you need a new code, retrieve the top unused record from the table (then mark it as used). It's not a huge table, as pointed out above - only a few million records.
You don't need to calculate any random numbers and generate strings for each user (already done)
You don't need to check whether anything has already been used, just get the next available
No chance of getting multiple collisions before finding something usable.
If you ever need more 'codes', just generate some more 'random' strings and append them to the table.

Are fragments of hashes collision-resistant?

If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?
Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing gets comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.
It is 256, not 255.
Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.
Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).
Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.
MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (american) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you shall use SHA-2 by default.
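A quick empirical check of that birthday effect on a 32-bit truncation (a Python sketch; hashing 200,000 short inputs will almost certainly produce a repeated 4-byte prefix):

    import hashlib

    seen = {}
    for i in range(200_000):
        prefix = hashlib.md5(str(i).encode()).digest()[:4]   # first 4 bytes only
        if prefix in seen:
            print("truncated-hash collision between", seen[prefix], "and", i)
            break
        seen[prefix] = i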
Assume we have a pre-determined message1. hash1 = md5(message1)
Now choose a message2 randomly, and set hash2 = md5(message2).
In theory there is a 1 in 256^4 chance that the first four bytes of hash2 match the first four bytes of the pre-determined hash1.
It is also supposed to be very hard for an attacker who knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even for the full MD5, there are pre-image attacks better than brute force, although they remain theoretical.
MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.
If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the chance of collisions so that in practice they should never occur.
If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:
fdadda221fd71619e6c0139730b012577dd4de90
fdadda221fd71619e6c/0139730b012577dd4de90
fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90
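Splitting a digest into directory levels like that is a one-liner (Python sketch, 4-character groups as in the last example):

    digest = "fdadda221fd71619e6c0139730b012577dd4de90"
    path = '/'.join(digest[i:i + 4] for i in range(0, len(digest), 4))
    print(path)   # fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90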
Depends on the purpose of the hash.
Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.

Resources