How to avoid CRC16 collisions?

I am experiencing a pretty annoying issue using a CRC16 hash to manage some of my data.
In my application, I pass some information in a URL parameter: a large encoded context that allows users to recover their old searches. Within that context, there are some elements I hash to make sure they don't take up too many characters.
It seems that some distinct elements return the same hash (CRC16 algorithm).
I take the hash and transform it into a string: crc.ToString("X4");
For example, two different elements both give me: 5A8E.
I tried using CRC32 instead, but then the old contexts are no longer recognized.
Do you have any idea how I can find a solution to this? Thanks a lot.

Even if CRC16 was an ideal hash function (which it's not), with just 16 bits, the Birthday Paradox means that there's around a 50% chance of a hash collision in a set of just 2^8 = 256 items. You almost certainly need more bits.
You can't keep the old hashes working and make them distinguish existing collisions -- that's a contradiction. But you can implement a new, better hashing scheme, add a flag to the URL parameters to indicate that you're using this new scheme, make sure that all your pages generate only these new-style URLs, and "grandfather in" the old-style URLs (which will continue to produce the same collisions as before). I'd suggest giving users a big, bright message to update their bookmarks, and auto-redirecting the page, whenever you get an old-style URL.
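For intuition, here is a back-of-the-envelope check of that birthday-paradox claim (a throwaway Java snippet, just for illustration):

// Exact birthday-paradox computation: probability that at least two
// of n random values collide in a space of the given size.
static double collisionProbability(int n, double spaceSize) {
    double allDistinct = 1.0;
    for (int i = 1; i < n; i++) {
        allDistinct *= (spaceSize - i) / spaceSize;
    }
    return 1.0 - allDistinct;
}
// collisionProbability(256, 65536.0) is about 0.39;
// it crosses 0.50 near n = 300.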

To explain the other solution I thought about and am now implementing:
I prepare both a CRC32 and a CRC16 hash for my elements. I use the CRC32 for the new URLs I now build, but keep the CRC16 hash as a fallback for old URLs.
So when I try to compare a hash, I start with the new hash, and if I can't find any element, I go to my fallback and compare against the CRC16 hash.
This lets me handle every case.
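A minimal sketch of that fallback lookup (Java here just for illustration; the two maps, keyed by CRC32 and CRC16 hex strings respectively, are assumed to be built ahead of time):

import java.util.Map;

// Hypothetical helper: try the new CRC32-based hash first, then fall
// back to the legacy CRC16 hash for URLs generated before the change.
static String findByHash(String hashFromUrl,
                         Map<String, String> elementsByCrc32,
                         Map<String, String> elementsByCrc16) {
    String hit = elementsByCrc32.get(hashFromUrl); // new scheme first
    if (hit != null) {
        return hit;
    }
    return elementsByCrc16.get(hashFromUrl);       // legacy CRC16 fallback
}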

What's the most efficient way of combining switch/if statements

This question doesn't address any programming language in particular but of course I'm happy to hear some examples.
Imagine a big number of files, let's say 5000, that have all kinds of letters and numbers in their names. Then there is a method that receives a user input that acts as an alias in order to display a file. Without having the files sorted into folders, the method(s) need to return the file name that is associated with the alias the user provided.
So let's say user input "gd322" stands for the file named "k4e23", the method would look like
if (input.equals("gd322")) {
    return "k4e23";
}
Now, imagine having 4 values in that method:
switch (input) {
    case "gd322": return "fw332";
    case "g344d": return "5g4gh";
    case "s3red": return "536fg";
    case "h563d": return "h425d";
} // switch on a string; returning from each case makes breaks unnecessary
Keeping in mind we have 5000 entries, there are probably more than just 2 entries starting with g. Now, if the user input starts with 's', instead of wasting CPU cycles checking all the a's, b's, c's, ..., we could also make another switch for this, which then directs to the 'next' methods like this:
switch (input.charAt(0)) { // dispatch on the first character
    case 'a': return switchA(input);
    case 'b': return switchB(input);
    // [...]
    case 'g': return switchG(input);
    case 's': return switchS(input);
}
So the CPU doesn't have to check on all of them, but rather calls a method like this:
String switchG(String input) {
    switch (input) {
        case "gd322": return "fw332";
        case "g344d": return "5g4gh";
        // [...]
    }
    return null; // no match
}
Is there any field of computer science dealing with this? I don't know what to call it and therefore don't know how to search for it, but I think my thoughts make sense at a large scale. Please move the thread if it doesn't belong here, but I'd really like to see your thoughts on this.
EDIT: Don't quote me on that "5000"; I am not in the situation described above and wanted to discuss this purely theoretically. It could also be 3 entries or 300,000, maybe even fewer or more.
If you have 5000 options, you're probably better off hashing them than having hard-coded if/switch statements. In C++ you could also use std::map to pair a function pointer or other option-handling information with each possible option.
Interesting, but I don't think you can give a generic answer. It all depends on how the code is executed. Many compilers will have all kinds of optimizations, in the if and switch, but also in the way strings are compared.
That said, if you have actual (disk) files with those lists, then reading the file will probably take much longer than processing it, since disk I/O is very slow compared to memory access and CPU processing.
And if you have a list like that, you may want to build a hash table, or simply a sorted list/array in which you can perform a binary search. Sorting it also takes time, but if you have to do many lookups in the same list, it may be well worth the time.
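For illustration, here is the hash-table version in Java (the entries are the made-up aliases from the question):

import java.util.HashMap;
import java.util.Map;

public class AliasLookup {
    public static void main(String[] args) {
        // Build the alias -> filename table once; afterwards every
        // lookup is O(1) on average, no matter how many entries exist.
        Map<String, String> files = new HashMap<>();
        files.put("gd322", "fw332");
        files.put("g344d", "5g4gh");
        files.put("s3red", "536fg");
        files.put("h563d", "h425d");
        // ... up to 5000 entries

        System.out.println(files.get("s3red")); // prints 536fg, no scanning
    }
}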
Is there any field of computer science dealing with this?
Yes, the science of efficient data structures. Well, isn't that what CS is all about? :-)
The algorithm you described resembles a trie. It wouldn't be statically encoded in the source code with switch statements, but would use dynamic lookups in a structure loaded from somewhere and stuff, but the idea is the same.
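A minimal sketch of such a structure (Java; the names and layout are illustrative, not a production implementation):

import java.util.HashMap;
import java.util.Map;

// Each character of the key selects the next child node, so a lookup
// costs O(key length) regardless of how many entries are stored.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    String value; // non-null only on nodes that terminate a stored key

    static void put(TrieNode root, String key, String value) {
        TrieNode node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.value = value;
    }

    static String get(TrieNode root, String key) {
        TrieNode node = root;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null; // no stored key starts this way
            }
        }
        return node.value;
    }
}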
Yes, the problem has been known, and solved, for decades: hash functions.
Basically you have a set of values (here strings like "gd322", "g344d") and you want to know if some other value v is among them.
The idea is to put the strings in a big array, at an index calculated from their values by some function. Given a value v, you compute an index the same way and check whether the value v is there or not. Much faster than checking the whole array.
Of course there is a problem with different values falling into the same place: collisions. Some magic is needed then: perfect hash functions, whose coefficients are tweaked so that the values from the initial set don't cause any collisions.

Chicken/Egg problem: Hash of file (including hash) inside file! Possible?

Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.
I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.
I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.
I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.
This is not what I want.
I want to know whether there is an algorithm where it's possible to get a resulting hash from data where the very result of the hash itself is included.
It doesn't need to be cryptographic or fulfill a lot of criteria. It can also be based on some heuristic that, after a realistic amount of time, delivers the desired result.
I am really not that into mathematics, but couldn't there be some really advanced exponential-modulo-polynomial cyclic back-reference division stuff that makes this possible?
And if not, what's the proof against it (if there is one)?
The reason I need this is that I ultimately want to store a hash along with MP4 files. It's complicated, but other solutions are not easy to implement, as the file walks through a badly designed production pipeline...
It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.
It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.
(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2^(-8*byte_offset) in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)
Note: as per #KeithS's comments this solution is not to prevent against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it -- the embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself -- to detect inadvertent mismatch between an embedded system and the file used to program it. (In later systems I've just used UUIDs.)
Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.
For example, let
hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.
Let
Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)
Then
hash(Z) == hash(X), and the last 32 bits of Z encode hash(X) (divide them by 65521 to read it back).
The idea here is just that any 32-bit integer congruent to 0 modulo 65521 has no effect on the hash of X. Since 65521 < 2^16, hash has a range of less than 2^16, and there are at least 2^16 values below 2^32 that are congruent to 0 modulo 65521, so we can encode the hash into a 32-bit block that does not affect the hash. You could actually use any number less than 2^16; 65521 just happens to be the largest such prime.
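A toy demonstration of this construction (Java; the block values are arbitrary):

public class SelfEmbeddingHash {
    static final long M = 65521; // largest prime below 2^16

    // The hash from above: sum of 32-bit blocks modulo 65521.
    static long hash(long[] blocks) {
        long sum = 0;
        for (long b : blocks) {
            sum = (sum + b) % M;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] x = {0xDEADBEEFL, 0x12345678L, 0x0BADF00DL};
        long h = hash(x);

        // Append h * 65521: it still fits in 32 bits (h < 65521) and is
        // congruent to 0 mod 65521, so it leaves the hash unchanged.
        long[] z = {x[0], x[1], x[2], h * M};

        System.out.println(hash(z) == h);        // true
        System.out.println(z[z.length - 1] / M); // prints h
    }
}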
I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Although in theory you could create such a file for any kind of hash function (given enough time or the right algorithm), an attacker would be able to use exactly the same approach. Even more, he would have a choice: use exactly your approach to obtain such a file, or just get rid of the check.
It means that now you have two problems instead of one, and both can be attacked with the same complexity. It's up to you to decide if it's worth it.
EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.
No, not possible. You either use a separate file for hashes à la md5sum, or the embedded hash covers only the "data" portion of the file.
The way the Nix package manager does this: when calculating the hash, you pretend the contents of the hash field in the file are some fixed value, like 20 x's, rather than the actual hash of the file. Then you write the real hash over those 20 x's. When you check the hash, you read the stored value, then ignore it and hash the file as if the field still held the fixed value of 20 x's.
They do this because the path at which a package is installed depends on the hash of the whole package. Since the hash has a fixed length, they set the field to a fixed value and then replace it with the real hash; when verifying, they ignore the stored value and pretend it's the fixed value.
But if you don't use such a method, it is impossible.
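A sketch of that placeholder technique (Java; SHA-256, the 20-character field, and the helper name are illustrative choices):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class PlaceholderHash {
    // The hash field is hashed as if it contained this fixed dummy
    // value, both when the hash is first computed and when it is
    // later verified.
    static final byte[] DUMMY =
            "xxxxxxxxxxxxxxxxxxxx".getBytes(StandardCharsets.US_ASCII);

    static byte[] hashWithPlaceholder(byte[] file, int hashFieldOffset)
            throws Exception {
        byte[] copy = Arrays.copyOf(file, file.length);
        System.arraycopy(DUMMY, 0, copy, hashFieldOffset, DUMMY.length);
        return MessageDigest.getInstance("SHA-256").digest(copy);
    }
}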
It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).
Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.

Is it possible to find out which hash algorithm was used in these strings?

I don't want to reverse it. I just want to be sure what hash algorithm was used on these strings (I'm not sure if it's md5):
d27918bcc2a8562dc4549c2c00111e66
889f071e04755db26579a19f4303654e
47a21a13ee822c1450155bd0033b0f1d
Is there a way to do it?
One of the sources for the strings above is certainly: '9915757678'
They're each 32 characters, so 128 bits. So it could be MD5.
However, there is no way to tell. Any hash function worth its salt will spread the hash values evenly throughout the entire output space, so if you have just a bunch of outputs, there's no way to tell hash functions apart.
Unless you can make some reasonable guesses about the input, and do some brute-forcing, of course.
It fits the MD5() hash form (length-wise), but it could just as well be a SHA1 hash stored in a CHAR(32) field. As others have said, there is no way to tell unless you have an example input value. Then you could use a tool like this:
http://www.insidepro.com/hashes.php
to generate hashes using several different algorithms and see if any of them fits.
You're even more out of luck if salt was added before hashing.
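A sketch of that brute-force check (Java; it covers only the digests that ship with the JDK, so MD4/NTLM would need a third-party provider):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GuessAlgorithm {
    static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String input = "9915757678"; // the known source string
        for (String algo : new String[]{"MD5", "SHA-1", "SHA-256"}) {
            byte[] digest = MessageDigest.getInstance(algo)
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            // Compare each line against the 32-character strings above
            System.out.println(algo + ": " + hex(digest));
        }
    }
}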
No certain way, but this looks like MD5.
Based on the size, these could be NTLM, MD4, or MD5.
I know I'm late here, but I'm posting this since I didn't see this possible answer mentioned.

Are fragments of hashes collision-resistent?

If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?
Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing gets comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.
It is 256, not 255.
Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.
Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).
Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.
MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (american) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you shall use SHA-2 by default.
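For concreteness, truncating a digest looks like this (Java; SHA-256 is used per the recommendation above, and the four-byte length matches the question):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Keep only the first `bytes` bytes of the digest. With 4 bytes
// (32 bits) of output, expect a collision after roughly 2^16 random
// inputs, per the birthday paradox above.
static byte[] truncatedDigest(String input, int bytes) throws Exception {
    byte[] full = MessageDigest.getInstance("SHA-256")
            .digest(input.getBytes(StandardCharsets.UTF_8));
    return Arrays.copyOf(full, bytes);
}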
Assume we have a pre-determined message1. hash1 = md5(message1)
Now choose a message2 randomly, and set hash2 = md5(message2).
In theory there is a 1 in 256^4 chance that the first four bytes of hash2 match the first four bytes of the pre-determined hash1.
It is also supposed to be very hard for an attacker that knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even with the full MD5, there are better than theoretical pre-image attacks.
MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.
If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the chance of collisions so that in practice they should never occur.
If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:
fdadda221fd71619e6c0139730b012577dd4de90
fdadda221fd71619e6c/0139730b012577dd4de90
fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90
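A one-method sketch of that split (Java; the prefix length is arbitrary):

// Insert a '/' after the first prefixLen characters of the digest,
// e.g. splitPath("fdadda221fd71619e6c0139730b012577dd4de90", 4)
//      -> "fdad/da221fd71619e6c0139730b012577dd4de90"
static String splitPath(String digest, int prefixLen) {
    return digest.substring(0, prefixLen) + "/" + digest.substring(prefixLen);
}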
Depends on the purpose of the hash.
Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.

Generating unique N-valued key

I want to generate a unique, random, N-character key.
This key can contain numbers and Latin letters, i.e. A-Za-z0-9.
The only solution I am thinking about is something like this (pseudocode):
key = "";
smb = "ABC…abc…0123456789"; // allowed symbols
for (i = 0; i < N; i++) {
    key += smb[rnd(0, smb.length() - 1)]; // select symbol at random position
}
Is there any better solution? What can you suggest?
I would look into GUIDs. From the Wikipedia entry, "the primary purpose of the GUID is to have a totally unique number," which sounds exactly like what you are looking for. There are several implementations out there that generate GUIDs, so it's likely you will not have to reinvent the wheel.
Keep in mind that the whole field of cryptography relies on, amongst other things, generating random numbers. The NSA, the CIA, and some of the best mathematicians in the world are working on this, so I guarantee you that there are better ideas.
Me? I'd just do what fbrereto suggests, and just get a guid. Or look into cryptographic key generators, or y'know, some lava lamps and a camera.
Oh, and as to the code you have; depending on the language, you may need to seed the RNG, or it'll generate the same key every time.
Whatever you do, if you wind up generating a key that uses all numbers and all letters, and if a person is ever going to see that key (which is likely if you are using numbers and letters), omit the characters l, I, 1, O, and 0. People get them confused.
Nothing in your post addresses the question of uniqueness. You're going to have to have some way of not generating the same key twice. Usually, when I need a unique key, I have some unique information to start with. I usually take a one-way hash like MD5, then there are ways to convert that to a key with varying degrees of readability:
Convert to hex
Base64 encode it
Use bits of the key to index into a list of words.
Example: the unique string computed by hashing the part of this answer above the horizontal line is
abduction's brogue's melted bragger's
You could do a base64 encoding of some random data and remove the +, /, and = characters from the result? I don't know if this would give a predictable distribution. Also, it seems like more work than what you're doing now, which is a fine solution.
Assuming you're using a language/library without an utterly pathetic random number generator, what you've got looks pretty good. N symbols uniformly distributed over a reasonable alphabet works for me, and no amount of applying fancier code is likely to make it more random (just slower).
(For the record, pathetic would include ditching the high-order bits of the underlying random numbers when choosing a value from the given range. While ideally all RNGs would make every bit equally random, in practice that's not so; the higher-order bits tend to be more random. This means that the modulus operator is totally the wrong thing to use when clamping to a restricted range.)
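Putting those points together, here is a sketch of the original loop in Java, using SecureRandom (whose nextInt(bound) avoids the modulus bias just described; the class and method names are illustrative):

import java.security.SecureRandom;

public class KeyGenerator {
    private static final String SYMBOLS =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    private static final SecureRandom RNG = new SecureRandom(); // self-seeding

    // Random, but note: uniqueness across calls is still not guaranteed.
    static String randomKey(int n) {
        StringBuilder key = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            key.append(SYMBOLS.charAt(RNG.nextInt(SYMBOLS.length())));
        }
        return key.toString();
    }
}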
