Fast and (practically) collision free hash - performance

I have an object which calculates a (long) path. Two objects are equal if they calculate the same path. I previously tested whether two objects were equal by just doing something like:
obj1.calculatePath() == obj2.calculatePath()
However, now this has become a performance bottleneck. I tried storing the path inside the object but since I have a lot of objects this became a memory issue instead.
I have estimated that a 64-bit hash should be enough to avoid collisions, assuming the hash is good (i.e. distributes values uniformly).
So, since the usual fast hashes (Murmur etc.) do have collisions, I would like to avoid them; it sounds like a headache when you can just use a hash like SHA-2. (It's much nicer if I can simply trust the hash instead of doing additional checks in case the hashes of two objects match.)
However, SHA is also "slow" compared to older hash functions (like the MD family), so I wonder if it would be better to use something like MD5 or maybe even MD4.
So my question is: assuming there are no evil hackers with a motive of creating collisions via specially crafted input, but only benign (random) inputs, which hash function should I choose for a performance-critical part of my code where I would like to avoid the added complexity of using an "insecure" hash like Murmur?

It's difficult to help without more information.
As it stands all anyone can recommend is a generic hash-function.
There's an element of "give a few a go"!
FNV-1a (http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function) is usually a not-too-shabby starting point.
It's (a) easy to implement, (b) not usually 'bad', and (c) computationally cheap, so it's applicable to your 'long' path issue.
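To make that concrete, here is a minimal 64-bit FNV-1a sketch in Ruby (the constants are the published FNV offset basis and prime; adapt it to whatever language your path objects live in):

FNV64_OFFSET = 14695981039346656037
FNV64_PRIME  = 1099511628211

def fnv1a_64(str)
  hash = FNV64_OFFSET
  str.each_byte do |b|
    hash ^= b
    hash = (hash * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
  end
  hash
end

fnv1a_64("some/long/path")  # => a 64-bit Integer fingerprint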
However what I want to know is:
What space are these paths in? Are they in (x,y,z,t) 'real' space-time (i.e. trajectories)? Are they paths through some graph? Are they file paths? Something else?
It's difficult to say more without more context.

Related

Faster to call kind_of? or iterate through an array with one value?

In short, is the cost (in time and CPU) higher to call kind_of? twice or to create a new array with one value, then iterate through it? The 'backstory' below simply details why I need to know this, but is not a necessary read to answer the question.
Backstory:
I have a bunch of location data. Latitude/longitude pairs and the name of the place they represent. I need to sort these lat/lon values by distance from another lat/lon pair provided by a user. I have to calculate the distances on the fly, and they aren't known before.
I was thinking it would be easy to do this by adding the distance => placename map to a hash, then get a keyset and sort that, then read out the values in that order. However, there is the potential for two distances being equal, making two keys equal to each other.
I have come up with two solutions to this. Either I do something like:
if hash.has_key?(distance)
  if hash[distance].kind_of?(Array)
    hash[distance] << placename
  else
    hash.merge!({ distance => [hash[distance], placename] })
  end
else
  hash.merge!({ distance => placename })
end
then when reading the values I check something like
hash[distance].kind_of?(Array) ? iterate through the array and grab all the placenames : grab the single placename
each time. Or I could make each value an array from the start even if it has only one placename.
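For reference, that "array from the start" option is just a Hash with a default block; this sketch reuses the question's distance and placename variables:

hash = Hash.new { |h, k| h[k] = [] }
hash[distance] << placename   # every value is an Array, even when only one place sits at that distance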
You've probably spent more time thinking about the issue than you will ever save in CPU time. Developer brain time (both yours and others who will maintain the code when you're gone) is often much more precious than CPU cycles. Focus on code clarity.
If you get indications that your code is a bottleneck, it may be a good idea to benchmark it, but don't forget to benchmark both before and after any changes you make, to make sure that you are actually improving the code. It is surprising how often "optimizations" don't improve the code at all, they just make it harder to read.
To be honest, this sounds like a very negligible performance issue, so I'd say just go with whatever feels better to you.
If you really believe that this has a real world performance impact (and frankly, there are other areas of Ruby you should worry more about speed-wise), reduce your problem to the simplest form that still resembles your problem and use the Benchmark module:
http://www.ruby-doc.org/stdlib/libdoc/benchmark/rdoc/index.html
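For instance, a reduced benchmark of the two ideas might look like this; the labels and the wrapped value are illustrative stand-ins, and the absolute numbers will vary by machine:

require 'benchmark'

value = "placename"   # stand-in for a hash value that may or may not be an Array

Benchmark.bm(12) do |bm|
  bm.report("kind_of?")   { 1_000_000.times { value.kind_of?(Array) ? value : [value] } }
  bm.report("wrap/Array") { 1_000_000.times { Array(value) } }
end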
I would bet that you'll achieve both higher performance and better legibility using the built-in Enumerable#group_by method.
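A sketch of the group_by approach, assuming places is the collection from the question, each place is a hash with a :name key, and distance_from(user_location, place) is a hypothetical helper you would supply (e.g. a haversine calculation):

grouped = places.group_by { |place| distance_from(user_location, place) }

grouped.sort.each do |distance, places_at_distance|
  # places_at_distance is always an Array, even when a single place sits at this distance
  puts "#{distance}: #{places_at_distance.map { |p| p[:name] }.join(', ')}"
end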
As others have said, it's likely that this isn't a bottleneck, that gains will be negligible in any case and that you should focus on other things!

How does one go about reverse engineering an algorithm?

I'm wondering how does one go about reversing an algorithm such as one for storing logins or pin codes.
Let's say I have an amount of data where:
7262627 -> ? -> 8172
5353773 -> ? -> 1132
etc. This is just an example. Or say a hex string that is transformed into another.
&h8712 -> &h1283 or something like that.
How do I go about starting to figure out what that algorithm is? Where does one start?
Would you start trying different shifts, XORs, and hope something stands out? I'm sure there's a better way, as this seems like stabbing in the dark.
Is it even practically possible to reverse engineer this kind of algorithm?
Sorry if this is a stupid question. Thanks for your help / pointers.
There are a few things people try:
Get the source code, or disassemble an executable.
Guess, based on the hash functions other people use. For example, a hash consisting of 32 hex digits might well be one or more repetitions of MD5, and if you can get a single input/output pair then it is quite easy to confirm or refute this (although see "salt", below); a sketch of that confirm-or-refute check follows this list.
Statistically analyze a large number of pairs of inputs and outputs, looking for any kind of pattern or correlations, and relate those correlations to properties of known hash functions and/or possible operations that the designer of the system might have used. This is beyond the scope of a single technique, and into the realms of general cryptanalysis.
Ask the author. Secure systems don't usually rely on the secrecy of the hash algorithms they use (and don't usually stay secure long if they do). The examples you give are quite small, though, and secure hashing of passwords would always involve a salt, which yours apparently don't. So we might not be talking about the kind of system where the author is confident to do that.
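To make the "guess and confirm" item above concrete, a one-pair check might be as simple as this sketch (the arguments are whatever input/output pair you managed to observe):

require 'digest'

def looks_like_md5?(input, observed_hex)
  Digest::MD5.hexdigest(input) == observed_hex.downcase
end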
In the case of a hash where the output is only 4 decimal digits, you can attack it simply by building a table of every possible 7 digit input, together with its hashed value. You can then reverse the table and you have your (one-to-many) de-hashing operation. You never need to know how the hash is actually calculated. How do you get the input/output pairs? Well, if an outsider can somehow specify a value to be hashed, and see the result, then you have what's called a "chosen plaintext", and an attack relying on that is a "chosen plaintext attack". So a 7 digit -> 4 digit hash would be very weak indeed if it was used in a way which allowed chosen plaintext attacks to generate a lot of input/output pairs. I realise that's just one example, but it's also just one example of a technique to reverse it.
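A sketch of that table-building attack, where mystery_hash stands in for whatever black box you can feed chosen plaintexts to (hypothetical name):

table = Hash.new { |h, k| h[k] = [] }
(0..9_999_999).each { |n| table[mystery_hash(format("%07d", n))] << n }
# table[8172] now lists every 7-digit input hashing to 8172 -- a one-to-many "de-hash"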
Note that reverse engineering the hash, and actually reversing it, are two different things. You could figure out that I'm using SHA-256, but that wouldn't help you reverse it (i.e., given an output, work out the input value). Nobody knows how to fully reverse SHA-256, although of course there are always rainbow tables (see "salt", above) <conspiracy>At least nobody admits they do, so it's no use to you or me.</conspiracy>
Probably, you can't. Suppose the transformation function is known, something like
require 'digest'
def hash(text)
  Digest::SHA1.hexdigest("secret salt" + text)
end
But the "secret salt" is not known, and is cryptographically strong (a very large, random integer). You could never brute-force the secret salt from even a very large number of plaintext/crypttext pairs.
In fact, if the precise hash function used was known to be one of two equally strong functions, you could never even get a good guess between which one was being used.
Stabbing in the dark will drive you to insanity. There are some algorithms that, given current understanding, you couldn't hope to deduce the inner workings of between now and the [predicted] end of the universe without knowing the exact details (potentially including private keys or internal state). Of course, some of these algorithms are the foundations of modern cryptography.
If you know in advance that there's a pattern to be discovered though, there are sometimes ways of approaching this. For instance, if the dataset contains several input values that differ by 1, compare the corresponding output values:
7262627 -> 8172
7262628 -> 819
7262629 -> 1732
...
7262631 -> 3558
Here it's fairly clear (given a few minutes and a calculator) that when the input increases by 1, the output increases by 913 modulo 8266 (i.e. a simple linear congruential generator).
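You can sanity-check that observation against the example pairs above with a few lines:

pairs = { 7262627 => 8172, 7262628 => 819, 7262629 => 1732, 7262631 => 3558 }
pairs.each_cons(2) do |(in_a, out_a), (in_b, out_b)|
  predicted = (out_a + (in_b - in_a) * 913) % 8266
  puts predicted == out_b   # => true for every consecutive pair
end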
Differential cryptanalysis is a relatively modern technique used to analyse the strength of cryptographic block ciphers, relying on a similar but more complex idea, for the case where the cipher algorithm is known but the private key is assumed not to be. Input blocks differing from each other by a single bit are considered, and the effect of that bit is traced through the cipher to deduce how likely each output bit is to "flip" as a result.
Other ways of approaching this kind of problem would be to look at the extremes (maximum, minimum values), distribution (leading to frequency analysis), direction (do the numbers always increase? decrease?) and (if this is allowed) consider the context in which the data sets were found. For instance, some types of PIN codes always contain a repeated digit to make them easier to remember (I'm not saying a PIN code can necessarily be deduced from anything else - just that a repeated digit is one less digit to worry about!).
Is it even practically possible to reverse engineer this kind of algorithm?
It is possible with a flawed algorithm and enough encrypted/unencrypted pairs, but a well-designed algorithm can eliminate the possibility of doing it at all.

Ideal hashing method for wide distribution of values?

As part of the rhythm game I'm working on, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the songs and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible; however, cryptographic strength isn't as important here as a wide, uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
Any cryptographic-strength algorithm should exhibit no collisions at all. Of course, collisions necessarily exist (there are more possible inputs than possible outputs), but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with work about 2^(n/2), so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
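As a rough feel for those numbers, the birthday approximation p ~ k(k-1)/2^(n+1) estimates the chance of any random collision among k hashes of n bits; a small sketch:

def collision_probability(k, n_bits)
  # birthday approximation, valid while the result is small
  k.to_f * (k - 1) / 2.0**(n_bits + 1)
end

collision_probability(1_000_000, 64)    # => ~2.7e-08
collision_probability(1_000_000, 128)   # => ~1.5e-27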
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision; they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and it provides 128 bits of output, which is enough to avoid random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. On even a basic PC, they already process data faster than a hard disk can provide it: if you hash a file, the file reading will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and to consider switching to another function only if some performance-related trouble is duly noticed and measured.
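If you go that route, truncation is just a substring of the hex digest; a sketch (the filename is a placeholder):

require 'digest'
full_digest = Digest::SHA256.file("song.ogg").hexdigest
short_id    = full_digest[0, 32]   # first 128 bits (32 hex characters)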
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be perfectly satisfactory.
If cryptographic security is not a concern then you can look at this link & this. The fastest and simplest (to implement) would be Pearson hashing, if you are planning to compute a hash of the title/name and later do lookups. Or you can have a look at the superfast hash here. It is also very good for non-cryptographic use.
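For illustration, Pearson hashing is only a table lookup per byte; a minimal sketch with an 8-bit output (real uses typically combine several passes to build a wider hash):

T = (0..255).to_a.shuffle(random: Random.new(42))   # fixed permutation table, seeded for reproducibility

def pearson8(str)
  str.each_byte.reduce(0) { |h, b| T[h ^ b] }
end

pearson8("My Custom Song")   # => an Integer in 0..255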
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K to fit in two bytes) and a 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
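A sketch of that 6-byte scheme using a CRC-32 as the checksum (one possible choice, not the only one):

require 'zlib'
def six_byte_hash(data)
  [data.bytesize % 65_536, Zlib.crc32(data)].pack("nN")   # 2 bytes of length + 4 bytes of CRC
end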
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
Isn't cryptographic hashing overkill in this case, though I understand that modern computers do this calculation pretty fast? I assume that your users will have a unique userid. When they upload, you just need to increment a number. So you would represent them internally as userid1_song_1, userid1_song_2, etc. You can store this info in a database with that as the unique key, along with the user-specified name.
You also didn't mention the size of these songs. If they are MIDI, the file size will be small. If the file sizes are big (say 3 MB) then SHA calculations will not be instantaneous. On my Core 2 Duo laptop, sha256sum of a 3.8 MB file takes 0.25 seconds; for sha1sum it is 0.2 seconds.
If you intend to use a cryptographic hash, then SHA-1 should be more than adequate and you don't need SHA-256. No collisions --- though they exist --- have been found yet. Git, Mercurial and other distributed version control systems use SHA-1. Git is a content-based system and uses SHA-1 to find out whether content has been modified.
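For illustration, Git-style content addressing boils down to hashing a small header plus the content, so identical content always maps to the same id; a sketch with a placeholder filename:

require 'digest'
content = File.binread("notechart.dat")
blob_id = Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")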

calculating which strings will have the same hash

With SHA-1, is it possible to figure out which finite strings will render equal hashes?
What you are looking for is the solution to the Collision Problem (See also collision attack). A well-designed and powerful cryptographic hash function is designed with the intent of as much obfuscating mathematics as possible to make this problem as hard as possible.
In fact, one of the measures of a good hash function is the difficulty of finding collisions. (Among the other measures, the difficulty of reversing the hash function)
It should be noted that, for hashes where the input can be a string of any length and the output is a fixed-length string, the Pigeonhole Principle ensures that at least one collision exists for any given string. However, finding such a colliding string is not easy, as it would require essentially blind guess-and-check over an effectively infinite collection of strings.
It might be useful to read up on ideal hash functions. Hash functions are designed to be functions where
Small changes in the input cause radical, chaotic changes in the output
Collisions are reduced to a minimum
It is difficult or, ideally, impossible to reverse
There are no hashed values that are impossible to obtain with any inputs (this one matters significantly less for cryptographic purposes)
The theoretical "perfect" hash algorithm would be a "random oracle" -- that is, for every input, it outputs a perfectly random output, on the condition that for the same input, the output will be identical (this condition is fulfilled with magic, by the hand of Zeus and pixie fairies, or in a way that no human could ever possibly understand or figure out)
Unfortunately, this is pretty much impossible, and ultimately, all hashes are judged as "strong" based on how much of these qualities they possess, and to what degree.
A hash like SHA1 or MD5 is going to be pretty strong, and more or less computationally impossible to find collisions for (within a reasonable time frame). Ultimately, you don't need to find a hash that is impossible to find collisions for. You only practically need one where the difficulty of it is large enough that it'd be too expensive to compute (ie, on the order of a billion or a trillion years to find a collision)
Because no hash is perfect, one could analyze its internal workings, spot mathematical patterns and heuristics, and try to find collisions along those patterns. This is similar to a hash function that is simply "x mod 7": hashing the number 13 gives 13 % 7 = 6, and 89 % 7 = 5. If you saw a hash of 3, you could use your understanding of the modulus function to easily find a collision (e.g., 10).[1] Fortunately for us, stronger hash functions have a much, much harder to understand mathematical basis. (Ideally, so hard that no human would ever understand it!)
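A quick demonstration of that toy example:

hash7 = ->(n) { n % 7 }
hash7.call(13)   # => 6
hash7.call(89)   # => 5
hash7.call(3)    # => 3
hash7.call(10)   # => 3   (collision: 10 = 3 + 7)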
Some figures:
Finding a collision for a single given SHA-0 hash takes about 13 full days of running computations on the top supercomputers in the world, using the patterns inherent in the math.
According to a helpful commenter, MD5 collisions can be generated "quickly" enough to be less than ideal for sensitive purposes.
No feasible or practical/usable collision finding method for SHA-1 has been found or proven so far, although, as pointed out in the comments, there are some weaknesses that have been discovered.
Here is a similar SO question, which has answers much wiser than mine.
[1] Note that, while this hash function is weak against collisions, it is strong in that it is impossible to go backwards and determine which key produced a given hash of, say, 4: there are infinitely many candidates (4, 11, 18, 25, ...).
The answer is pretty clearly yes, since at the very least you could run through every possible string of the given length, compute the hashes of all of them, and then see which are the same. The more interesting question is how to do it quickly.
Further reading: http://en.wikipedia.org/wiki/Collision_attack
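For a feel of the brute-force idea, here is a sketch that deliberately truncates SHA-1 to 16 bits so collisions show up in a tiny search space (the full digest would make this hopeless):

require 'digest'
seen = {}
('aa'..'zz').each do |s|
  h = Digest::SHA1.hexdigest(s)[0, 4]   # first 16 bits, as 4 hex characters
  puts "#{seen[h]} and #{s} collide on #{h}" if seen[h]
  seen[h] ||= s
end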
It depends on the hash function. With a simple hash function, it may be possible. For example, if the hash function simply sums the ASCII byte values of a string, then one could enumerate all strings of a given length that produce a given hash value. If the hash function is more complex and "cryptographically strong" (e.g., MD5 or SHA1), then it is not feasible in practice.
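For the byte-sum example, enumerating the colliding inputs really is a one-liner:

target = 200
matches = ('a'..'z').to_a.product(('a'..'z').to_a).map(&:join).select { |s| s.bytes.sum == target }
# matches == ["ag", "bf", "ce", "dd", "ec", "fb", "ga"] -- all collide under that toy hash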
Most hashes are of cryptographic or near-cryptographic strength, so the hash depends on the input in non-obvious ways. The way this is done professionally is with rainbow tables, which are precomputed tables of inputs and their hashes. So brute force checking is basically the only way.

Efficient storage of external index of strings

Say you have a large collection with n objects on disk and each one has a variable-sized string. What are common practices for efficiently building an index of those objects keyed by plain string comparison? Storing the whole strings in the index would be prohibitive in the long run due to size and I/O, but since disks have high latency, storing only references isn't a good idea either.
I've been thinking of using a B-Tree-like design with tries, but I can't find any database implementation using this approach. In fact, it's hard to find how major databases implement indexes for strings (it probably gets lost in the vast results for SQL-level information).
TIA!
EDIT: changed title from "Efficient external sorting and searching of stored objects with large strings" to "Efficient storage of external index of strings."
A "prefix B-tree" or "simple prefix B-tree" would probably be helpful here.
A "simple prefix B-tree" is a bit simpler, just storing the shortest prefix that separates two items, without trying to eliminate redundancy within those prefixes (e.g. for 'astronomy' and 'azimuth', it would store just 'as' and 'az', but not try to keep from duplicating the 'a').
A "prefix B-tree" is close to what you've described -- something like a trie, but in a B-tree structure to give good characteristics when stored primarily on disk. Nonetheless, it's intended to remove (most of) the redundancy within the prefixes that form the index.
There is one other question: do you really need to traverse the records in order, or do you just need to look up a specified record quickly? If the latter is adequate, you might be able to use extendible hashing instead. Extendible hashing has been around (in a number of different forms) for a few decades, and still works pretty well. The general idea is fairly simple: hash the strings to create keys of fixed length, then create some sort of tree of those fixed-length pseudo-keys. As with (almost) any hash, you have to be prepared to deal with collisions. As with other hash tables, the details of the hashing and collision resolution vary (though probably not quite as much with extendible hashing as in-memory hashing).
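A very rough sketch of the fixed-length pseudo-key idea (a real implementation also grows the depth and splits buckets as they overflow):

require 'digest'
DEPTH = 4
buckets = Hash.new { |h, k| h[k] = [] }

def pseudo_key(str)
  Digest::SHA256.digest(str)[0, 8].unpack1("Q>")   # first 64 bits as an unsigned integer
end

%w[astronomy azimuth zebra zenith].each do |s|
  buckets[pseudo_key(s) >> (64 - DEPTH)] << s      # bucket by the top DEPTH bits
end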
As for real use, major DBMS and DBMS-like systems use all of the above. B-tree variants are probably the most common in the general purpose DBMS market (e.g. Oracle or MS SQL Server). Extendible hashing is used in a fair number of more-specialized products (e.g., Lotus Domino Server).
What are you doing with the objects?
If you're running a large system that needs low latency to handle lots of concurrent requests, then I'd store the objects in a database and have it take care of the sorting and indexing. This would be much simpler than implementing a B-tree from scratch and possibly having it be buggy.
DBMSs also have caching and various other features that might make your life easier.
Start by being clear what you want. Do you want to sort them or index them? Sorting is likely to require moving at least some of the items on disk, but indexing would likely leave them where they are.
If you really want to sort them, Knuth's "The Art of Computer Programming" volume three covers sorting and searching in about as much detail as you're likely to want.

Resources