BCrypt hashes always begin with a short, predictable prefix. Let's say, for example, we see $2a$10 at the beginning of our hash. Every BCrypt hash starts with something similar to this.
$ is a separator
2a is, in this case, the version
10 is the cost factor: the number of iterations is 2 to the power of 10
My question is - why is this information in the hash?
There is no dehashing algorithm that might need this information in particular. When people log in, they generate the same hash using the same version and the same number of iterations, and the result is compared to what is stored in the database. This means the algorithm doesn't have a built-in comparison function that takes the hash and, based on this information (version and iterations), hashes the password to make the comparison.
Then...why is it so that this information is given away? Who uses this information?
My guess is that it's there so that if the version or the number of iterations changes, our program (or whatever) will know. But why? The algorithm must be configured only once, and if changes are required, then it is the company's job to make the appropriate arrangements so that it knows what version was used before and what is used now. Why is it the hash's job to remember the version and number of iterations?
Hashes get leaked every week or so, and with this information someone can easily set up their own BCrypt and run it with the same version and iteration configuration. However, if this information weren't visible in the hash and the hash got public, then how would anyone set up their own BCrypt instance and start comparing?
Isn't it safer not to provide this information, so that if the hash alone gets leaked, nobody would know what configuration was used to make it?
It makes bcrypt forward- and backward-compatible.
For example, bcrypt hashes do not all start with $2a$.
They start with one of:
$2$
$2a$
$2x$
$2y$
$2b$
You need to know which version of the hash you're reading, so you can handle it correctly.
Also, you need to know the number of iterations.
not every hash will use 10
not every hash will use the same cost
Why store the version and iterations? Because you have to.
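All of that metadata sits at a fixed position in the string, so any consumer can read it back. A minimal Python sketch (the example hash below is made up for illustration):

```python
def parse_bcrypt(hash_str):
    # Modular crypt format: $<version>$<cost>$<22-char salt><31-char hash>
    _, version, cost, rest = hash_str.split("$", 3)
    return {
        "version": version,
        "cost": int(cost),   # work factor: 2**cost rounds
        "salt": rest[:22],
        "hash": rest[22:],
    }

parsed = parse_bcrypt("$2b$12$GhvMmNVjRW29ulnudl.LbuAnUtN/LRfe1JsBm1Xu6LE3059z5Tr8m")
print(parsed["version"], parsed["cost"])  # 2b 12
```

Real bcrypt libraries do this parsing internally, which is exactly why the caller only ever needs one verify function.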
Also, it's an excellent design. In the past, people used to just store a bare hash, and it was awful.
people used either no salt, or the same salt every time, because storing it was too hard
people used 1 iteration, or a hard-coded number of iterations, because storing it was too hard
BCrypt, SCrypt, and Argon2 use the extraordinarily clever idea of doing all that grunt-work for you, leaving you with only having to call one function, without weakening the security of the system in any way.
You're trying to preach security by obscurity. Stop that; it doesn't work.
Not having details in the data is not unusual; in this old hack it was the hackers who mentioned it was SHA1. This is easy: attackers, and researchers too, will take the leaked data and simply try all kinds of common algorithms and iteration/work-factor counts against a small list of common passwords, like the phpbb list from SkullSecurity; when they find the inevitable cracked terrible passwords, they'll know they've found the algorithm, and break out the full-scale cracking.
Having the algorithm stored means you can transition from old to new gradually, upgrading individual users as they come in, AND have multiple variants in use at once, including transitional types.
Transitional example: you were on salted SHA-1 (BAD), moving to PBKDF2-HMAC-SHA-512 with 200,000 iterations (good); in the middle you bulk-convert everything to PBKDF2-HMAC-SHA-512(SHA-1(salted password)), but at each user's login you move them to pure PBKDF2-HMAC-SHA-512(password).
Having the iteration count stored means, as in the transition above, you can increase it over time and have different counts for different users, set as they log in.
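A sketch of that login-time upgrade, using Python's standard-library PBKDF2 in place of bcrypt (the storage format and function names here are my own invention, not bcrypt's):

```python
import hashlib
import hmac
import os

CURRENT_ITERATIONS = 200_000  # today's target cost

def make_hash(password, iterations=CURRENT_ITERATIONS):
    # Store the scheme and its parameters next to the hash, just as bcrypt does.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha512", password.encode(), salt, iterations)
    return f"pbkdf2-sha512${iterations}${salt.hex()}${digest.hex()}"

def verify_and_upgrade(password, stored):
    scheme, iters, salt_hex, digest_hex = stored.split("$")
    recomputed = hashlib.pbkdf2_hmac(
        "sha512", password.encode(), bytes.fromhex(salt_hex), int(iters)
    )
    if not hmac.compare_digest(recomputed.hex(), digest_hex):
        return False, stored
    # Correct password: if the stored cost lags behind today's target,
    # rehash now, while we briefly have the plaintext in hand.
    if int(iters) < CURRENT_ITERATIONS:
        stored = make_hash(password)
    return True, stored
```

At each successful login, users on an old iteration count are silently moved to the new one; a wrong password leaves the stored record untouched.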
I am planning on generating a set of public/private keys from a deterministic identifying piece of information from a person and was planning on using fingerprints.
My question, therefore, is: what is the output of a fingerprint scanner? Is there any deterministic output I could use, or is it always going to be a matter of "confidence level"? i.e. Do I always get a "number" which, if matched exactly to the database, will allow access, or do I rather get a number which, if "close enough" to the stored value on the database, allows access, based on a high degree of confidence, rather than an exact match?
I am quite sure the second option is the answer but just wanted to double-check. Is there any way to get some sort of deterministic output? My hope was to re-generate keys every time rather than actually storing fingerprint data. That way a wrong fingerprint would simply generate a new and useless key.
Any suggestions?
Thanks in advance.
I would advise against it for several reasons.
Fingerprints are not entirely deterministic. As suggested in #ImSimplyAnna's answer, you might 'round' the results in order to have a better chance of obtaining a deterministic result. But that would significantly reduce the number of possible/plausible fingerprints, and thus not meet the search-space size requirement for a cryptographic algorithm. On top of that, I suspect the entropy of such a result to be somewhat low, compared to the requirements of modern algorithms, which are always based on high-quality random numbers.
Fingerprints are not secret: we expose them to everyone all the time, and they can be revealed to an attacker at any moment, captured in a picture with a simple camera. A key must be secret, and the only place we know we can store secrets without exposing them is our brain (which is why we use passwords).
An important feature of cryptographic keys is the possibility of generating new ones if there is reason to believe the current ones might be compromised. This is not possible with fingerprints.
That is why I would advise against it. More globally, I discourage anyone (myself included) from writing his/her own cryptographic algorithm, because it is so easy to screw up. It might be the easiest thing to screw up out of all the things you could write, because attackers are so vicious!
The only good approach, if you're not a skilled specialist, is to use libraries that are used all around, because they've been written by experts on the matter, and they've been subject to many attacks and attempts to break them, so the ones still standing will offer much better levels of protection than anything a non-specialist could write (or, really, anything a single human could write).
You can also have a look at this question on the Crypto Stack Exchange. They also discourage the OP from using anything other than a battle-hardened algorithm or protocol.
Edit:
I am planning on generating a set of public/private keys from a
deterministic identifying piece of information
Actually, it did not strike me at first (it should have), but keys MUST NOT be generated from anything that is not random. NEVER.
You have to generate them randomly. If you don't, you are already giving the attacker more information than you'd like. Being a programmer does not make you a cryptographer. Your users' information is at stake; do not take any chances (and if you're not a cryptographer, you don't actually stand any).
A fingerprint scanner looks for features where the lines on the fingerprint either split or end. It then calculates the distances and angles between such features in an attempt to find a match.
Here's some more reading on the subject:
https://www.explainthatstuff.com/fingerprintscanners.html
in the section "How fingerprints are stored and compared".
The source is the best explanation I can find, but looking around some more it seems that all fingerprint scanners use some variety of that algorithm to generate data that can be matched.
Storing raw fingerprints would not only take up way more space on a database but also be a pretty significant security risk if that information was ever leaked, so it's not really done unless absolutely necessary.
Judging by that algorithm, I would assume that there is always some "confidence level". The angles and distances will never be 100% equal between scans, so there has to be some leeway to make sure a match is still found even if the finger is pressed against the scanner a bit harder or the finger is at a slightly different angle.
Based on this, I'd assume that generating a key pair based on a fingerprint would be possible, if you can figure out a way to make similar scans result in the same information. Simply rounding the angles and distances may work, but may introduce cases where two different people generate the same key pairs, or cases where different scans of the same fingerprint have a high chance of generating several different keys.
I was researching how MD5 is known to have collisions, so it's not secure enough. I am looking for a hashing algorithm that even supercomputers will take time to break. So can you tell me which hashing algorithm will keep my passwords safe for, say, the next 20 years of supercomputing advancement?
Use a key derivation function with a variable number of rounds, such as bcrypt.
The passwords you hash today, with a hashing difficulty that your own system can handle without slowing down, will always be vulnerable to the faster systems of 20 years in the future. But by increasing the number of rounds gradually over time, you can increase the amount of work it takes to check a password in proportion to the increasing power of supercomputers. And you can apply more rounds to existing stored passwords without having to go back to the original password.
Will it hold up for another 20 years? Difficult to say: who knows what crazy quantum crypto and password-replacement schemes we might have by then? But it certainly worked for the last 10.
Note also that entities owning supercomputers and targeting particular accounts are easily going to have enough power to throw at it that you can never protect all of your passwords. The aim of password hashing is to mitigate the damage from a database leak, by limiting the speed at which normal attackers can recover passwords, so that as few accounts as possible have already been compromised by the time you've spotted the leak and issued a notice telling everyone to change their passwords. But there is no 100% solution.
As someone else said, what you're asking is practically impossible to answer. Who knows what breakthroughs will be made in processing power over the next twenty years? Or mathematics?
In addition you aren't telling us many other important factors, including against which threat models you aim to protect. For example, are you trying to defend against an attacker getting a hold of a hashed password database and doing offline brute-forcing? An attacker with custom ASICs trying to crack one specific password? Etc.
With that said, there are things you can do to be as secure and future-proof as possible.
First of all, don't just use vanilla cryptographic hash algorithms; they aren't designed with your application in mind. Indeed they are designed for other applications with different requirements. For one thing, they are fast because speed is an important criterion for a hash function. And that works against you in this case.
Additionally some of the algorithms you mention, like MD5 or SHA1 have weaknesses (some theoretical, some practical) and should not be used.
Prefer something like bcrypt, an algorithm designed to resist brute-force attacks by being much slower than a general-purpose cryptographic hash, and whose security can be "tuned" as necessary.
Alternatively, use something like PBKDF2, which is designed to run a password through a function of your choice a configurable number of times along with a salt; this also makes brute-forcing much more difficult.
Adjust the iteration count depending on your usage model, keeping in mind that the slower it is, the more security against brute-force you have.
In selecting a cryptographic hash function for PBKDF2, prefer SHA-3 or, if you can't use that, one of the long variants of SHA-2: SHA-384 or SHA-512. I'd steer clear of SHA-256, although I don't think there's an issue with it in this scenario.
In any case, use the largest and best salt you can; I'd suggest that you use a good cryptographically secure PRNG and never use a salt of less than 64 bits (note that I am talking about the length of the salt generated, not the value returned).
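Putting those recommendations together, here is a minimal Python sketch using PBKDF2-HMAC-SHA-512, a CSPRNG-generated salt well above the 64-bit floor, and a tunable iteration count (the 210,000 default is purely illustrative):

```python
import hashlib
import secrets

def hash_password(password, iterations=210_000):
    # 16-byte (128-bit) salt from a CSPRNG -- comfortably above the
    # 64-bit minimum suggested above.
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha512", password.encode(), salt, iterations)
    return salt, iterations, digest

salt, iters, digest = hash_password("correct horse battery staple")
print(len(salt) * 8, iters, len(digest) * 8)  # 128 210000 512
```

The salt and iteration count must be stored alongside the digest so the same computation can be repeated at login.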
Will these recommendations help 20 years down the road? Who knows - I'd err on the side of caution and say "no". But if you need security for that long a timeframe, you should consider using something other than passwords.
Anyways, I hope this helps.
Here are two pedantic answers to this question:
If P = NP, there is provably no such hash function (and vice versa, incidentally). Since it has not been proven that P != NP at the time of this writing, we cannot make any strong guarantees of that nature.
That being said, I think it's safe to say that supercomputers developed within the next 20 years will take "time" to break your hash, regardless of what it is. Even if it is in plaintext some time is required for I/O.
Thus, the answer to your question is both yes and no :)
I have a project that needs to do validation on the frontend for an American Social Security Number (format ddd-dd-dddd). One suggestion would be to use a hash algorithm, but given the tiny character set used ([0-9]), this would be disastrous. It would be acceptable to validate with some high probability that a number is correct and allow the backend to do a final == check, but I need to do far better than "has nine digits" etc etc.
In my search for better alternatives, I came upon the validation checksums for ISBN numbers and UPC. These look like a great alternative with a high probability of success on the frontend.
Given those constraints, I have three questions:
Is there a way to prove that an algorithm like ISBN13 will work with a different category of data like SSN, or whether it is more or less fit to the purpose from a security perspective? The checksum seems reasonable for my quite large sample of one real SSN, but I'd hate to find out that they aren't generally applicable for some reason.
Is this a solved problem somewhere, so that I can simply use a pre-existing validation scheme to take care of the problem?
Are there any such algorithms that would also easily accommodate validating the last 4 digits of an SSN without giving up too much extra information?
Thanks as always,
Joe
UPDATE:
In response to a question below, a little more detail. I have the customer's SSN as previously entered, stored securely on the backend of the app. What I need to do is verification (to the maximum extent possible) that the customer has entered that same value again on this page. The issue is that I need to prevent the information from being incidentally revealed to the frontend in case some non-authorized person is able to access the page.
That is why an MD5/SHA1 hash is inappropriate: namely that it can be used to derive the complete SSN without much difficulty. A checksum (say, modulo 11) provides nearly no information to the frontend while still allowing a high degree of accuracy for the field validation. However, as stated above I have concerns over its general applicability.
Wikipedia is not the best source for this kind of thing, but given that caveat, http://en.wikipedia.org/wiki/Social_Security_number says
Unlike many similar numbers, no check digit is included.
But before that it mentions some widely used filters:
The SSA publishes the last group number used for each area number. Since group numbers are allocated in a regular (if unusual) pattern, it is possible to identify an unissued SSN that contains an invalid group number. Despite these measures, many fraudulent SSNs cannot easily be detected using only publicly available information. In order to do so there are many online services that provide SSN validation.
Restating your basic requirements:
A reasonably strong checksum to protect against simple human errors.
"Expected" checksum is sent from server -> client, allowing client-side validation.
Checksum must not reveal too much information about SSN, so as to minimize leakage of sensitive information.
I might propose using a cryptographic hash (SHA-1, etc.), but do not send the complete hash value to the client. For example, send only the lowest 4 bits of the 160-bit hash result[1]. By sending 4 bits of checksum, your chance of detecting a data-entry error is 15/16, meaning that you'll detect mistakes about 94% of the time. The flip side, though, is that you have "leaked" enough info to reduce their SSN to 1/16 of the search space. It's up to you to decide if the convenience of client-side validation is worth this leakage.
By tuning the number of "checksum" bits sent, you can adjust between convenience to the user (i.e. detecting mistakes) and information leakage.
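A sketch of that idea in Python, exposing only the low 4 bits of a SHA-1 digest (the sample value is the well-known invalidated example SSN 078-05-1120, not a live number):

```python
import hashlib

def ssn_checksum(ssn, bits=4):
    # Low `bits` bits of the SHA-1 digest. With 4 bits, a wrong entry
    # slips through only 1 time in 16, yet only 4 bits of the SSN's
    # hash are ever exposed to the client.
    digest = hashlib.sha1(ssn.encode()).digest()
    return digest[-1] & ((1 << bits) - 1)

expected = ssn_checksum("078-05-1120")   # computed server-side
entered = ssn_checksum("078-05-1120")    # recomputed in the browser
print(expected == entered)  # True
```

Raising `bits` detects more typos but leaks proportionally more of the search space.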
Finally, given your requirements, I suspect this convenience / leakage tradeoff is an inherent problem: Certainly, you could use a more sophisticated crypto challenge / response algorithm (as Nick ODell astutely suggests). However, doing so would require a separate round-trip request-- something you said you were trying to avoid in the first place.
[1] In a good crypto hash function, all output digits are well randomized due to avalanche effect, so the specific digits you choose don't particularly matter-- they're all effectively random.
Simple solution: take the number mod 100001 as your checksum. There is roughly a 1-in-100,000 chance that you'll accidentally get the checksum right with the wrong number (and it will be very resistant to one- or two-digit mistakes canceling out), and around 10,000 possible SSNs that it could be, so you have not revealed the SSN to an attacker.
The only drawback is that the 10,000 other possible SSNs are easy to figure out. If the person can get the last 4 digits of the SSN from elsewhere, then they can probably figure out the whole SSN. If you are concerned about this, then you should take the user's SSN, add a salt, and hash it. And deliberately use an expensive hash algorithm to do so. (You can just iterate a cheaper algorithm, like MD5, a fixed number of times to increase the cost.) Then use only a certain number of bits. The point here is that while someone can certainly go through all billion possible SSNs to come up with a limited list of possibilities, it will cost them more to do so. Hopefully enough that they don't bother.
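The mod-100001 checksum is a one-liner; a quick Python sketch:

```python
def ssn_checksum(ssn):
    # "078-05-1120" -> 78051120, then reduced mod 100001
    return int(ssn.replace("-", "")) % 100001

print(ssn_checksum("078-05-1120"))  # 50340
# A single-digit typo changes the checksum:
print(ssn_checksum("078-05-1121"))  # 50341
```

Note that 100001 (rather than 100000) is what keeps the checksum from simply being the last five digits of the SSN.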
As part of my rhythm game that I'm working, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the song and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible, however, cryptographic strength isn't of much importance here as a wide uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
All cryptographic-strength algorithms should exhibit no collisions at all. Of course, collisions necessarily exist (there are more possible inputs than possible outputs), but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with about 2^(n/2) work, so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision; they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and provides 128 bits of output, which avoids random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. On a most basic PC, they already process data faster than what a hard disk can provide: if you hash a file, the file reading will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and consider switching to another function only if some performance-related trouble is duly noticed and measured.
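That advice amounts to very little code; a Python sketch (the function name is mine):

```python
import hashlib

def track_id(data):
    # SHA-256 truncated to 128 bits (32 hex characters), as suggested above.
    return hashlib.sha256(data).hexdigest()[:32]

chart = b"example notechart bytes"
print(len(track_id(chart)))  # 32
```

Truncation is safe here because every bit of a SHA-256 digest is equally unpredictable, so the first 128 bits behave like a 128-bit hash.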
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be perfectly satisfactory.
If cryptographic security is not a concern, then you can look at this link & this one. The fastest and simplest (to implement) would be Pearson hashing, if you are planning to compute a hash for the title/name and later do lookups. Or you can have a look at the superfast hash here. It is also very good for non-cryptographic use.
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K, to fit in two bytes) and a 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
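A sketch of that 6-byte scheme in Python, using CRC-32 as the 32-bit checksum (one plausible choice; the answer doesn't specify which checksum):

```python
import struct
import zlib

def quick_hash(data):
    # 2 bytes of length (mod 64K) + 4-byte CRC-32 = 6-byte identifier.
    return struct.pack(">HI", len(data) % 65536, zlib.crc32(data))

print(len(quick_hash(b"some file contents")))  # 6
```

Both fields are cheap to compute in a single pass over the file, which is the point of the suggestion.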
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
Isn't cryptographic hashing overkill in this case, even though modern computers do this calculation pretty fast? I assume your users will have a unique userid. When they upload, you just need to increment a number, so you will represent the songs internally as userid1_song_1, userid1_song_2, etc. You can store this info in a database with that as the unique key, along with the user-specified name.
You also didn't mention the size of these songs. If they're MIDI, the files will be small. If the files are big (say 3 MB), then sha calculations will not be instantaneous. On my Core 2 Duo laptop, sha256sum of a 3.8 MB file takes 0.25 seconds; for sha1sum it is 0.2 seconds.
If you intend to use a cryptographic hash, then sha1 should be more than adequate; you don't need sha256. No collisions --- though they exist --- have been found yet. Git, Mercurial and other distributed version control systems use sha1. Git is a content-addressed system and uses sha1 to find out if content has been modified.
I want to generate unique filenames per image, so I'm using MD5 to make filenames. Since two copies of the same image could come from different locations, I'd like to base the hash on the image contents. What caveats does this present?
(doing this with PHP5 for what it's worth)
It's a good approach. There is an extremely small possibility that two different images might hash to the same value, but in reality your data center has a greater probability of suffering a direct hit by an asteroid.
One caveat is that you should be careful when deleting images. If you delete an image record that points to some file and you delete the file too, then you may be deleting a file that has a different record pointing to the same image (that belongs to a different user, say).
Given completely random file contents and a good cryptographic hash, the probability that there will be two files with the same hash value reaches 50% when the number of files is roughly 2 to the power of half the number of bits in the hash function. That is, for a 128-bit hash there will be a 50% chance of at least one collision when the number of files reaches 2^64.
Your file contents are decidedly not random, but I have no idea how strongly that influences the probability of collision. This is called the birthday attack, if you want to google for more.
It is a probabilistic game. If the number of images will be substantially less than 2^64, you're probably fine. If you're still concerned, using a combination of SHA-1 plus MD5 (as another answer suggested) gets you to a total of 288 high-quality hash bits, which means you'll have a 50% chance of a collision once there are 2^144 files. 2^144 is a mighty big number. Mighty big. One might even say huge.
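Those 50% figures come from the standard birthday approximation, p ~= 1 - exp(-k^2 / 2N); a quick check (Python here just for illustration, since the question mentions PHP):

```python
import math

def collision_probability(num_files, hash_bits):
    # Birthday approximation: p ~= 1 - exp(-k^2 / (2 * 2^bits))
    n = 2.0 ** hash_bits
    return 1.0 - math.exp(-(num_files ** 2) / (2.0 * n))

# 2^64 files into a 128-bit hash is right around the "50%" rule of thumb:
print(round(collision_probability(2 ** 64, 128), 2))  # 0.39
```

For any realistic image library (millions of files, not 2^64), the probability is so small that it rounds to zero in double precision.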
You should use SHA-1 instead of MD5, because MD5 is broken. There are pairs of different files with the same MD5 hash (not theoretical; these are actually known, and there are algorithms to generate even more pairs). For your application, this means someone could upload two different images which would have the same MD5 hash (or someone could generate such a pair of images and publish them somewhere in the Internet such that two of your users will later try to upload them, with confusing results).
Seems fine to me, if you're ok with 32-character filenames.
Edit: I wouldn't use this as the basis of (say) the FBI's central database of terrorist mugshots, since a sufficiently motivated attacker could probably come up with an image that had the same MD5 as an existing one. If that was the case then you could use SHA1 instead, which is somewhat more secure.
You could use a UUID instead?
If you have two identical images loaded from different places, say a stock photo, then you could end up over-writing the 'original'. However, that would mean you're only storing one copy, not two.
With that being said, I don't see any big issues with doing it in the way you described.
It will be time consuming. Why don't you just assign them sequential ids?
You might want to look into the technology P2P networks use to identify duplicate files. A solution involving MD5, SHA-1, and file length would be pretty reliable (and probably overkill).
ImageMagick, and the PHP class Imagick that accesses it, are able to compare images more subjectively than hashing functions can, by factors like colour. There are countless methods and user preferences to consider, so here are some resources covering a few approaches, to see what might suit your intended application:
http://www.imagemagick.org/Usage/compare/
http://www.imagemagick.org/discourse-server/viewtopic.php?f=1&t=8968&start=0
http://galleryproject.org/node/11198#comment-39927
Any of the hashing functions like MD5 will only attempt to determine whether the files are identical bit-wise, not check for visual similarity (with a margin of error for lossy compression or slight crops).