I have a website that gives users different outcomes depending on a virtual dice roll. I want them to trust that my random numbers are honest, so instead of me determining it in my own code (which to my skeptical users is a black box I can manipulate), I want to come up with some other mechanism.
One idea is to point to some credible website (e.g. governmental) that has a publicly observable random number that changes over time. Then I could say, "We will base your outcome on a number between 0 and 9, which will be the number at [url] in 10 seconds."
Any suggestions?
I'd go with the ANU Quantum Random Numbers site (qrng.anu.edu.au) myself. It has public, anonymous URLs for several kinds of numbers, and real-time pages to observe them:
Hex numbers
Number: https://qrng.anu.edu.au/ran_hex.php
Stream: https://qrng.anu.edu.au/RainHex.php
Binary numbers
Number: https://qrng.anu.edu.au/ran_bin.php
Stream: https://qrng.anu.edu.au/RainBin.php
It also includes references to the scientific explanation of the source of randomness, and practical demonstrations of it, even one specifically for dice.
From your code you can just retrieve the number URL mentioned above.
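For example, a minimal sketch in Python (how the number appears on that page is an assumption of mine; inspect the real response and adjust the parsing accordingly):

# Hypothetical sketch: fetch the single-number page and reduce the value to 0-9.
# This assumes the page body is (or contains only) the hex number itself; if it
# is wrapped in HTML, extract the number before converting it.
import requests

def fetch_qrng_digit() -> int:
    resp = requests.get("https://qrng.anu.edu.au/ran_hex.php", timeout=10)
    resp.raise_for_status()
    value = int(resp.text.strip(), 16)
    # Note: a plain "% 10" has a small modulo bias for short hex values; for a
    # strictly fair 0-9 outcome, reject values that fall in the biased tail.
    return value % 10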
Alternative if verifiability is important
A completely different approach: when the deadline falls, retrieve the homepage of an externally controlled, high-traffic, interactive site, such as the questions page of Stack Overflow. Store the page, take its MD5 or SHA-1 hash, and derive your roll from that.
You can then:
Show the page as it was at the snapshot time to verify it's working HTML
Point to its HTML source, which is full of timestamps, to verify authenticity and the time of retrieval to nearly the second
Let people verify the hash for themselves based on that
Guarantee randomness of the value, because it is computationally infeasible to work out what you would need to change on a site like SO to produce a chosen hash value
Any attempt to tamper with this system, such as Jeff deliberately re-serving an old page because he knows the MD5 hash it produces, is easily debunked by the real-time nature of the site: everyone would be able to see that the questions aren't recent to the time of the snapshot.
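A minimal sketch of that approach in Python (the URL, and the use of SHA-256 rather than MD5/SHA-1, are choices of mine):

# Hypothetical sketch: snapshot a high-traffic page at the deadline, keep the
# snapshot so anyone can recompute the hash, and derive a 0-9 roll from it.
import hashlib
import requests

def roll_from_snapshot(url: str = "https://stackoverflow.com/questions") -> tuple[int, str, bytes]:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    snapshot = resp.content                       # store this for later verification
    digest = hashlib.sha256(snapshot).hexdigest()
    roll = int(digest, 16) % 10                   # bias is negligible for a 256-bit hash
    return roll, digest, snapshot

Anyone holding the stored snapshot can recompute the same digest and roll, which is exactly what makes the outcome auditable.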
Related
Imagine that you are hosting a debate with n participants and you wish to split them in half in a completely random fashion.
One might do so by creating a list of participants, randomly shuffling that list, then forcing the first n/2 participants to debate as a team.
If this debate was particularly important, however, we would want to ensure that the teams we have created are provably random in a way that is publicly visible. We want to be able to show that the teams we have created are not the direct result of any human's decisions. Would there be a way to do this?
I believe this problem boils down to the issue of creating a seed for a random number generator that is based on the state of the world at a particular time, but I'm not sure. Is this a problem software engineers have tackled before, and is there an API out there for this?
What you may be thinking of is verifiable random numbers: random numbers generated using data that will be disclosed along with all the information needed to verify them. The most prominent use of verifiable random numbers in practice that I am aware of is the selection procedure for the Internet Engineering Task Force's Nominations Committee (NomCom for short). RFC 3797 describes this selection procedure, as well as how verifiable random selection works in general.
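As a rough illustration of the same idea (this is not the exact RFC 3797 algorithm, which hashes each selection index with MD5; it only shows the "commit, then derive deterministically from public data" pattern), here is a sketch for the debate-team question:

# Hypothetical sketch: publish the participant list and name a future public
# source of randomness (lottery numbers, closing stock prices, ...) in advance;
# once that value is known, anyone can re-run this and check the split.
import hashlib
import random

def split_teams(participants: list[str], public_randomness: str) -> tuple[list[str], list[str]]:
    ordered = sorted(participants)        # canonical order, fixed before the draw
    seed = int.from_bytes(hashlib.sha256(public_randomness.encode()).digest(), "big")
    rng = random.Random(seed)             # deterministic PRNG seeded from public data
    rng.shuffle(ordered)
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]

team_a, team_b = split_teams(["alice", "bob", "carol", "dave"],
                             "2024-06-01 lotto: 04 18 23 31 40 47")

(One reason RFC 3797 spells out its own hash-based procedure rather than relying on a library PRNG is that reproducibility must not depend on a particular implementation; the sketch above would need a pinned Python version to be strictly verifiable.)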
Another related technology is the verifiable delay function, which is a function that takes noticeable time to compute (for example, to hash publicly disclosed data to a random-looking number) but for which it's easy to verify whether the output is correct. This is described in two works, among others:
Lenstra, A. K., Wesolowski, B. "A random zoo: sloth, unicorn, and trx", 2015 (published before the term was coined).
Boneh, D., Bonneau, J., et al. "Verifiable Delay Functions", 2018.
BCrypt hashes all begin with a similar-looking prefix.
Let's say, for example, that we see $2a$10 at the beginning of our hash. Every BCrypt hash starts with something similar to this:
$ is a separator
2a is the version, in this case
10 is the cost factor; the number of iterations is 2 to the power of 10
My question is - why is this information in the hash?
There is no "dehashing" algorithm that might need this information in particular. When people log in, the same hash is generated using the same version and the same number of iterations, and the result is compared to what is stored in the database. That means the algorithm has no built-in comparison function that takes the stored hash and, based on this information (version and iterations), hashes the password to make the comparison.
Then why is this information given away? Who uses it?
My guess is that it's there so that if the version or the number of iterations changes, our program (or whatever) will know - but why? The algorithm only has to be configured once, and if changes are required, it's the company's job to make the appropriate arrangements so that it knows what version was used before and what is used now. Why is it the hash's job to remember the version and number of iterations?
Hashes get leaked every week or so, and with this information someone can easily set up their own BCrypt and run it with the same version and iteration count. But if this information weren't visible in the hash and the hash got public, how would anyone set up their own BCrypt instance and start comparing against it?
Isn't it safer not to provide this information, so that if the hash alone gets leaked, nobody knows what configuration was used to produce it?
It makes bcrypt forward and backwards compatible.
For example, bcrypt hashes do not all start with $2a$.
They can start with any of:
$2$
$2a$
$2x$
$2y$
$2b$
You need to know which version of the hash you're reading, so you can handle it correctly.
Also, you need to know the number of iterations.
not every hash will use 10
not every hash will use the same cost
Why store the version and iterations? Because you have to.
Also, it's an excellent design. In the past, people used to just store a bare hash, and it was awful:
people used either no salt, or the same salt every time, because storing it was too hard
people used 1 iteration, or a hard-coded number of iterations, because storing it was too hard
BCrypt, SCrypt, and Argon2 use the extraordinarily clever idea of doing all that grunt-work for you, leaving you with only having to call one function, without weakening the security of the system in any way.
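For illustration, a hedged sketch of how those self-describing parts can be pulled back out of the one stored string (the example hash value below is fabricated):

# Hypothetical sketch: split a bcrypt-style string into its fields.
def parse_bcrypt(stored: str) -> dict:
    _, version, cost, salt_and_digest = stored.split("$")
    return {
        "version": version,               # e.g. 2a, 2b, 2y
        "cost": int(cost),                # work factor: 2**cost iterations
        "salt": salt_and_digest[:22],     # first 22 base64 characters are the salt
        "digest": salt_and_digest[22:],   # remaining 31 characters are the hash itself
    }

print(parse_bcrypt("$2b$12$abcdefghijklmnopqrstuvABCDEFGHIJKLMNOPQRSTUVWXYZ01234"))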
you're trying to preach security by obscurity. Stop that, it doesn't work.
not having details in the data is not unusual; in this old hack it was the hackers who mentioned it was SHA-1. This is easy - the attackers, and researchers too, will take the list of data that was leaked and simply try all kinds of common algorithms and iteration/work-factor counts with a small list of common passwords, like the phpBB list from SkullSecurity; when they find the inevitable cracked terrible passwords, they'll know they found the algorithm and break out the full-scale cracking.
having the algorithm stored means you can transition from old to new gradually, and upgrade individual users as they come in, AND have multiple variants in use at once - including transitional types
transitional: you were on salted SHA-1 (BAD), moving to PBKDF2-HMAC-SHA-512 with 200,000 iterations (good), in the middle you actually bulk convert to PBKDF2-HMAC-SHA-512(SHA-1(salted password)), but at each user's login, move them to pure PBKDF2-HMAC-SHA-512(password).
having the iteration count means, like the transitional above, you can increase it over time and have different counts for different users set as they log in.
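A sketch of that upgrade-at-login pattern, assuming the Python bcrypt package and a target cost of 12 (both assumptions; the same idea works with any library that exposes the stored cost):

# Hypothetical sketch: because the cost lives inside the stored hash, it can be
# raised gradually - each user is re-hashed at the new cost on their next
# successful login, and the returned hash is written back to the database.
import bcrypt

TARGET_COST = 12    # desired work factor going forward (assumed value)

def verify_and_upgrade(password: bytes, stored: bytes) -> tuple[bool, bytes]:
    if not bcrypt.checkpw(password, stored):
        return False, stored
    stored_cost = int(stored.split(b"$")[2])       # the cost is right there in the hash
    if stored_cost < TARGET_COST:
        stored = bcrypt.hashpw(password, bcrypt.gensalt(rounds=TARGET_COST))
    return True, stored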
I came across this question during an interview.
Let's say we have log information for users visiting websites, where each entry includes the website, the user, and the time. We need to design a data structure from which we can get:
Top five visiting users of a specific website
Top five websites visited by a specific user
Websites that were visited only 100 times in a day
All in real time
My first thought was that we could just use a database to store the log, and every time we need the information we would do the counting and sorting for each user or each website. But that's not real time, since we would need to do a lot of computation to get the information.
Then I thought we could use a HashMap for each problem. For example, for each website we could use HashMap<Website, TreeMap<User, Count>>, so that we can get the top five visitors of a specific website. But the interviewer said we can only use one data structure for all three problems, whereas the second problem would use HashMap<User, TreeMap<Website, Count>>, which has different key and value types.
Can anyone think of a good solution for this problem?
A map of maps, with generic types, as a basic approach.
The first map represents the global data structure, which will contain the maps for the three problems.
In the first inner map, you'll have the website as the key and a list of its top 5 users as the value.
In the second inner map, you'll have the user as the key and a list of the top 5 websites visited by that user as the value.
For the last problem, the third inner map can have the website as the key and the number of visits as the value.
If what they meant was to use the same data structure for three different problems, then forget about the global map.
If you want to go a little deeper, you can consider an adjacency-matrix implementation where your user and website identifiers are the row/column identifiers and the values are the numbers of visits.
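A hedged sketch of the map-of-maps idea in Python (the nested-counter shape is my own rendering of it): one structure, updated once per log entry, answers all three queries.

# Hypothetical sketch: three nested counters kept in sync on every log record.
from collections import Counter, defaultdict

by_site = defaultdict(Counter)   # website -> Counter(user -> visits)
by_user = defaultdict(Counter)   # user    -> Counter(website -> visits)
by_day  = defaultdict(Counter)   # day     -> Counter(website -> visits)

def record(website: str, user: str, day: str) -> None:
    by_site[website][user] += 1
    by_user[user][website] += 1
    by_day[day][website] += 1

def top_users(website: str, n: int = 5):
    return by_site[website].most_common(n)       # top five visitors of a website

def top_sites(user: str, n: int = 5):
    return by_user[user].most_common(n)          # top five websites of a user

def sites_visited_100_times(day: str):
    return [site for site, count in by_day[day].items() if count == 100]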
I am designing a website for an experiment. There will be a button which the user must click and hold for a while, then release; the client then submits an AJAX event to the server.
However, to prevent auto-click bots and fast spam, I want the hold time to be real and not skippable, e.g. enforced by doing some calculation. The point is to spend actual CPU time, so that you can't simply guess the AJAX callback value or run the system clock faster to bypass it.
Is there any algorithm that
is fast & easy for the server to generate as a challenge,
costs real time to execute on the client side, with no way to spoof or shortcut it, and
is easy & fast for the server to verify?
You're looking for a Proof-of-work system.
The most popular algorithm seems to be Hashcash (also on Wikipedia), which is used for bitcoins, among other things. The basic idea is to ask the client program to find a hash with a certain number of leading zeroes, which is a problem they have to solve with brute force.
Basically, it works like this: the client has some sort of token. For email, this is usually the recipient's email address and today's date. So it could look like this:
bob@example.com:04102011
The client now has to find a random string to put in front of this:
asdlkfjasdlbob@example.com:04102011
such that the hash of this has a bunch of leading 0s. (My example won't work because I just made the string up.)
Then, on your side, you just have to take this random input and run a single hash on it, to check if it starts with a bunch of 0s. This is a very fast operation.
The reason the client has to spend a fair amount of CPU time on finding the right hash is that it is a brute-force problem. The only known way to do it is to choose a random string, test it, and if it doesn't work, choose another one.
Of course, since you're not doing emails, you will probably want to use a different token of some sort rather than an email address and date. However, in your case, this is easy: you can just make up a random string server-side and pass it to the client.
One advantage of this particular algorithm is that it's very easy to adjust the difficulty: just change how many leading zeroes you want. The more zeroes you require, the longer it will take the client; however, verification still takes the same amount of time on your end.
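A minimal sketch of that flow in Python (SHA-256 and a 20-bit difficulty are choices of mine; Hashcash proper uses SHA-1 and a specific header format):

# Hypothetical sketch: the server hands out a random token, the client brute-
# forces a nonce whose hash has enough leading zero bits, and the server checks
# the claim with a single hash.
import hashlib
import secrets

DIFFICULTY_BITS = 20                     # more bits => more client CPU time

def make_challenge() -> str:
    return secrets.token_hex(16)         # server side: fast to generate

def leading_zero_bits(digest: bytes) -> int:
    return 256 - int.from_bytes(digest, "big").bit_length()

def solve(token: str) -> int:
    nonce = 0
    while True:                          # client side: brute force
        digest = hashlib.sha256(f"{nonce}:{token}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(token: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{nonce}:{token}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS   # server side: one hash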
Came back to answer my own question. This is called a Verifiable Delay Function (VDF).
The concept was first formalized in 2018 by Boneh et al., who proposed several candidate constructions for verifiable delay functions; it is an important tool for adding time delays in decentralized applications. To be exact, a verifiable delay function is a function f: X → Y that takes a prescribed wall-clock time to compute, even on a parallel processor, and outputs a unique result that can be verified efficiently. In short, evaluating f requires a specified number of sequential steps, even on a large number of parallel processors.
https://www.mdpi.com/1424-8220/22/19/7524
The idea of a VDF is a step forward from @TikhonJelvis's PoW answer because apparently it "takes a prescribed wall-clock time to compute, even on a parallel processor".
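For intuition only, here is a heavily simplified sketch of the sloth construction from the Lenstra-Wesolowski paper: repeated modular square roots are slow and inherently sequential to compute, but the chain is quick to check by squaring it back. The prime and step count below are illustrative placeholders, not production parameters.

# Hypothetical sketch of a sloth-style delay function.
p = 2**127 - 1            # a prime with p % 4 == 3 (illustrative only)
T = 100_000               # number of sequential steps; tune for the desired delay

def _sqrt_step(x: int) -> int:
    # Negate x if it is not a quadratic residue, then take a modular square root.
    if pow(x, (p - 1) // 2, p) != 1:
        x = p - x
    return pow(x, (p + 1) // 4, p)       # one slow step: a full modular exponentiation

def evaluate(seed: int) -> int:
    # Slow direction: T sequential square roots starting from the seed.
    x = seed % p
    for _ in range(T):
        x = _sqrt_step(x)
    return x

def verify(seed: int, output: int) -> bool:
    # Fast direction: T squarings walk the chain back to (plus or minus) the seed.
    x = output
    for _ in range(T):
        x = pow(x, 2, p)                 # one fast step: a single modular squaring
    return x in (seed % p, (p - seed) % p)

Each forward step costs a full exponentiation while each verification step is a single squaring, so checking is a couple of orders of magnitude cheaper than evaluating; real VDF constructions push that gap much further and keep verification cheap regardless of T.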
I have a project that needs to do validation on the frontend for an American Social Security Number (format ddd-dd-dddd). One suggestion would be to use a hash algorithm, but given the tiny input space (nine digits of [0-9]), this would be disastrous: the full hash could simply be brute-forced back to the SSN. It would be acceptable to validate with some high probability that a number is correct and allow the backend to do a final == check, but I need to do far better than "has nine digits", etc.
In my search for better alternatives, I came upon the validation checksums for ISBN numbers and UPC. These look like a great alternative with a high probability of success on the frontend.
Given those constraints, I have three questions:
Is there a way to prove that an algorithm like ISBN13 will work with a different category of data like SSN, or whether it is more or less fit to the purpose from a security perspective? The checksum seems reasonable for my quite large sample of one real SSN, but I'd hate to find out that they aren't generally applicable for some reason.
Is this a solved problem somewhere, so that I can simply use a pre-existing validation scheme to take care of the problem?
Are there any such algorithms that would also easily accommodate validating the last 4 digits of an SSN without giving up too much extra information?
Thanks as always,
Joe
UPDATE:
In response to a question below, a little more detail. I have the customer's SSN as previously entered, stored securely on the backend of the app. What I need to do is verification (to the maximum extent possible) that the customer has entered that same value again on this page. The issue is that I need to prevent the information from being incidentally revealed to the frontend in case some non-authorized person is able to access the page.
That is why an MD5/SHA1 hash is inappropriate: namely that it can be used to derive the complete SSN without much difficulty. A checksum (say, modulo 11) provides nearly no information to the frontend while still allowing a high degree of accuracy for the field validation. However, as stated above I have concerns over its general applicability.
Wikipedia is not the best source for this kind of thing, but given that caveat, http://en.wikipedia.org/wiki/Social_Security_number says
Unlike many similar numbers, no check digit is included.
But before that it mentions some widely used filters:
The SSA publishes the last group number used for each area number. Since group numbers are allocated in a regular (if unusual) pattern, it is possible to identify an unissued SSN that contains an invalid group number. Despite these measures, many fraudulent SSNs cannot easily be detected using only publicly available information. In order to do so there are many online services that provide SSN validation.
Restating your basic requirements:
A reasonably strong checksum to protect against simple human errors.
"Expected" checksum is sent from server -> client, allowing client-side validation.
Checksum must not reveal too much information about SSN, so as to minimize leakage of sensitive information.
I might propose using a cryptographic hash (SHA-1, etc.), but do not send the complete hash value to the client. For example, send only the lowest 4 bits of the 160-bit hash result[1]. By sending 4 bits of checksum, your chance of detecting a data-entry error is 15/16, meaning that you'll catch mistakes about 94% of the time. The flip side, though, is that you have "leaked" enough info to reduce their SSN to 1/16 of the search space. It's up to you to decide if the convenience of client-side validation is worth this leakage.
By tuning the number of "checksum" bits sent, you can adjust between convenience to the user (i.e. detecting mistakes) and information leakage.
Finally, given your requirements, I suspect this convenience / leakage tradeoff is an inherent problem: Certainly, you could use a more sophisticated crypto challenge / response algorithm (as Nick ODell astutely suggests). However, doing so would require a separate round-trip request-- something you said you were trying to avoid in the first place.
[1] In a good crypto hash function, all output digits are well randomized due to avalanche effect, so the specific digits you choose don't particularly matter-- they're all effectively random.
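A hedged sketch of that scheme (SHA-1 and 4 check bits, per the discussion above; whether and how to salt the value is left open):

# Hypothetical sketch: the server derives a few check bits from the stored SSN
# and sends only those to the page; the client recomputes them from the user's
# input to catch typos, and the backend still does the authoritative check.
import hashlib

CHECK_BITS = 4                                    # 4 bits => ~94% of typos caught

def ssn_check_bits(ssn: str) -> int:
    digits = ssn.replace("-", "")
    digest = hashlib.sha1(digits.encode()).digest()
    return digest[-1] & ((1 << CHECK_BITS) - 1)   # lowest 4 bits of the 160-bit hash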
Simple solution: take the number mod 100001 as your checksum. There is about a 1-in-100,000 chance that you'll accidentally get the checksum right with the wrong number (and it is very resistant to one- or two-digit mistakes canceling out), and around 10,000 possible SSNs map to each checksum, so you have not revealed the SSN to an attacker.
The only drawback is that the 10,000 possible other SSNs are easy to figure out. If the person can get the last 4 digits of the SSN from elsewhere, then they can probably figure out the full SSN. If you are concerned about this, then you should take the user's SSN, add a salt, and hash it, deliberately using an expensive hash algorithm. (You can just iterate a cheaper algorithm, like MD5, a fixed number of times to increase the cost.) Then use only a certain number of bits. The point here is that while someone can certainly go through all one billion possible SSNs to come up with a limited list of possibilities, it will cost them more to do so - hopefully enough that they don't bother.
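A sketch of both variants described above (the salt handling, iteration count, and number of output bits are assumptions):

# Hypothetical sketch: the plain modular checksum, plus the deliberately slow
# salted-and-iterated variant for when the last four digits might leak elsewhere.
import hashlib

def mod_checksum(ssn: str) -> int:
    return int(ssn.replace("-", "")) % 100001           # ~10,000 SSNs share each value

def slow_checksum(ssn: str, salt: bytes, rounds: int = 100_000, bits: int = 16) -> int:
    digest = hashlib.md5(salt + ssn.replace("-", "").encode()).digest()
    for _ in range(rounds - 1):                          # iterate to make brute force costly
        digest = hashlib.md5(digest).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)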