Optimal bcrypt work factor

What would be an ideal bcrypt work factor for password hashing?
If I use a factor of 10, it takes approximately 0.1 s to hash a password on my laptop. If we end up with a very busy site, that turns into a good deal of work just checking people's passwords.
Perhaps it would be better to use a work factor of 7, reducing the total password-hashing work to about 0.01 s per login on that laptop?
How do you decide the tradeoff between brute-force safety and operational cost?

Remember that the work factor is stored in the password hash itself: $2a$(2 chars work)$(22 chars salt)(31 chars hash). It is not a fixed value.
If you find the load is too high, just make it so that the next time they log in, you hash to something faster to compute. Similarly, as time goes on and you get better servers, if load isn't an issue, you can upgrade the strength of their hash when they log in.
The trick is to keep the hash taking roughly the same wall-clock time forever into the future, tracking Moore's law. The number is a base-2 logarithm, so every time computers double in speed, add 1 to the default number.
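As a concrete illustration of that rehash-on-login idea, here is a minimal sketch in Go, assuming the golang.org/x/crypto/bcrypt package (the target cost of 12 and the function name verifyAndUpgrade are just illustrative):

package auth

import "golang.org/x/crypto/bcrypt"

const targetCost = 12 // illustrative; pick whatever your hardware and load allow

// verifyAndUpgrade checks a login attempt and, on success, returns the hash
// to keep in the database: either the stored one, or a fresh hash at the
// current target cost if the stored cost differs.
func verifyAndUpgrade(storedHash, password []byte) ([]byte, error) {
    if err := bcrypt.CompareHashAndPassword(storedHash, password); err != nil {
        return nil, err // wrong password (or unreadable hash)
    }
    cost, err := bcrypt.Cost(storedHash) // the cost is embedded in the hash itself
    if err != nil {
        return nil, err
    }
    if cost == targetCost {
        return storedHash, nil // nothing to change
    }
    // We have the plaintext right now, so re-hash at the target cost
    // (lower it if load is too high, raise it as hardware gets faster).
    return bcrypt.GenerateFromPassword(password, targetCost)
}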
Decide how long you want it to take to brute-force a user's password. If it's a common dictionary word, for instance, your account-creation flow probably already warned them their password was weak. If it's one of 1,000 common words, say, and it takes an attacker 0.1 s to test each, that buys them 100 seconds (well, some words are more common...). If a user chose "common dictionary word" + 2 digits, that's over two hours. If your password database is compromised, and the attacker can only crack a few hundred passwords a day, you've bought most of your users hours or days to safely change their passwords. It's a matter of buying them time.
http://www.postgresql.org/docs/8.3/static/pgcrypto.html has some password-cracking times for you to consider. Of course, the passwords listed there are random letters; dictionary words fall much faster. Practically speaking, you can't save the guy whose password is 12345.

Short Version
The number of iterations that gives at least 250 ms to compute
Long Version
When BCrypt was first published, in 1999, they listed their implementation's default cost factors:
normal user: 6
super user: 8
A bcrypt cost of 6 means 64 rounds (2^6 = 64).
They also note:
Of course, whatever cost people choose should be reevaluated
from time to time
At the time of deployment in 1976, crypt could hash fewer than 4 passwords per second. (250 ms per password)
In 1977, on a VAX-11/780, crypt could be evaluated about 3.6 times per second. (277 ms per password)
That gives you a flavor of the kind of delays that the original implementers were considering when they wrote it:
~250 ms for normal users
~1 second for super users.
But, of course, the longer you can stand, the better. Every BCrypt implementation I've seen used 10 as the default cost, and my implementation used that too. I believe it is time for me to increase the default cost to 12.
We've decided we want to target no less than 250ms per hash.
My desktop PC is an Intel Core i7-2700K CPU @ 3.50 GHz. I originally benchmarked a BCrypt implementation on 1/23/2014:
1/23/2014 Intel Core i7-2700K CPU @ 3.50 GHz
| Cost | Iterations        | Duration   | Note                             |
|------|-------------------|------------|----------------------------------|
| 8    | 256 iterations    | 38.2 ms    | minimum allowed by BCrypt        |
| 9    | 512 iterations    | 74.8 ms    |                                  |
| 10   | 1,024 iterations  | 152.4 ms   | current default (BCRYPT_COST=10) |
| 11   | 2,048 iterations  | 296.6 ms   |                                  |
| 12   | 4,096 iterations  | 594.3 ms   |                                  |
| 13   | 8,192 iterations  | 1,169.5 ms |                                  |
| 14   | 16,384 iterations | 2,338.8 ms |                                  |
| 15   | 32,768 iterations | 4,656.0 ms |                                  |
| 16   | 65,536 iterations | 9,302.2 ms |                                  |
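If you want to reproduce a table like this on your own hardware, here is a rough benchmarking sketch in Go, assuming the golang.org/x/crypto/bcrypt package (the cost range and the test password are arbitrary, and the absolute numbers will of course differ from the table above):

package main

import (
    "fmt"
    "time"

    "golang.org/x/crypto/bcrypt"
)

func main() {
    password := []byte("microbenchmark")
    for cost := 8; cost <= 16; cost++ {
        start := time.Now()
        if _, err := bcrypt.GenerateFromPassword(password, cost); err != nil {
            panic(err)
        }
        // 1<<cost is the iteration count; the duration roughly doubles per step.
        fmt.Printf("cost %2d (%6d iterations): %v\n", cost, 1<<cost, time.Since(start))
    }
}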
Future Proofing
Rather than having a fixed constant, it should be a fixed minimum.
Rather than having your password hash function be:
String HashPassword(String password)
{
    return BCrypt.HashPassword(password, BCRYPT_DEFAULT_COST);
}
it should be something like:
String HashPassword(String password)
{
    /*
      Rather than using a fixed default cost, run a micro-benchmark
      to figure out how fast the CPU is.
      Use that to make sure that it takes **at least** 250 ms to calculate
      the hash.
    */
    Int32 costFactor = this.CalculateIdealCost();

    // Never use a cost lower than the default hard-coded cost
    if (costFactor < BCRYPT_DEFAULT_COST)
        costFactor = BCRYPT_DEFAULT_COST;

    return BCrypt.HashPassword(password, costFactor);
}
Int32 CalculateIdealCost()
{
    // Benchmark using a cost of 5 (the second-lowest allowed)
    Int32 cost = 5;

    var sw = new Stopwatch();
    sw.Start();
    BCrypt.HashPassword("microbenchmark", cost);
    sw.Stop();

    Double durationMS = sw.Elapsed.TotalMilliseconds;

    // Increasing the cost by 1 doubles the run time.
    // Keep increasing the cost until the estimated duration is over 250 ms.
    while (durationMS < 250)
    {
        cost += 1;
        durationMS *= 2;
    }
    return cost;
}
And ideally this would be part of everyone's BCrypt library, so rather than relying on users of the library to periodically increase the cost, the cost periodically increases itself.
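For reference, a rough Go translation of the same calibration idea (again assuming golang.org/x/crypto/bcrypt; the 250 ms target and the cost-5 micro-benchmark mirror the pseudocode above):

package main

import (
    "fmt"
    "time"

    "golang.org/x/crypto/bcrypt"
)

// idealCost benchmarks bcrypt at cost 5 and extrapolates upward, doubling the
// estimated duration for each +1, until the estimate reaches 250 ms. It never
// returns less than the supplied floor.
func idealCost(floor int) int {
    start := time.Now()
    _, _ = bcrypt.GenerateFromPassword([]byte("microbenchmark"), 5)
    duration := time.Since(start)

    cost := 5
    for duration < 250*time.Millisecond {
        cost++        // each +1 doubles the work...
        duration *= 2 // ...so double the estimated duration too
    }
    if cost < floor {
        cost = floor
    }
    return cost
}

func main() {
    cost := idealCost(bcrypt.DefaultCost) // never go below the library default
    hash, err := bcrypt.GenerateFromPassword([]byte("correct horse battery staple"), cost)
    if err != nil {
        panic(err)
    }
    fmt.Printf("cost %d: %s\n", cost, hash)
}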

The question asked about the optimal and practical determination of the cost factor for bcrypt password hashes.
On a system where you're calculating user password hashes for a server whose user population you expect to grow over time, why not make the length of time a user has had an account with your service the determining factor, perhaps including their login frequency as part of that determination?
Bcrypt cost factor = 6 + (number of years of user membership, or some factor thereof), with an optional ceiling on the total cost, perhaps modified in some way by that user's login frequency.
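A minimal sketch of that policy (the base of 6, the ceiling, and the login-frequency adjustment are all arbitrary knobs, and treating rarely-used accounts as able to afford a slower hash is just one possible reading of the login-frequency modifier):

// costForUser is a hypothetical policy: start at a base cost, add one per year
// of membership, and cap the result at a ceiling. Accounts that log in rarely
// pay the hashing price rarely, so they get a small extra bump.
func costForUser(yearsOfMembership int, loginsPerMonth float64) int {
    const base, ceiling = 6, 14
    cost := base + yearsOfMembership
    if loginsPerMonth < 1 {
        cost++
    }
    if cost > ceiling {
        cost = ceiling
    }
    return cost
}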
Keep in mind, though, that with such a system, or ANY system for determining the cost factor, you must consider that the cost of calculating the hash could be used as a DDoS vector against the server itself if an attacker can drive the cost factor upward through whatever inputs feed it.

Related

What is the "spool time" of Intel Turbo Boost?

Just like a turbo engine has "turbo lag" due to the time it takes for the turbo to spool up, I'm curious what is the "turbo lag" in Intel processors.
For instance, the i9-8950HK in my MacBook Pro 15" 2018 (running macOS Catalina 10.15.7) usually sits around 1.3 GHz when idle, but when I run a CPU-intensive program, the CPU frequency shoots up to, say, 4.3 GHz or so (initially). The question is: how long does it take to go from 1.3 to 4.3 GHz? 1 microsecond? 1 millisecond? 100 milliseconds?
I'm not even sure this is up to the hardware or the operating system.
This is in the context of benchmarking some CPU-intensive code which takes a few tens of milliseconds to run. The thing is, right before this piece of CPU-intensive code is run, the CPU is essentially idle (and thus the clock speed will drop down to, say, 1.3 GHz). I'm wondering what slice of my benchmark is running at 1.3 GHz and what is running at 4.3 GHz: 1%/99%? 10%/90%? 50%/50%? Or even worse?
Depending on the answer, I'm thinking it would make sense to run some CPU-intensive code prior to starting the benchmark as a way to "spool up" TurboBoost. And this leads to another question: for how long should I run this "spooling-up" code? Probably one second is enough, but what if I'm trying to minimize this -- what's a safe amount of time for "spooling-up" code to run, to make sure the CPU will run the main code at the maximum frequency from the very first instruction executed?
The paper Evaluation of CPU frequency transition latency presents transition latencies of various Intel processors. In brief, the latency depends on the state the core is currently in and on the target state. For the evaluated Ivy Bridge processor (i7-3770 @ 3.4 GHz), the latencies varied from 23 microseconds (1.6 GHz -> 1.7 GHz) to 52 microseconds (2.0 GHz -> 3.4 GHz).
At the Hot Chips 2020 conference, a major transition-latency improvement in the upcoming Ice Lake processors was presented, which should mostly benefit partially vectorised code that uses AVX-512 instructions. While these instructions do not support frequencies as high as SSE or AVX2 instructions, an island of such instructions causes the processor frequency to scale down and then back up again.
Pre-heating the processor obviously makes sense, as does "pre-heating" memory. One second of a prior workload is enough to reach the highest available turbo frequency, but you should also take into account the temperature of the processor, which may scale the frequency down (on the latest Intel processors this applies to both the CPU core and uncore frequencies). You cannot reach the temperature limit in a second. It also depends on what you want your benchmark to measure and whether you want to take the temperature limit into account. Be aware, too, that your processor has a power limit, which is another possible reason for the frequency to scale down during the application run.
Another thing you should take into account when benchmarking your code is that its runtime is very short, so be aware of the reliability of runtime/resource-consumption measurements at that scale. I would suggest artificially extending the runtime (run the code 10 times and measure the overall consumption) for better results.
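A minimal sketch of that suggestion in Go (workload is a placeholder for the code under test, and the factor of 10 is arbitrary):

package main

import (
    "fmt"
    "time"
)

// workload stands in for the code under test.
func workload() {
    s := 0
    for i := 0; i < 1000000; i++ {
        s += i
    }
    _ = s
}

func main() {
    const runs = 10
    start := time.Now()
    for i := 0; i < runs; i++ {
        workload()
    }
    // Averaging over several runs makes clock ramp-up and timer resolution
    // contribute proportionally less to the measured time.
    fmt.Printf("average per run: %v\n", time.Since(start)/runs)
}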
I wrote some code to check this, with the aid of the Intel Power Gadget API. It sleeps for one second (so the CPU goes back to its slowest speed), measures the clock speed, runs some code for a given amount of time, then measures the clock speed again.
I only tried this on my 2018 15" MacBook Pro (i9-8950HK CPU) running macOS Catalina 10.15.7. The specific CPU-intensive code being run between clock speed measurements may also influence the result (is it integer only? FP? SSE? AVX? AVX-512?), so don't take these as exact numbers, but only order-of-magnitude/ballpark figures. I have no idea how the results translate into different hardware/OS/code combinations.
The minimum clock speed when idle in my configuration is 1.3 GHz. Here are the results I obtained, in tabular form.
+--------+-------------+
| T (ms) | Final clock |
| | speed (GHz) |
+--------+-------------+
| <1 | 1.3 |
| 1..3 | 2.0 |
| 4..7 | 2.5 |
| 8..10 | 2.9 |
| 10..20 | 3.0 |
| 25 | 3.0-3.1 |
| 35 | 3.3-3.5 |
| 45 | 3.5-3.7 |
| 55 | 4.0-4.2 |
| 66 | 4.6-4.7 |
+--------+-------------+
So 1 ms appears to be the minimum amount of time needed to get any kind of change. 10 ms gets the CPU to its nominal frequency, and the ramp is slower from then on: it apparently takes over 50 ms to reach the maximum turbo frequencies.
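Based on those numbers, a simple way to "spool up" before the real measurement is to burn CPU for on the order of 100 ms first. A rough sketch in Go (the 100 ms figure comes from the table above and is specific to that machine):

package main

import (
    "fmt"
    "time"
)

// warmUp spins on the CPU for roughly the given duration so the core has time
// to reach its turbo frequency before the real benchmark starts.
func warmUp(d time.Duration) {
    deadline := time.Now().Add(d)
    x := 1.0
    for time.Now().Before(deadline) {
        x = x*1.0000001 + 1 // meaningless FP work just to keep the core busy
    }
    _ = x
}

func main() {
    warmUp(100 * time.Millisecond)

    start := time.Now()
    // ... the code being benchmarked goes here ...
    fmt.Println("benchmark took", time.Since(start))
}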

How do I correctly adjust the Argon2 parameters in Go to consume less memory?

Argon2 by design is memory hungry. In the semi-official Go implementation, the following parameters are recommended when using IDKey:
key := argon2.IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32)
where 1 is the time parameter and 64*1024 is the memory parameter. This means the library will create a 64 MB buffer when hashing a value. In scenarios where many hashing procedures might run at the same time, this creates high pressure on the host's memory.
In cases where this is too much memory consumption, the advice is to decrease the memory parameter and increase the time parameter:
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
So, assuming I would like to limit memory consumption to 16MB (1/4 of the recommended 64MB), it is still unclear to me how I should be adjusting the time parameter: is this supposed to be times 4 so that the product of memory and time stays the same? Or is there some other logic behind the correlation of time and memory at play?
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
I think the key here is the word "to compensate", so in this context it is trying to say: to achieve similar hashing complexity as IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32), you can try IDKey([]byte("some password"), salt, 4, 16*1024, 4, 32).
But if you want to decrease the complexity of the hashing (and reduce the performance overhead), you can decrease the memory uint32 parameter without regard to the time parameter.
is this supposed to be times 4 so that the product of memory and time stays the same?
I don't think so. I believe the memory here means the length of the resulting hash, while the time parameter could mean "how many times the hashing result needs to be re-hashed until I get the end result".
So these two parameters are independent of each other. They just control how much "brute-force cost savings due to time-memory tradeoffs" you want to achieve.
Difficulty is roughly equal to time_cost * memory_cost (and possibly divided by parallelism). So if you reduce the memory cost to 0.25x, you should multiply the time cost by 4. See also this answer.
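To make that concrete, here is a minimal sketch in Go using golang.org/x/crypto/argon2, showing the recommended parameters next to a quarter-memory variant with the time parameter scaled by 4 (the password and salt handling are illustrative only):

package main

import (
    "crypto/rand"

    "golang.org/x/crypto/argon2"
)

func main() {
    salt := make([]byte, 16)
    if _, err := rand.Read(salt); err != nil {
        panic(err)
    }

    // Reference parameters from the docs: time=1, memory=64 MiB, 4 threads, 32-byte key.
    _ = argon2.IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32)

    // Roughly comparable difficulty with a quarter of the memory: scale time by 4
    // so that time * memory stays approximately constant (1 * 64 MiB ~= 4 * 16 MiB).
    _ = argon2.IDKey([]byte("some password"), salt, 4, 16*1024, 4, 32)
}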
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB.
Check out the Argon2 API itself. I'm going to cross-reference a little bit and use the argon2-cffi documentation. It looks like the Go interface uses the C FFI (foreign function interface) under the hood, so the prototype should be the same.
Parameters
time_cost (int) – Defines the amount of computation realized and therefore the execution time, given in number of iterations.
memory_cost (int) – Defines the memory usage, given in kibibytes.
parallelism (int) – Defines the number of parallel threads (changes the resulting hash value).
hash_len (int) – Length of the hash in bytes.
salt_len (int) – Length of random salt to be generated for each password.
encoding (str) – The Argon2 C library expects bytes. So if hash() or verify() are passed a Unicode string, it will be encoded using this encoding.
type (Type) – Argon2 type to use. Only change for interoperability with legacy systems.
Indeed, if we look at the Go docs:
// The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number.
// If using that amount of memory (64 MB) is not possible in some contexts then
// the time parameter can be increased to compensate.
//
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB. For example
// memory=64*1024 sets the memory cost to ~64 MB. The number of threads can be
// adjusted to the numbers of available CPUs. The cost parameters should be
// increased as memory latency and CPU parallelism increases. Remember to get a
// good random salt.
I'm not 100% clear on the impact of the thread count, but I believe it parallelizes the hashing, and like any multithreaded job this reduces the total time taken to roughly 1/N for N cores. Apparently, you should essentially set the parallelism to the CPU count.
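Continuing the sketch above (with "runtime" added to the imports and the salt generated the same way), sizing the thread count from the CPU count might look like this; note that, per the parameter list quoted earlier, parallelism changes the resulting hash, so the same value has to be used when verifying later:

    // Parallelism sized to the machine. Because this value changes the hash,
    // store it alongside the hash (or fix it) so verification uses the same value.
    threads := uint8(runtime.NumCPU())
    _ = argon2.IDKey([]byte("some password"), salt, 1, 64*1024, threads, 32)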

How long does it take to sort 2.5 million numbers using HeapSort?

I have 2.5 million entries/numbers that I am sorting with HeapSort by inserting them into a heap, but it is taking forever. I know heapsort's running time is O(n log n), but in real life, on a basic computer, how much time are we talking about here? I have 8 GB of RAM on my Windows machine, but I have dual-booted Ubuntu, which I believe is configured to run with 1 GB of RAM.
It took less than 15 seconds for 15,000 numbers, so, proportionally speaking, will it take about 40 minutes?
For a rough estimate, assuming no additional memory-related overhead when scaling from 15k to 2.5 million, the runtime will be:
(2.5m * log(2.5m)) / (15k * log(15k)) * 15 seconds ≈ 64 minutes
I don't know about heap sort, and my eyes pop out when I see the original asker taking 15 seconds for 15,000 numbers (or the accepted answer that extrapolates from that), but the default C++ STL sort of a vector of 2.5M integers takes me < 100 ms, compiled with -O3 of course; without it, < 700 ms.
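If you want a quick sanity check on your own machine, here is a comparable measurement in Go (sort.Ints rather than the C++ STL sort mentioned above; the exact timing will differ, but it should still be a fraction of a second):

package main

import (
    "fmt"
    "math/rand"
    "sort"
    "time"
)

func main() {
    const n = 2500000
    nums := make([]int, n)
    for i := range nums {
        nums[i] = rand.Int()
    }

    start := time.Now()
    sort.Ints(nums) // O(n log n) comparison sort from the standard library
    fmt.Printf("sorted %d ints in %v\n", n, time.Since(start))
}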

How to compute the time of an in-place external merge sort?

The original problem is like this:
You are to sort 1 PB of integers ranging from -2^31 to 2^31 - 1 (int). You have 1024 machines, each with 1 TB of disk space and 16 GB of memory. Assume disk speed is 128 MB/s (read/write) and memory speed is 8 GB/s (read/write). CPU time can be ignored, and network transfer time can be ignored for simplicity. Compute the approximate time needed.
I know that with an external sort we can sort the 1 TB of data on a single machine in roughly 10 hours, computed like this:
Disk access (2 reads + 2 writes): 1 TB * 4 / (128 MB/s) = 2^15 s ≈ 9 hrs
Memory access:
sorting 2^48 integers in 64 parts (2^42 each) takes roughly 1.3 min per part, so about 1.4 hr in total.
The 63-way merge takes several seconds, and thus is ignored.
But what about the next step: combining the 1024 machines' 1 TB each? I have no idea how this is computed, so any help please?
2^31 is about 2 billion (2 "giga"), so you are looking at a lot of duplicate numbers and a fixed range. So consider radix sort ( http://en.wikipedia.org/wiki/Radix_sort ).
Each processor, for a subset of the data, creates a 'count' array (x[0] contains the count of 0s, etc.). Then you can merge all the results into one array, and later "construct" the sorted array from the counts.
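Here is a minimal single-machine sketch of that counting idea in Go (shown for a small value range; for the full 32-bit int range you would need 2^32 counters, or you would bucket by radix digits instead, and the per-machine count arrays can simply be added element-wise before rebuilding):

package main

import "fmt"

// countingSort sorts values known to lie in [0, maxValue] by counting
// occurrences and then rebuilding the output in order.
func countingSort(data []int, maxValue int) []int {
    counts := make([]int, maxValue+1)
    for _, v := range data {
        counts[v]++
    }
    out := make([]int, 0, len(data))
    for v, c := range counts {
        for i := 0; i < c; i++ {
            out = append(out, v)
        }
    }
    return out
}

func main() {
    fmt.Println(countingSort([]int{3, 1, 4, 1, 5, 9, 2, 6}, 9))
}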

FPGA timing question

I am new to FPGA programming and I have a question regarding the performance in terms of overall execution time.
I have read that latency is calculated in terms of cycle-time. Hence, overall execution time = latency * cycle time.
Since I want to optimize the time needed to process the data, I would be measuring the overall execution time.
Let's say I have a calculation a = b * c * d.
If I make it calculate in two cycles, (result1 = b * c) and (a = result1 * d), the overall execution time would be a latency of 2 * the cycle time (which is determined by the delay of one multiplication, say a value X) = 2X.
If I make the calculation in one cycle (a = b * c * d), the overall execution time would be a latency of 1 * the cycle time (now about 2X, since the critical path has twice the delay because of two multiplications instead of one) = 2X.
So it seems that, when optimizing performance in terms of execution time, if I focus only on decreasing the latency, the cycle time increases, and vice versa. Is there a case where both the latency and the cycle time could be decreased, causing the execution time to decrease? When should I focus on optimizing the latency and when on the cycle time?
Also, when I am programming in C++, it seems that when I want to optimize the code, I want to optimize the latency (the cycles needed for the execution). However, it seems that for FPGA programming, optimizing the latency alone is not adequate, as the cycle time would increase. Hence, I should focus on optimizing the execution time (latency * cycle time). Am I correct in this if I would like to increase the speed of the program?
Hope that someone would help me with this. Thanks in advance.
I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.
With your example, processing 10 items doing a = b x c x d in one cycle (one cycle = 2t) would take 20t. However, doing it in two 1t cycles (pipelined), processing 10 items would take 11t.
Hope that helps.
Edit: added timing.
Calculation in one 2t cycle. 10 calculations.
Time 0 2 2 2 2 2 2 2 2 2 2 = 20t
Input 1 2 3 4 5 6 7 8 9 10
Output 1 2 3 4 5 6 7 8 9 10
Calculation in two 1t cycles, pipelined, 10 calculations
Time 0 1 1 1 1 1 1 1 1 1 1 1 = 11t
Input 1 2 3 4 5 6 7 8 9 10
Stage1 1 2 3 4 5 6 7 8 9 10
Output 1 2 3 4 5 6 7 8 9 10
The latency of both solutions is 2t: one 2t cycle for the first, and two 1t cycles for the second. However, the throughput of the second solution is twice as high: once the latency is accounted for, you get a new answer every 1t cycle.
So if you had a complex calculation that required, say, five 1t cycles, then the latency would be 5t, but the throughput would still be one result per 1t cycle.
You need another word in addition to latency and cycle-time, which is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get it out every cycle, your throughput can be increased by 2x over the "do it all in one cycle".
Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.
If you pipeline it (which is the technical term for splitting up the calculation into multiple cycles) you can do it in 2 lots of 20 ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but makes the sums easy). So now it takes 2 x 20 + 10 = 50 ns => 20M items/sec. Worse!
But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier) you can push new data into the pipeline every 25 + a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in every 35 ns, or nearly 30M items/sec, which is better than you started with.
In real life the 10 ns will be much less, often hundreds of ps, so the gains are much larger.
George accurately described the meaning of latency (which does not necessarily relate to computation time). It seems you want to optimize your design for speed. This is very complex and requires a lot of experience. The total runtime is
execution_time = (latency + (N * computation_cycles) ) * cycle_time
where N is the number of calculations you want to perform. If you develop for acceleration you should only compute on large data sets, i.e. N is big. Usually you then don't have latency requirements (real-time applications can be different). The determining factors are then the cycle_time and the computation_cycles, and here it is really hard to optimize, because there is a trade-off between them: the cycle_time is determined by the critical path of your design, which gets longer the fewer registers you place on it, and the longer the critical path, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each pipeline register increases the number of required cycles by one).
Maybe I should add that the latency is usually the number of computation_cycles (it's the first computation that determines the latency), but in theory this can be different.
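As a rough worked example of that formula, using the numbers from the pipelining answer above (2 pipeline stages, so latency = 2 cycles; one new result per cycle, so computation_cycles = 1; a worst-case cycle_time of 35 ns; N = 1,000,000 items):
execution_time = (2 + 1,000,000 * 1) * 35 ns ≈ 35 ms, i.e. roughly 28.6M items/s
For large N the latency term is negligible, which is why throughput (set by cycle_time) is usually what matters when you process long streams of data.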
