How does the function add_disk_randomness work?

I am trying to understand the add_disk_randomness function from the linux kernel. I read a few papers, but they don't describe it very well. This code is from /drivers/char/random.c:
add_timer_randomness(disk->random, 0x100 + disk_devt(disk));
disk_devt(disk) evaluates to disk->devt, which I understand to be a unique block device number. But what are the major and minor device numbers?
Then the block device number and the hex constant 0x100 are added together.
Then we also collect the time value disk->random. Is this the seek time for each block?
These two values are passed to the function add_timer_randomness. It would be nice to see an example with concrete values.

The first parameter of add_timer_randomness is a pointer to struct timer_rand_state. You can confirm this by checking struct gendisk.
struct timer_rand_state from random.c is reproduced below:
/* There is one of these per entropy source */
struct timer_rand_state {
        cycles_t last_time;
        long last_delta, last_delta2;
};
This struct stores the timestamp of the last input event as well as previous "deltas". add_timer_randomness first gets the current time (measured in jiffies), then reads last_time (also in jiffies), then overwrites last_time with the first value.
The first, second, and third order "deltas" are tracked as a means of estimating entropy. The main source of entropy from hard disk events is the timing of those events. More data is hashed into the entropy pool, but it doesn't contribute to the entropy estimate. (It is important not to overestimate how unpredictable the data you hash in is; otherwise your entropy pool, and therefore your RNG output too, may be predictable. Underestimating entropy, on the other hand, cannot make RNG output more predictable. It is always better to use a pessimistic estimator in this respect. That is why data that doesn't contribute to the entropy estimate is still hashed into the entropy pool.)
A delta is the time between two events (the difference between timestamps). The second order delta is the difference between two successive deltas. The third order delta is the difference between two successive second order deltas. The timer_rand_state pointer identifies the memory that tracks the previous timestamp and deltas for this source. delta3 does not need to be stored.
The entropy estimate from this timing data is based on the logarithm of the smallest absolute value among the first, second, and third order deltas. (Not exactly the logarithm: the result is always an integer, it is rounded down by one bit, and if the value you're taking the almost-logarithm of is zero the result is also zero.)
Say you have a device used as an entropy source that generates a new event every 50 milliseconds. The first order delta will always be 50 ms, so the second order delta is always zero. Since one of the three deltas is zero, the estimate is zero, which prevents this device's timings from being relied on as a significant entropy source. The estimator thus avoids overestimating the input entropy, so even if this device is used as an entropy source it won't "poison" the entropy pool with predictability.
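To make the delta bookkeeping concrete, here is a minimal Python sketch of the logic described above. This is my own illustration, not the kernel's C code; the cap of 11 credited bits mirrors the cap in the kernel, and bit_length() stands in for the kernel's fls().

class TimerRandState:
    def __init__(self):
        self.last_time = 0
        self.last_delta = 0
        self.last_delta2 = 0

def credit_for_event(state, now):
    delta = now - state.last_time          # first order delta
    state.last_time = now
    delta2 = delta - state.last_delta      # second order delta
    state.last_delta = delta
    delta3 = delta2 - state.last_delta2    # third order delta
    state.last_delta2 = delta2

    smallest = min(abs(delta), abs(delta2), abs(delta3))
    # "Round down by one bit", then take the position of the highest
    # set bit (an integer floor-log), capped at 11 bits of credit.
    return min((smallest >> 1).bit_length(), 11)

# A device firing every 50 time units stops earning credit as soon as
# the second order delta collapses to zero:
state = TimerRandState()
print([credit_for_event(state, t) for t in (100, 150, 200, 250)])   # [6, 5, 0, 0]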
The entropy estimate isn't based on any formal mathematics. We can't construct an accurate model of the entropy source because we don't know what it is: we don't know exactly what hardware a user's computer will have, or exactly how it will behave in an unknown environment. We just want to know that if we add one to the (estimated) entropy counter, then we've hashed at least one bit of entropy's worth of unpredictable data into the entropy pool. Extra data besides the timings is hashed into the pool without increasing the entropy counter, so we hope that if the timer-based estimator sometimes overestimates, there is some unaccounted-for unpredictability in the non-timer data to make up for it. (And if that's the case your RNG is still safe.)
I'm sure that sounds unconvincing, but I don't know how to help that. I tried my best to explain the relevant parts of the random.c code. Even if I could mind meld and provide some intuition for how the process works it probably would still be unsatisfying.

Related

How to get truly random data, not random data fed into a PRNG seed like CSRNG's do?

From what I understand, a CSRNG like RNGCryptoServiceProvider still passes the truly random user data, like mouse movement, etc., through a PRNG to sanitize the output and give it a uniform distribution. The bits need to be completely independent.
(This is for a theoretical attacker with infinite computing power.)
If the CSRNG takes 1 KB of true random data and expands it to 1 MB, all the attacker has to do is generate every combination of 1 KB of data, expand it, and see which 1 MB of data generates a one-time pad that returns sensible English output. I read somewhere that if the one-time pad had a PRNG anywhere in the RNG, it is just a glorified stream cipher. I was wondering if the truly random starting data was available in large enough quantities to use directly instead of cryptographically expanding it.
Or perhaps there are other ways to somehow get truly random data, so that all bits are independent of each other. I was thinking of XOR'ing with the mouse coordinates for a few seconds, then perhaps the last digits of Environment.TickCount, then maybe getting microphone input as well. However, as some point out on stackoverflow, I should really just let the OS handle it all. Unfortunately that isn't possible since there is a PRNG used. I would like to avoid a hardware solution, since this is meant to be an easy to use program, and also not utilize RDRAND since it ALSO uses a PRNG (unless RDRAND can return the truly random data before it goes through a PRNG?). Would appreciate any responses if such a thing is even possible; I've been working on this for weeks under the impression that RNGCryptoServiceProvider was sufficient for a one-time pad. Thanks.
(Side note: some say that for most crypto functions you don't need true entropy, just unpredictability. For a one-time pad, it MUST be truly random, otherwise it is not a one-time pad.)
As you know, "truly random" means each of the bits is independent of everything else as well as uniformly distributed. However, this ideal is hard, if not impossible, to achieve in practice. In general, the closest way to get "truly random data" in practice is to gather hard-to-guess bits from nondeterministic sources, then condense those bits into a random block of data.
There are many issues involved with getting this close to "truly random data", including the following:
The sources must be nondeterministic, that is, their output cannot be determined by their inputs. Examples of nondeterministic sources include timings of input devices; thermal noise; and the noise registered by microphone and camera outputs.
The sources' output must be hard to guess. This is more formally known as entropy, such as 32 bits of entropy per 64 bits of output. However, measuring entropy is far from trivial. If you need 1 MB (8 million bits) of truly random data, you need to have data with at least 8 million bits of entropy (which in practice will be many times more than 1 MB long depending on the sources), then condense the data somehow into 1 MB of data while preserving that entropy.
The sources must be independent of each other.
There should be two or more independent sources, because it's impossible to extract full randomness from just one source (see McInnes and Pinkas 1990). On the other hand, extracting randomness from three or more independent sources is comparatively easy, but there is still the matter of choosing an appropriate randomness extractor, and a survey of randomness extractors is beyond the scope of this answer.
In general, for random number generation purposes, the more sources available, the better.
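As a rough illustration of "gather hard-to-guess bits from more than one source, then condense them", here is a sketch in Python. The helper names are mine, timing jitter and the OS pool stand in for genuinely independent noise sources, and a plain hash stands in for a vetted randomness extractor, so treat it as a shape, not a design.

import hashlib, os, time

def gather_timing_noise(samples=1024):
    # Jitter between very short sleeps: weak, hard-to-model bits.
    out = bytearray()
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        time.sleep(0)
        out += (time.perf_counter_ns() - t0).to_bytes(8, "little")
    return bytes(out)

def condense(*pools, out_len=32):
    # Hash all the gathered pools together; keep the output much
    # smaller than the entropy you estimate you actually collected.
    h = hashlib.sha512()
    for p in pools:
        h.update(p)
    return h.digest()[:out_len]

block = condense(gather_timing_noise(), os.urandom(64))
print(block.hex())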
REFERENCES:
McInnes, J. L., & Pinkas, B. (1990, August). On the impossibility of private key cryptography with weakly random keys. In Conference on the Theory and Application of Cryptography (pp. 421-435).

RNG seed selection highly affects simulation outcomes

I'm developing for the first time a simulation in Omnet++ 5.4 that makes use of the available queueinglib library. In particular, I've built a simple model that comprises a server and some passive queues.
I've repeated the simulation different times, setting different seeds and parameters, as written in the Omnet Simulation Manual, here is my omnetpp.ini:
# specify how many times a run needs to be repeated
repeat = 100
# default: different seed for each run
# seed-set = ${runnumber}
seed-set = ${repetition} # I tried both lines
OppNet.source.interArrivalTime = exponential(${10,5,2.5}s)
This produces 300 runs, 100 repetitions for each value of the interArrivalTime distribution parameter.
However, I'm observing some "strange" behaviour, namely, the resulting statistics are highly variable according to the RNG seed.
For example, considering the lengths of the queues in my model, in the majority of the runs I get values smaller than 10, while in a few others the mean value differs by orders of magnitude (85000, 45000?).
Does this mean that my implementation is wrong? Or is it possible that the random seed selection could influence the simulation outcomes so heavily?
Any help or hint is appreciated, thank you.
I cannot rule out that your implementation is wrong without seeing it, but it is entirely possible that you just happened to configure a chaotic scenario.
And in that case, yes, any minor change in the inputs (in this case the PRNG seed) might cause a major divergence in the results.
EDIT:
Especially considering a given (non-trivial) queuing network, if you are varying the "load" (number/rate of incoming jobs/messages, with some random distribution), you might observe different "regimes" in the results:
with very light load, there is barely any queuing
with a really heavy load (close, or at max capacity), most queues are almost always loaded, or even growing constantly without limit
and somewhere in between the two, you might get this erratic behavior, where sometimes a couple of queues suddenly grow really deep, then are emptied, then different queues are loaded up, and so on; as a function of either time, or the PRNG seed, or both - this is what I mean by chaotic
But this is just speculation...
Nobody can say whether your implementation is right or wrong without seeing it. However, there are some general rules that apply to queues which you should be aware of. You say that you're varying the interArrivalTime distribution parameter. A very important concept in queueing is the traffic intensity, which is the ratio of the arrival rate to the service rate. If that ratio is less than one, the line length can vary by a great deal, but in the long run there will be time intervals when the queue empties out, because on average the server can handle more customers than arrive. This is a stable queue. However, if that ratio is greater than one, the queue will grow without bound: the longer you run the system, the longer the line will get. The thing that surprises many people is that the line also goes to infinity asymptotically when the traffic intensity is exactly one.
The other thing to know is that the closer the traffic intensity gets to one for a stable queue, the greater the variability is. That's because the average increases, but there will always be periods of line length zero as described above. In order to have zeros always present but have the average increasing, there must be times where the queue length gets above the average, which implies the variability must be increasing. Changing the random number seed gives you some visibility on the magnitude of the variance that's possible at any slice of time.
Bottom line here is that you may just be seeing evidence that queues are weirder and more variable than you thought.
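To see that variability for yourself, here is a standalone sketch (not OMNeT++/queueinglib code; the model, parameters, and seeds are made up) of a single M/M/1 queue run with several seeds at increasing traffic intensity rho:

import random

def mm1_mean_time_in_system(arrival_rate, service_rate, n_jobs=50_000, seed=0):
    rng = random.Random(seed)
    t_arrival = 0.0
    server_free_at = 0.0
    total_time = 0.0
    for _ in range(n_jobs):
        t_arrival += rng.expovariate(arrival_rate)       # next arrival
        start = max(t_arrival, server_free_at)           # wait if the server is busy
        server_free_at = start + rng.expovariate(service_rate)
        total_time += server_free_at - t_arrival         # time spent in the system
    return total_time / n_jobs

for rho in (0.5, 0.9, 0.99):
    estimates = [mm1_mean_time_in_system(rho, 1.0, seed=s) for s in range(5)]
    print(rho, [round(e, 1) for e in estimates])

At rho = 0.5 the five seeds agree closely; at rho = 0.99 the estimates for the very same model can differ by a large factor from one seed to the next, which is exactly the kind of spread described above.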

If a Bitcoin mining nonce is just 32 bits long, how come it is increasingly difficult to find the winning hash?

I'm learning about mining, and the first thing that surprised me is that the nonce part of the algorithm, which is supposed to be looped over until you get a hash smaller than the target, is just 32 bits long.
Can you explain why it is then so difficult to loop over an unsigned int, and why it becomes increasingly difficult over time? Thank you.
The task is: try different nonce values in your potential block until you reach a block having a hash value below some given threshold.
I can't find the source right now, but I'm quite sure that since the introduction of special mining ASICs the 32-bit nonce is no longer enough to keep the miners busy for the planned 10-minute interval between blocks. They are able to compute 4 billion block hashes in less than 10 minutes.
Increasing the difficulty didn't help anymore, as that reached the point where none of the 4 billion possible nonce values gave a hash below the threshold.
So they found some additional fields in the block that are now used as nonce-extension. The principle is still the same: try different values until you reach a block with a hash below the threshold, only now it's more than 32 bits that can be varied, allowing for the threshold to be lowered beyond the former 32-bit-implied barrier.
Because it's not just the 32-bit nonce that is involved in the calculation. The 1 MB of transaction data is also part of the mining input. There is then a non-trivial amount of arithmetic to arrive at the output, which can then be compared with the target.
Bitcoin mining is looping over all 4 billion uints until you find a "right" one.
The way that difficulty is increased is that only some of the bits of the output matter. E.g. early on, 11 bits of the output had to match some specific pattern and the remaining output bits could be anything. In theory there would then be about 2 million (2^32 / 2^11) "right" nonce values for each transaction block, uniformly distributed across the range of a uint. The "difficulty" is then increased so that 13 bits have to match the pattern, so now there are 4x fewer "right" answers and it takes (on average) 4x longer to find one.
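A toy sketch of that search (purely illustrative: the field layout, sizes, and the "extra nonce" placement here are simplified stand-ins, not the real Bitcoin block format, and the target is set artificially easy so the demo finishes quickly):

import hashlib, struct

def double_sha256(data):
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(block_body, target, extra_nonce=0):
    while True:
        body = block_body + struct.pack("<Q", extra_nonce)
        for nonce in range(2**32):                       # the 32-bit nonce field
            h = double_sha256(body + struct.pack("<I", nonce))
            if int.from_bytes(h, "big") < target:
                return extra_nonce, nonce, h.hex()
        extra_nonce += 1            # 32 bits exhausted: vary the block itself

# Demo target: hash must start with 16 zero bits (real targets are far lower).
target = 1 << (256 - 16)
print(mine(b"toy block header and transactions", target))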

How can a pseudorandom number generator possibly be non-repeating?

My understanding is that PRNGs work by using an input seed and an algorithm that converts it to a very unrelated output, so that the next generated number is as unpredictable as possible. But here's the problem I see with it:
Any pseudorandom number generator that I can imagine has to have a finite number of outcomes. Let's say that I'm using a random number generator that can generate any number between 0 and one hundred billion. If I call for an output one hundred billion and one times, I can be certain that one number has been output more than once. If the same seed will always give the same output when put through an algorithm, then I can be sure that the PRNG will begin a loop. Where is my logic flawed here?
In the case that I am correct, if you know the algorithm for a PRNG, and that PRNG is being used for cryptography, could this approach not be used (and are there any measures in place to prevent it)?
Use the PRNG to generate the entire looping set of numbers possible.
Know the timestamp of when a private key was generated, and know the time and output of the PRNG later on.
Based on how long it takes to calculate, determine how many numbers are between the known output and the unknown one.
Look up the generated number in the pre-generated list.
You are absolutely right that in theory that approach can be used to break a PRNG, since, as you noted, given a sufficiently long sequence of outputs, you can start to predict what comes next.
The issue is that "sufficiently long" might be so long that this approach is completely impractical. For example, the Mersenne twister PRNG, which isn't designed for cryptographic use, has a period of 219,937 - 1, which is so long that it's completely infeasible to attempt the attack that you're describing.
Generally speaking, imagine that a pseudorandom generator uses n bits of internal storage. That gives 2n possible internal configurations of those bits, meaning that you may need to see 2n + 1 outputs before you're guaranteed to see a repeat. Given that most cryptographically secure PRNGs use at least 256 bits of internal storage, this makes your attack infeasible.
One detail worth noting is that there's a difference between "the PRNG repeats a number" and "from that point forward the numbers will always be the same." It's possible that a PRNG will repeat an output multiple times before moving on to output a different number next, provided that the internal state is different each time.
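A tiny illustration of that distinction (a made-up toy generator, nothing cryptographic): it has 8 bits of hidden state but outputs only 4 bits, so outputs repeat long before the state cycle does.

def toy_prng(state):
    while True:
        state = (5 * state + 1) % 256     # full 8-bit internal state
        yield state >> 4                  # only the top 4 bits are output

gen = toy_prng(0)
print([next(gen) for _ in range(20)])
# Duplicate outputs show up almost immediately, yet the generator's
# internal state does not repeat until 256 steps have passed.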
You are correct, a PRNG produces a long sequence of numbers and then repeats. For ordinary use this is usually sufficient. Not for cryptographic use, as you point out.
For ideal cryptographic numbers, we need to use a true RNG (TRNG), which generates random numbers from some source of entropy (= randomness in this context). Such a source may be a small piece of radioactive material on a card, thermal noise in a disconnected microphone circuit or other possibilities. A mixture of many different sources will be more resistant to attacks.
Commonly such sources of entropy do not produce enough random numbers to be used directly. That is where PRNGs are used to 'stretch' the real entropy to produce more pseudo random numbers from the smaller amount of entropy provided by the TRNG. The entropy is used to seed the PRNG and the PRNG produces more numbers based on that seed. The amount of stretching allowed is limited, so the attacker never gets a long enough string of pseudo-random numbers to do any worthwhile analysis. After the limit is reached, the PRNG must be reseeded from the TRNG.
Also, the PRNG should be reseeded anyway after every data request, no matter how small. There are various cryptographic primitives that can help with this, such as hashes. For example, after every data request, a further 128 bits of data could be generated, XOR'ed with any accumulated entropy available, hashed and the resulting hash output used to reseed the generator.
Cryptographic RNGs are slower than ordinary PRNGs because they use slow cryptographic primitives and because they take extra precautions against attacks.
For an example of a CSPRNG, see Fortuna.
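A minimal sketch of that stretch-then-reseed shape (the class and names are my own, not Fortuna or any particular library's API; a real design would use a vetted construction):

import hashlib, os

class ReseedingGenerator:
    def __init__(self, trng=os.urandom):
        self.trng = trng               # stand-in for a true entropy source
        self.key = trng(32)            # initial seed from the TRNG

    def random_bytes(self, n):
        # Stretch: derive output blocks from the current key.
        out = bytearray()
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(self.key + counter.to_bytes(8, "big")).digest()
            counter += 1
        # Reseed after every request: fold fresh entropy into the key.
        self.key = hashlib.sha256(self.key + self.trng(16)).digest()
        return bytes(out[:n])

g = ReseedingGenerator()
print(g.random_bytes(20).hex())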
It's possible to create truly random number generators on a PC because PCs are nondeterministic machines.
Indeed, with the complexity of the hierarchical memory levels, the intricacy of the CPU pipelines, the coexistence of innumerable processes and threads activated at arbitrary moments and competing for resources, and the asynchronism of the I/O devices, there is no predictable relation between the number of operations performed and the elapsed time.
So looking at the system time every now and then is a perfect source of randomness.
Any pseudorandom number generator that I can imagine has to have a finite number of outcomes.
I don't see why that's true. Why can't it have gradually increasing state, failing when it runs out of memory?
Here's a trivial PRNG algorithm that never repeats:
1) Start with any amount of data unknown to an attacker as the seed.
2) Compute the SHA512 hash of the data.
3) Output the first 256 bits of that hash.
4) Append the last byte of that hash to the data.
5) Go to step 2.
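Here is that algorithm as a short Python sketch (illustrative only; note that the state grows by one byte per output, so it never cycles but gets slower and hungrier for memory over time):

import hashlib, os

class GrowingPRNG:
    def __init__(self, seed):
        self.data = bytearray(seed)                           # step 1

    def next_block(self):
        digest = hashlib.sha512(bytes(self.data)).digest()    # step 2
        self.data.append(digest[-1])                          # step 4
        return digest[:32]                                    # step 3: first 256 bits

prng = GrowingPRNG(os.urandom(32))
print(prng.next_block().hex())
print(prng.next_block().hex())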
And, for practical purposes, this doesn't matter. With just 128 bits of state, you can build a PRNG that won't repeat for 2^128 = 340282366920938463463374607431768211456 outputs. If you pull a billion outputs a second for a billion years, you won't get through a billionth of them.

Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest, ranging over 6 orders of magnitude in size. We are looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches - constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sine tone requires a significantly lower bitrate than something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly fewer total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest and is thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed-size digest hash, like SHA256, as an initialization vector for AES/CTR acting as a pseudo-random generator, then taking the first x bits off that.
That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an arbitrarily long stream of bits, and you just truncate at however many you want. However, testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...
Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But the key factor in collision risk is the size of the input set relative to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files is not significantly different from the chance of a collision for two 1 KB files, or even one 1 TB and one 1 KB file. This is because hash functions strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, the number of inputs a hash function can absorb before a collision becomes likely corresponds to roughly half the bit width of its output. The Wikipedia article on the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
[Table of collision probabilities for various hash output sizes and input counts, from preshing.com; not reproduced here.]
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 10^24 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 * 10^24 160-bit numbers would themselves take up an unreasonably large amount of space; tens of trillions of terabytes, if I'm doing the math right. And that's without counting the 10^24 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 10^21 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
No matter what hash strategy you decide upon however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (works well for any variable length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fallback to the most basic comparator: the expensive bit-for-bit check of the two inputs.
If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64-bit hash code.
The next item you hash will take one of 2^64 values, so you will have roughly an N / 2^64 probability of a hash collision when you add the next item.
Note that this probability does NOT depend on the original size of the data items. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the collision probability you are willing to tolerate.
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.
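Putting the partitioning idea together with the sponge/XOF mentioned in the question, here is a rough sketch. The helper names are mine, the sizing uses the standard birthday approximation p ≈ n^2 / 2^(b+1), and SHAKE-256 is used as the extendable-output function; treat it as a back-of-the-envelope aid, not a spec.

import math, hashlib

def digest_bits_needed(n_items, max_collision_prob):
    # Birthday bound: p ~= n^2 / 2^(b+1)  =>  b ~= log2(n^2 / (2 * p))
    return math.ceil(math.log2(n_items**2 / (2 * max_collision_prob)))

def variable_digest(data, n_items, max_collision_prob=1e-6):
    bits = digest_bits_needed(n_items, max_collision_prob)
    return hashlib.shake_256(data).digest((bits + 7) // 8)

# A partition holding at most 1000 items needs 39 bits; one holding a
# billion items needs 79 bits, for the same one-in-a-million risk
# (rounded up to whole bytes below).
print(len(variable_digest(b"item in a small partition", 1_000)) * 8)            # 40
print(len(variable_digest(b"item in a large partition", 1_000_000_000)) * 8)    # 80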
