Is generating random numbers from hardware performance cryptographically secure? - random

Suppose I have a program that needs an RNG.
If I were to run arbitrary operations and measure the ∆t they take, I could generate random numbers from that.
For example:
#include <chrono>
auto start = std::chrono::steady_clock::now();
for (volatile int i = 0; i < 100; i++);  // volatile keeps the compiler from optimizing the loop away
auto end = std::chrono::steady_clock::now();
double dt = std::chrono::duration<double>(end - start).count();  // elapsed time in seconds
dt will be more or less random based on many variables on the device such as battery level, transistor age, room temperature, other processes running, etc.
Now, suppose I keep generating dts and multiplying them together as I go, hundreds of times, thousands of times, millions of times. Eventually I am left with a very arbitrary number based on values that were more or less randomly determined by hardware performance benchmarking.
Every time I multiply another dt in, the number of possible outputs grows exponentially, so determining what the possible outputs might be becomes a perhaps impossible task after millions of iterations, even if each individual dt value falls within a similar range.
A thought then occurs: on a very consistent device, dt may always be one of, say, 0.000000011, 0.000000012, 0.000000013, or 0.000000014. Then the final output, no matter how many times I iterate and multiply, will be a number of the form 0.000000011^a * 0.000000012^b * 0.000000013^c * 0.000000014^d, and that is probably easy to crack.
But then I turn to hashing: suppose that rather than multiplying the dts, I concatenate each new one in string form to the previous values and hash the result, so every time I generate a new dt from the hardware's random environmental behaviour, I hash. At the end I digest the hash into whatever form I need, and now the final output number can't be written in a general algebraic form.
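For concreteness, a rough sketch of what I mean by the hash-and-digest loop, assuming OpenSSL's SHA-256 API is available (illustration only, not a vetted CSPRNG):

#include <chrono>
#include <cstdint>
#include <openssl/sha.h>

// Sketch: chain timing deltas through SHA-256, then digest at the end.
void collect_timing_entropy(unsigned char out[SHA256_DIGEST_LENGTH], int rounds)
{
    SHA256_CTX ctx;
    SHA256_Init(&ctx);
    for (int r = 0; r < rounds; r++) {
        auto start = std::chrono::steady_clock::now();
        for (volatile int i = 0; i < 100; i++);  // the timed busy loop from above
        auto end = std::chrono::steady_clock::now();
        std::uint64_t dt = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
        SHA256_Update(&ctx, &dt, sizeof dt);     // feed the raw delta into the running hash
    }
    SHA256_Final(out, &ctx);                     // digest to whatever form is needed
}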
Will numbers generated in this form be cryptographically secure?

Using a clock potentially leaks information to an adversary. Using a microphone does too -- the adversary may have planted a bug and be hearing the same input. Best not to rely on any single source but to combine entropy inputs from multiple sources, both external to your computer and internal. By all means use internal OS entropy sources, such as /dev/urandom, but use other sources as well.
It might be worth reading the description of the Fortuna CSPRNG for ideas.

If you take enough samples, and use few enough low bits of the time difference, then maybe timing of async interrupts could eventually add up to a useful amount of entropy.
Most OS kernels will collect entropy from timing in their own interrupt handlers, as part of the source for /dev/urandom. But if you really want to roll your own instead of asking the OS for randomness, it's plausible, provided you're very careful with your mixing function. For example, have a look at what the Linux kernel uses for mixing new data into its entropy pool; it has to avoid being "hurt" by sources that aren't actually random on a given system.
Other than interrupts, performance over short times is nearly deterministic, and CPU frequency variations are quantized into not that many different frequencies.
based on many variables on the device such as ...
battery level: maybe a 2-state effect like limiting max turbo when not on AC power, and/or when the battery is low.
transistor age: no. At most an indirect effect if aged transistors use more power, leading to the CPU running hotter and dropping out of max turbo sooner. I'm not sure there's any significant effect.
room temperature: again, only possibly reducing max clock speed sooner. Unless you're wasting multiple seconds of CPU time on this, it won't have an effect even on lightweight laptops. Desktops typically have enough cooling to sustain max turbo indefinitely on a single core, especially for simple scalar code. (SIMD FMA would make a lot more heat.)
other processes running: yes, that and async interrupts that happen to land inside your timed intervals will be the main sources of randomness.
Most of the factors that affect clock speed will just uniformly scale up all times, correlated between samples, rather than adding more entropy. Clock frequency doesn't change that often; after ramping up to full speed for your benchmark loops, expect it to stay constant for multiple seconds.
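To make "few enough low bits" concrete, here is a rough sketch (mine, not the kernel's actual code) of keeping only a couple of low bits per timed interval; the packed bytes would still need to go through a proper mixing function (e.g. a cryptographic hash) before use, and 2 bits per sample is an assumption, not a measured entropy estimate:

#include <cstdint>
#include <vector>

// Keep only the 2 low bits of each raw timing delta (in ns), on the theory
// that only interrupt/scheduling jitter is unpredictable, and pack them
// into bytes for later mixing.
std::vector<std::uint8_t> pack_low_bits(const std::vector<std::uint64_t>& deltas_ns)
{
    std::vector<std::uint8_t> pool;
    std::uint8_t acc = 0;
    int filled = 0;
    for (std::uint64_t dt : deltas_ns) {
        acc = static_cast<std::uint8_t>((acc << 2) | (dt & 0x3));  // 2 lowest bits only
        if (++filled == 4) { pool.push_back(acc); acc = 0; filled = 0; }
    }
    return pool;
}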

Related

Computing vs printing

In my program, I made a few modifications for performance improvement.
First, I removed some repeated 3D point computations.
Second, I removed some print statements.
What I observe is that the second change substantially improved performance, while the first one did not help nearly as much.
Does this mean that computations involving floating-point numbers are much less expensive than printing data to the console? Isn't floating-point math considered to be highly computationally expensive?
Floating-point arithmetic is often more expensive than integer arithmetic, in terms of processor cycles, the silicon area it requires, and/or the energy it consumes. However, printing is generally much more expensive.
Typical performance for floating-point additions or multiplications might be a latency of four processor cycles, compared to one for integer additions or multiplications.
Formatting output requires many instructions. Converting numbers to decimal requires dividing, performing table lookups, or executing other algorithms. The characters generated to represent a number must be placed in a buffer. Checks must be performed to ensure that internal buffers are not overflowed. When a buffer is full, or a printing operation is complete and its data must be sent to the output device (rather than merely held in a buffer for future operations), an operating system call must be performed to transfer the data from user memory to some input-output driver. Even simple in-buffer formatting operations may take hundreds of cycles, and printing that requires interaction with the file system or other devices may take thousands of cycles. (The actual upper limit is unbounded, since printing may have to wait for some physical device to become ready. But even if all the activity of a particular operation stays inside the computer itself, a print operation may take thousands of cycles.)
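A rough way to see the difference yourself (a sketch of mine; the exact numbers depend entirely on the machine, compiler, and C library):

#include <chrono>
#include <cstdio>

// Time a million floating-point additions against a million printf calls.
// Expect the printing loop to be slower by several orders of magnitude.
int main()
{
    const int n = 1000000;
    volatile double sum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; i++) sum = sum + 1.5;
    auto t1 = std::chrono::steady_clock::now();

    for (int i = 0; i < n; i++) std::printf("%d\n", i);
    auto t2 = std::chrono::steady_clock::now();

    // Report on stderr so it doesn't mix with the timed stdout output.
    std::fprintf(stderr, "adds:   %f s\n", std::chrono::duration<double>(t1 - t0).count());
    std::fprintf(stderr, "prints: %f s\n", std::chrono::duration<double>(t2 - t1).count());
}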

Is there a constant-time algorithm for generating a bandlimited sawtooth?

I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample. This puts some interesting restrictions on what algorithms can be used - any algorithm that refers to a previous set of samples cannot be implemented in this fashion.
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
This makes synthesizing bandlimited waveforms difficult. One approach is additive synthesis of partials using the Fourier series. However, this runs in O(n) time, and is especially slow on a GPU, to the point that the gain from parallelism is lost. If there were an algorithm that ran in O(1) time, it would eliminate branching AND be up to 1000x faster when dealing with the audible range.
I'm specifically looking for something like a DSF for a sawtooth. I've been trying to work out a simplification of the Fourier series by hand, but that's really, really hard, mainly because it involves harmonic numbers, a.k.a. the only singularity of the Riemann zeta function.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
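For reference, the additive-synthesis approach I mean looks roughly like this per sample (a plain C++ sketch; on the GPU each thread would evaluate one n, and the loop over partials is the O(n) cost I'd like to eliminate):

#include <cmath>

// One bandlimited-sawtooth sample via the Fourier series, summing partials up
// to Nyquist. K = floor((fs/2)/f0), so the per-sample cost is O(K); this sign
// convention gives a ramp rising from -1 to 1 once per period.
double saw_sample(double f0, double fs, long n)
{
    const double pi = 3.141592653589793;
    const double t = n / fs;
    const int K = static_cast<int>((fs / 2.0) / f0);  // highest partial at or below Nyquist
    double x = 0.0;
    for (int k = 1; k <= K; k++)
        x += ((k % 2) ? 1.0 : -1.0) * std::sin(2.0 * pi * k * f0 * t) / k;
    return (2.0 / pi) * x;
}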
Filtering is one of those algorithms. Bandpass, lowpass, or highpass - all of them require looking to the last few samples generated in order to compute the result. This can't be done because those samples haven't been generated yet.
That's not right. IIR filters do need previous results, but FIR filters only need previous input; that is pretty typical for the things that GPUs were designed to do, so it's not likely a problem to let every processing core access, say, 64 input samples to produce one output sample -- in fact, the cache architectures that Nvidia and AMD use lend themselves to that.
Is a constant-time algorithm achievable? If not, can it be proven that it isn't?
It is! In two aspects:
as mentioned above, FIR filters only need multiple samples of immutable input, so they can be parallelized heavily without problems, and
even if you need to calculate your input first, and would like to parallelize that (I don't see a reason for that -- generating a sawtooth is not CPU-limited, but memory-bandwidth-limited), every core could simply calculate the last N samples itself -- sure, there are N-1 redundant operations, but as long as your number of cores is much bigger than your N, you will still be faster, and every core will have constant run time.
Comments on your approach:
I'm looking into the feasibility of GPU synthesized audio, where each thread renders a sample.
From a higher-level perspective, that sounds too fine-grained. I mean, say you have 3000 stream processors (a high-end consumer GPU). Assuming a sampling rate of 44.1 kHz, and assuming each of these processors produces only one sample, letting them all run once gives you only 1/14.7 of a second of audio (mono). Then you'd have to move on to the next block of audio.
In other words: there are bound to be many more samples than processors. In these situations, it's typically far more efficient to let one processor handle a sequence of samples; for example, if you want to generate 30 s of audio, that's 1.323 MSamples. Simply split the problem into 3000 chunks, one per processor, and give each of them the 44100*30/3000 = 441 samples it should process, plus 64 samples of "history" before the first of its "own" samples; that will still easily fit into local memory.
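As a sketch of that chunking scheme (plain C++ here; a GPU kernel would do the same work per work-item, and the 64-tap filter length is just the example number from above):

#include <cstddef>
#include <vector>

// Each worker produces its own chunk of output from immutable input only.
// Because the filter is FIR, output sample n needs input samples n-63..n and
// never any previously *computed* output, so chunks are fully independent.
void fir_chunk(const std::vector<double>& in,    // whole input signal (read-only)
               const std::vector<double>& taps,  // e.g. 64 coefficients
               std::vector<double>& out,         // preallocated, same size as in
               std::size_t first, std::size_t count)
{
    for (std::size_t n = first; n < first + count && n < in.size(); n++) {
        double acc = 0.0;
        for (std::size_t k = 0; k < taps.size() && k <= n; k++)
            acc += taps[k] * in[n - k];          // reads input history only
        out[n] = acc;
    }
}

Launching one such call per chunk of 441 output samples from a thread pool, or one work-item per chunk on the GPU, matches the split described above.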
Yet another thought:
I'm coming from a software-defined radio background, where there are usually millions of samples per second, rather than a few kHz of sampling rate, processed in real time (i.e. processing speed > sampling rate). Still, doing computation on the GPU only pays off for the more CPU-intensive tasks, because there's significant overhead in exchanging data with the GPU, and CPUs nowadays are blazingly fast. So, for your relatively simple problem, it might never be faster to do things on the GPU than to optimize them on the CPU; things of course look different if you have to process lots of samples, or a lot of streams, at once. For finer-grained tasks, the cost of filling a buffer, moving it to the GPU, and getting the result buffer back into your software usually kills the advantage.
Hence, I'd like to challenge you: Download the GNU Radio live DVD, burn it to a DVD or write it to a USB stick (you might as well run it in a VM, but that of course reduces performance if you don't know how to optimize your virtualizer; really - try it from a live medium), run
volk_profile
to let the VOLK library test which algorithms work best on your specific machine, and then launch
gnuradio-companion
And then, open the following two signal processing flow graphs:
"classical FIR": This single-threaded implementation of the FIR filter yields about 50MSamples/s on my CPU.
FIR Filter implemented with the FFT, running on 4 threads: This implementation reaches 160MSamples/s (!!) on my CPU alone.
Sure, with the help of FFTs on my GPU, I could be faster, but the thing here is: even with the "simple" FIR filter, I can, with a single CPU core, get 50 megasamples per second out of my machine -- meaning that, at a 44.1 kHz audio sampling rate, in a single second I can process roughly 19 minutes of audio. No copying in and out of host RAM. No GPU cooler spinning up. It might really not be worth optimizing further. And if you do optimize and take the FFT filter approach: 160 MS/s means roughly one hour of audio per processing second, including sawtooth generation.

Computing time in relation to number of operations

Is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 iterations. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are counting operations in some higher-level programming language, since you mention a for loop. Depending on the optimization level of your compiler, you could therefore see varying computation times, depending on what kinds of optimizations are performed.
But even if you are talking about assembly-language instructions, it is still not possible to calculate the computation time from the number of instructions and the CPU speed alone. Some instructions take multiple CPU cycles. If you perform a lot of memory accesses, you will likely have cache misses, and possibly page faults that go all the way to disk, which makes the timing unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
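For illustration, a sketch of dividing a loop's iterations among four workers, using C++ threads rather than fork (either approach works; the per-iteration work here is just a stand-in):

#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    const std::uint64_t total = 400000000ULL;  // iterations to divide up
    const int workers = 4;
    std::vector<std::uint64_t> partial(workers, 0);
    std::vector<std::thread> pool;

    for (int w = 0; w < workers; w++) {
        pool.emplace_back([&, w] {
            std::uint64_t lo = total / workers * w;
            std::uint64_t hi = (w == workers - 1) ? total : total / workers * (w + 1);
            for (std::uint64_t i = lo; i < hi; i++)
                partial[w] += i % 7;           // stand-in for the real per-iteration work
        });
    }
    for (auto& t : pool) t.join();             // wait for all four workers

    std::uint64_t sum = 0;
    for (auto p : partial) sum += p;           // combine the partial results
    std::printf("%llu\n", static_cast<unsigned long long>(sum));
}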

Is it possible to find hotspots in a parallel application using a sampling profiler?

As far as I understand, a sampling profiler works as follows: it interrupts the program execution at regular intervals and reads out the call stack. It notes which part of the program is currently executing and increments a counter that represents this part of the program. In a post-processing step, the ratio of the whole execution time that each function is responsible for is computed, by looking at the counter C for that specific function and the total number of samples N:
ratio of the function = C / N
Finding the hotspots is then easy, as these are the parts of the program with a high ratio.
But how can this be done for a parallel program running on parallel hardware? As far as I know, when the program execution is interrupted, the currently executing parts of the program on ALL processors are determined. Because of that, a function which is executed in parallel gets counted multiple times. Thus the number of samples C for this function can no longer be used to compute its share of the whole execution time.
Is my thinking correct? Are there other ways how the hotspots of a parallel program can be identified - or is this just not possible using sampling?
You're on the right track.
Whether you need to sample all the threads depends on whether they are doing the same thing or different things.
It is not essential to sample them all at the same time.
You need to look at the threads that are actually working, not just idling.
Some points:
Sampling should be on wall-clock time, not CPU time, unless you want to be blind to needless I/O and other blocking calls.
You're not just interested in which functions are on the stack, but which lines of code, because they convey the purpose of the time being spent. It is more useful to look for a "hot purpose" than a "hot spot".
The cost of a function or line of code is just the fraction of samples it appears on. To appreciate that, suppose samples are taken every 10ms for a total of N samples. If the function or line of code could be made to disappear, then all the samples in which it is on the stack would also disappear, reducing N by that fraction. That's what speedup is.
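For example, with made-up numbers: if a line of code appears on 25 of 100 samples and could be removed entirely, roughly 25% of the run time would disappear with it, for a speedup of about 100/75 ≈ 1.33x.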
In spite of the last point, in sampling, quality beats quantity. When the goal is to understand what opportunities you have for speedup, you get farther faster by manually scrutinizing 10-20 samples to understand the full reason why each moment in time is being spent. That's why I take samples manually. Knowing the amount of time with statistical precision is really far less important.
I can't emphasize enough the importance of finding and fixing more than one problem. Speed problems come in groups, and each one you fix has a multiplying effect on the ones already fixed. The ones you don't find end up being the limiting factor.
Programs that involve a lot of asynchronous inter-thread message-passing are more difficult, because it becomes harder to discern the full reason why a moment in time is being spent.

For parallel algorithm with N threads, can performance gain be more than N?

A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will be executed more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step in the pipeline does a fairly small amount of calculation, but requires enough static data to approximately fill the entire data cache -- but each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible in the general case. At best we can perfectly divide the work by N. In its usual form, Amdahl's formula gives Speedup = 1 / ((1 - P) + P/N), where P is the parallelizable fraction and N is the number of processors. Given no serial portion (P = 1), the formula becomes:
Speedup = 1/(1/N)
This of course reduces to just N.
