riscv and hardware random generator

riscv and hardware random generator - random

I don't know much about RISCV (actually thinking about Open RISC) and what little I have read tells it is about creating and optimizing things. Now people (FSF for instance) has suspected time and again that security forces may be giving weak random number generators and whatever passwords are generated could be broken by them easily.
While I don't know whether to believe that or not, from what little I read, it seemed RISC might be a platform which may make and have lot of improvements in random number generator (creation and secrecy both). Is this true or just fanciful thoughts ?

RISC-V and OpenRISC are Instruction Set Architectures (ISA) and have nothing to do with random number generators. Although an ISA can have dedicated instructions for generating random numbers.
Everything that is security related can be attacked in different ways like side channel attacks or timing attacks. Those attacks depend on the hardware and software implementations and not on the ISA.

Related

Random number generation / which algorithm?

We need to migrate to a better RNG or RBG for some key value generation which will be further used for encryption of the data.
Which will be the most suitable algorithm? Shall I consider NIST doc for this?

Any pseudo random number generator that produces a Gaussian distribution and that has a wide output (say at least 32 bits) should be enough for creating keys. It's up to you to determine your needs and then find a matching RNG.
For more info, see http://www.random.org/randomness.
Depending on the language you choose to implement this, I'm sure you can find source code for pseudo-RNG on the Web, if the one built-in into your system isn't good enough.

As we are a programming site, I would seriously look at the secure random number generators at your disposal in your particular runtime environment. In general you will have to rely on system resources to generate randoms, at least to seed the pseudo random number generator. The only possible exception are CPU specific random instructions, such as the ones used on the latest Intel CPU's (hopefully well-tested secure RNGs will become a main feature of CPU's).
Within many programming environments there is very little choice but to use OpenSSL or /dev/random for seeding. In general it is hard to find useful information about the random number generator. Sometimes the RNG is really not suitable at all (e.g. the native PHP version).
If possible, try to find something that conforms to NIST requirements.

Most effective method to use parallel computing on different architectures

I am planning to write something to take advantages of the many devices that I have at home.
Basically my aim is to use the laptop to execute calculations, and also to use my main desktop computer to add more power (and finish the task quicker). I work with cellular simulation and chemical interactions, so to me would be great to take advantage of all that I have available at home.
I am using mainly OSX, so I need something that may work with that OS. I can code in objective-C, C and C++.
I am aware of GCD, OpenCL and MPI, but I am not sure which way to go.
I was planning to not use the full power of my desktop but only some of the available cores (in this way I can continue to work on the desktop doing other tasks that are not so resource intensive). In particular I would love to use the graphic card power (it is an ATI card, so no CUDA), since all that I do mainly is spreadsheet, word and coding with Xcode, and the graphic card resources are basically unused in that scenario.
Is there a specific set of libraries or API, among the aforementioned 3, that would allow me to selectively route tasks, and use resources on another machine without leaving the control totally to the compiler? I've heard that GCD is great but it has very limited control on where the blocks are executed, while MPI is on the other side of the spectrum....OpenCL seems to be in the middle.
Before diving in one of these technologies I would like to know which one would most likely suit my needs; I am sure that some other researcher has already used successfully parallel computing to achieve what I am trying to achieve.
Thanks in advance.

MPI is more for scientific computing large scale many processors many nodes exc not for a weekend project, for what you describe I would suggest using OpenCl or any one the more distributed framework of AMQP protocol families, such as zeromq or rabbitMQ, or a combination of OpenCl and AMQP , or even simpler consider multithreading , i would suggest OpenMP for that. I'm not sure if you are looking for direct solvers or parallel functions but there are many that exist as well for gpu's and cpu's which you can find on the web

Sorry, but this question simply cannot be meaningfully answered as posed. To be sure, I could toss out a collection of buzzwords describing various technologies to look at like GCD, OpenMPI, OpenCL, CUDA and any number of other technologies which allow one to run a single program on multiple cores, multiple programs on different cooperating computers, or a single program distributed across CPU and GPU, and it sounds like you know about a number of those already so I wouldn't even be adding much value in listing the buzzwords.
To simply toss out such terms without knowing the full specifics of the problem you're trying to solve, however, is a bit like saying that you know English, French and a little German so sure, by all means - mix them all together in a single paragraph without knowing anything about the target audience! Similarly, you can parallelize a given computation in any number of ways, across any number of different processing elements, but whether that parallelization is actually a win or not is going to be entirely dependent on the nature of the algorithm, its data dependencies, how much computation is expected for each reasonable "work chunk", and whether it can be executed on a GPU with sufficient numeric precision, among many other factors. The more complex the technology you choose, the more those factors matter and the greater the possibility that the resulting code will actually be slower than its single-threaded, single machine counterpart. IPC overhead and data copying can, and frequently do, swamp all of the gains one might realize from trying to naively parallelize something and then add additional overhead on top of that, resulting in a net loss. This is why engineers who can do this kind of work meaningfully and well are in such high demand. :)
Without knowing anything about your calculations, I would move in baby steps. First try a simple multi-processor framework like GCD (which is already built in to OS X and requires no additional dependencies to use) and figure out how to factor your code such that it can effectively use all of the available cores on a single machine. Once you've learned where the wins are (and if there even are any - if multi-threading isn't helping, multi-machine parallelization almost certainly won't either), try setting up several instances of the calculation on several machines with a simple IPC model that allows for distributing the work. Having already factored your algorithm(s) for multiple threads, it should be comparatively straight-forward to further generalize the approach across multiple machines (though it bears noting that the two are NOT the same problem and either way you still want to use all the cores available on any of the given target machines, so the two challenges are both complimentary and orthogonal).

Where to learn about low-level, hard-core performance stuffs?

This is actually a 2 part question:
For people who want to squeeze every clock cycle, people talk about pipelines, cache locality, etc.
I have seen these low level performance techniques mentioned here and there but I have not seen a good introduction to the subject, from start to finish. Any resource recommendations? (Google gave me definitions and papers, where I'd really appreciate some kind of worked examples/tutorials real-life hands-on kind of materials)
How does one actually measure this kind of things? Like, as in a profiler of some sort? I know we can always change the code, see the improvement and theorize in retrospect, I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitudes are. I am interested in the metal here)

The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to copy and clear registers. For specific references,
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-features profiler. GNU gprof is open source http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)

EDIT/REWRITE:
If it is books you need Michael Abrash did a good job in this area, Zen of Assembly language, a number of magazine articles, big black book of graphics programming, etc. Much of what he was tuning for is no longer a problem, the problems have changed. What you will get out of this is the ideas of the kinds of things that can cause bottle necks and the kinds of ways to solve. Most important is to time everything, and understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird solutions, you may find an optimization that you were not aware of and didnt realize until you exposed it.
I have only just started reading but See MIPS Run (early/first edition) looks good so far (note that ARM took over MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of text books old and new to be had about MIPS. Mips being designed for performance (At the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself and the I/O around it and what is connected to that I/O. The insides of the processor chips themselves (for higher end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting off the train, from the train to your destination half a minute faster when the train ride was 3 hours is not necessarily a worthwhile optimization.
It is all about learning the hardware, you can probably stay within the ones and zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you really cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps get into the processors. I would recommend simulating HDL, for example processors at opencores, to get a feel for how some folks do their designs and getting a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big, memory interfaces are a huge deal and need to be learned, media (flash, hard disks, etc) and displays and graphics, networking, and all the types of interfaces between all of those things. And understanding at the clock level or as close to it as you can get, is what it takes.

Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is agner.
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare&move op-codes
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes to perform separate tasks avoiding partial stales and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-code (or LOCK prefix) to avoid high level synchronization constructs

Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I relate as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused with their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.

I'd suggest Optimizing subroutines in assembly
language
An optimization guide for x86 platforms.
It's quite heavy stuff though ;)

Safe mixing of entropy sources

Let us assume we're generating very large (e.g. 128 or 256bit) numbers to serve as keys for a block cipher.
Let us further assume that we wear tinfoil hats (at least when outside).
Being so paranoid, we want to be sure of our available entropy, but we don't entirely trust any particular source. Maybe the government is rigging our coins. Maybe these dice are ever so subtly weighted. What if the hardware interrupts feeding into /dev/random are just a little too consistent? (Besides being paranoid, we're lazy enough that we don't want to generate it all by hand...)
So, let's mix them all up.
What are the secure method(s) for doing this? Presumably just concatenating a few bytes from each source isn't entirely secure -- if one of the sources is biased, it might, in theory, lend itself to such things as a related-key attack, for example.
Is running SHA-256 over the concatenated bytes sufficient?
(And yes, at some point soon I am going to pick up a copy of Cryptography Engineering. :))

Since you mention /dev/random -- on Linux at least, /dev/random is fed by an algorithm that does very much what you're describing. It takes several variously-trusted entropy sources and mixes them into an "entropy pool" using a polynomial function -- for each new byte of entropy that comes in, it's xor'd into the pool, and then the entire pool is stirred with the mixing function. When it's desired to get some randomness out of the pool, the entire pool is hashed with SHA-1 to get the output, then the pool is mixed again (and actually there's some more hashing, folding, and mutilating going on to make sure that reversing the process is about as hard as reversing SHA-1). At the same time, there's a bunch of accounting going on -- each time some entropy is added to the pool, an estimate of the number of bits of entropy it's worth is added to the account, and each time some bytes are extracted from the pool, that number is subtracted, and the random device will block (waiting on more external entropy) if the account would go below zero. Of course, if you use the "urandom" device, the blocking doesn't happen and the pool simply keeps getting hashed and mixed to produce more bytes, which turns it into a PRNG instead of an RNG.
Anyway... it's actually pretty interesting and pretty well commented -- you might want to study it. drivers/char/random.c in the linux-2.6 tree.

Using a hash function is a good approach - just make sure you underestimate the amount of entropy each source contributes, so that if you are right about one or more of them being less than totally random, you haven't weakened your key unduly.
This isn't dissimilar to the approach used in key stretching (though you have no need for multiple iterations here).

I've done this before, and my approach was just to XOR them, byte-by-byte, against each other.
Running them through some other algorithm, like SHA-256, is terribly inefficient, so it's not practical, and I think it would be not really useful and possibly harmful.
If you do happen to be incredibly paranoid, and have a tiny bit of money, it might be fun to buy a "true" (depending on how convinced you are by Quantum Mechanics) a Quantum Random Number Generator.
-- Edit:
FWIW, I think the method I describe above (or something similar) is effectively a One-Time Pad from the point of view of either sources, assuming one of them is random, and therefore unattackable assuming they are independant and out to get you. I'm happy to be corrected on this if someone takes issue with it, and I encourage anyone not taking issue with it to question it anyway, and find out for yourself.

If you have a source of randomness but you're not sure whether it is biased or not, then there are a lot of different algorithms. Depending on how much work you want to do, the entropy you waste from the original source differes.
The easiest algorithm is the (improved) van Neumann algorithm. You can find the details in this pdf:
http://security1.win.tue.nl/~bskoric/physsec/files/PhysSec_LectureNotes.pdf
at page 27.
I also recommend you to read this document if you're interested in how to produce uniformly randomness from a given souce, how true random number generators work, etc!

Justification for using non-portable code

How does one choose if someone justify their design tradeoffs in terms of optimised code, clarity of implementation, efficiency, and portability?
A relevant example for the purpose of this question could be large file handling, where a "large file" is "quite a few GB" for a problem that would be simplified using random-access methods.
Approaches for reading and modifying this file could be:
Use streams anyway, and seek to the desired place - which is portable, but potentially slow, and is not clear - this will work for practically all OS's.
map the relevant portion of the file as a large block. Eg, mmap a 50MB chunk of the file for processing, for each chunk - This would work for many OS's, depending on the subtleties of implementing mmap for that system.
Just mmap the entire file - this requires a 64-bit OS and is the most efficient and clear way to implement this, however does not work on 32-bit OS's.

Not sure what you're asking, but part of the design process is to analyze requirements for portability and performance (amongst other factors).
If you know you'll never need to port the code, and you need absolutely the best performance, then you adjust your implementation accordingly. There's no point being portable just for its own sake.
Note also that if you want both performance and portability, there's nothing stopping you from providing an implementation for each platform. Of course this will increase your cost, so really, its up to you to prioritize your needs.

Without constraints, this question rationally cannot be answered rationally.
You're asking "what is the best color" without telling us whether you're painting a house or a car or a picture.
Constraints would include at least
Language of choice
Target platforms (multi CPU industrial-grade server or iPhone?)
Optimizing for speed vs. memory
Cost (who's funding this and is there a delivery constraint?)
No piece of software could have "ultimate" portability.
An example of this sort of problem being handled using a variety of methods but with a tight constraint both on the specific input/output required and the measurement of "best" would be the WideFinder project.

Basically, you need think first before coding. Every project is unique and an analysis of the needs could help decide what is primordial for it. What will make the best solution for any project depends on a few things...
First of all, will this project need to be or eventually be multiplatform? Depending on your choice, choosing the right programming language should be easier. Then again you could also use more than one language in your project and this is completely normal. Portability does not necessarily mean less performance. All it implies is that it involves harder work to achieve your goals because you will need quality code. Also, every programming language has its own philosophy. Learn what they are. One thing is for sure, certain problems frequently come back over and over. This is why knowing the different design patters can make a difference sometimes, but some languages have their own idioms and can be very relevant when choosing a language. Another thing that needs some thought is the different approaches that you can have for your project. Multithreading, sockets, client/server systems and many other technologies are all there for you to use. Choosing the right technology can help to make a project better.
Knowing the needs and the different solutions available today is what will help decide when comes the time to choose for the different tradeoffs.

It really depends on the drivers for the project. If you are doing in-house enterprise dev, then do the simplest thing that could work on your target hardare. Mod for performance reqs as needed.
If you know you need to support different hardware platforms on day 1, then you'll clearly need to choose a portable implementation, or use multiple approaches.

Portability for portability's sake has been a marketing spiel for Java since inception and is a fact of life for C by convention, and I believe most people who abide by it "grew up" with Java or C will say that.
However true, absolute portability will only be true for the most trivial to at most applications with medium complexity -- anything with high complexity will need specialized tweaks.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio