Binomial Random Variate Generator on CUDA

My problem is the following:
I need to generate a lot of random numbers in parallel using the Binomial Distribution on CUDA. All the random number generators on CUDA are based on the Uniform Distribution (as far as I know), which is also useful, since all the algorithms for the Binomial Distribution need Uniform variates.
Is there any library or implementation for binomial random variate generation on CUDA? I see that there is one for Java at http://acs.lbl.gov/~hoschek/colt/ , but it uses an algorithm that is too complicated to parallelize. There are simpler algorithms for generating a binomial variate B(N,p), but they have complexity O(N), which is bad for me because N can be large (around 2^32, the maximum for a 32-bit integer).
I would appreciate any help. Thanks a lot.
Miguel
P.S.: sorry for my bad English :)

That's an interesting problem. I would attack it by taking an existing solution and adapting it to the way CUDA works.
CiteSeerX is where you can get hold of PDFs of research that might help:
http://citeseerx.ist.psu.edu/
Did you take a look at MDGPU? It was suggested in another question on SO:
http://www-old.amolf.nl/~vanmeel/mdgpu/licence.html
Also, NAG has a library which may help:
http://www.nag.co.uk/numeric/gpus/
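If an approximate sampler is acceptable, one O(1)-per-variate workaround for huge N (just a sketch, not an existing library routine; the kernel name and interface below are made up for illustration) is the normal approximation B(N, p) ≈ Normal(N·p, N·p·(1−p)), which is reasonable when N·p·(1−p) is large:

    #include <curand_kernel.h>

    // Sketch only: approximate B(N, p) by a normal distribution with the same
    // mean N*p and variance N*p*(1-p).  O(1) per variate (no O(N) loop), but it
    // is an approximation that is only reasonable when N*p*(1-p) is large.
    __global__ void binomial_normal_approx(unsigned long long seed,
                                           unsigned long long n, double p,
                                           unsigned long long *out, int count)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= count) return;

        curandState state;
        curand_init(seed, idx, 0, &state);      // independent subsequence per thread

        double mean  = (double)n * p;
        double sigma = sqrt(mean * (1.0 - p));
        double x     = mean + sigma * curand_normal_double(&state);

        if (x < 0.0) x = 0.0;                   // clamp to the support {0, ..., n}
        if (x > (double)n) x = (double)n;
        out[idx] = (unsigned long long)(x + 0.5);
    }

Launched as, e.g., binomial_normal_approx<<<(count + 255) / 256, 256>>>(seed, N, p, d_out, count), it fills d_out with count approximate B(N, p) samples. An exact sampler for large N would still need something like an acceptance-rejection method, which is harder to map onto one thread per variate but also consumes only uniform variates.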

Related

Understanding Harvey & van der Hoeven 2019 algorithm (huge integer multiplication)

I need to multiply several very large integers as efficiently as possible.
I am trying to implement the Harvey & van der Hoeven 2019 algorithm for integer multiplication, but I am stuck on understanding the definition and the mathematics behind it, especially the Agarwal–Cooley algorithm.
Any help in understanding this algorithm, like a practical example or some pseudo-code, would be highly appreciated.
Remember that Big-O notation is defined in terms of a constant multiple and a cutoff: |f(x)| must be bounded by a constant times g(x) once x is large enough.
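Spelled out, with M as the hidden constant:

    f(x) = O(g(x))  ⇔  there exist M > 0 and x₀ such that |f(x)| ≤ M·g(x) for all x ≥ x₀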
The problem with the Harvey & van der Hoeven (2019) algorithm is that the x₀ involved is quite large. Therefore, for most inputs, their algorithm gives a way to multiply integers inefficiently. For very large numbers, though, it really does achieve O(n log n) multiplication.
But how big are those numbers? David Harvey, one of the authors, states:
The new algorithm is not really practical in its current form, because the proof given in our paper only works for ludicrously large numbers. Even if each digit was written on a hydrogen atom, there would not be nearly enough room available in the observable universe to write them down.
On the other hand, we are hopeful that with further refinements, the algorithm might become practical for numbers with merely billions or trillions of digits. If so, it may well become an indispensable tool in the computational mathematician's arsenal.
Therefore, if you are serious about your stated goal of multiplying big numbers quickly, this algorithm is not the way you should go about doing it.
If your long integers are shorter than about 10,000 bits and you are using a regular 32- or 64-bit computer, I suggest Karatsuba-Ofman. It can be sped up using parallelism, e.g. multi-threading or a GPU.
If you want to make a custom chip to do it fully in parallel, use 4XY = (X+Y)^2 - (X-Y)^2 and build a Karatsuba-Ofman squarer. That takes less chip area because the squarer has only n input lines instead of 2n.
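As a concrete illustration of that identity in software (plain C++ rather than hardware, and the function names are mine): the first routine multiplies via two squarings using 4xy = (x+y)^2 - (x-y)^2, and the second is a toy Karatsuba-Ofman-style squarer that needs only three half-size squares. Both use the GCC/Clang __int128 extension to hold the wide intermediates.

    #include <cstdint>

    // 4xy = (x+y)^2 - (x-y)^2: a multiplier built out of two squarings.
    // s and d have the same parity, so s*s - d*d is always divisible by 4.
    static uint64_t mul_via_squares(uint32_t x, uint32_t y) {
        unsigned __int128 s = (unsigned __int128)x + y;
        unsigned __int128 d = (x > y) ? (x - y) : (y - x);
        return (uint64_t)((s * s - d * d) >> 2);
    }

    // Karatsuba-Ofman-style squaring: split x = a*2^32 + b, then
    //   x^2 = a^2*2^64 + ((a+b)^2 - a^2 - b^2)*2^32 + b^2,
    // i.e. one square costs three half-size squares plus additions and shifts,
    // never a general product.  (Toy single-word version; a real big-integer
    // squarer would recurse on multi-limb halves.)
    static unsigned __int128 karatsuba_square(uint64_t x) {
        uint64_t a = x >> 32, b = x & 0xFFFFFFFFull;
        unsigned __int128 a2 = (unsigned __int128)a * a;
        unsigned __int128 b2 = (unsigned __int128)b * b;
        unsigned __int128 s2 = (unsigned __int128)(a + b) * (a + b);
        return (a2 << 64) + ((s2 - a2 - b2) << 32) + b2;
    }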

Speed of a PRNG

Is there a specific algorithm or method for calculating the speed of a pseudo-random number generator?
I've recently made a PRNG, and from my last question here I learned that Big-O analysis is not suitable in my situation.
I want to compare my program's speed to that of a well-known pseudo-random number generator, but I can't find any useful information about how to do that.
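There is no formula for this; the usual approach (just a sketch, not an answer from this thread) is to benchmark throughput: generate a large batch of values with each generator and report values per second, comparing your PRNG against a standard one such as std::mt19937. The helper below is illustrative, with names of my own choosing.

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <random>

    // Time how many 32-bit values per second a generator produces.
    // The volatile sink keeps the compiler from optimizing the loop away.
    template <typename Gen>
    double values_per_second(Gen gen, std::size_t count) {
        volatile std::uint32_t sink = 0;
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < count; ++i)
            sink = static_cast<std::uint32_t>(gen());
        auto stop = std::chrono::steady_clock::now();
        (void)sink;
        std::chrono::duration<double> elapsed = stop - start;
        return count / elapsed.count();
    }

    int main() {
        const std::size_t n = 100000000;   // large enough to dwarf timer overhead
        std::printf("mt19937: %.0f values/s\n",
                    values_per_second(std::mt19937{42}, n));
        // Swap in your own PRNG here: any object callable as gen() works.
        std::printf("minstd:  %.0f values/s\n",
                    values_per_second(std::minstd_rand{42}, n));
    }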

Fourier transformation Algorithms

Please do bear with me if you find my query a little stupid. I am currently doing a high school research project on how the Fourier transform can be used in recognizing human speech (similar to how Shazam works). I need two different Fast Fourier Transform algorithms for this project. One of the algorithms I am using will definitely be the Cooley-Tukey FFT algorithm; however, I am unsure which other FFT algorithm I should use. What would be a good algorithm to use, and is there any pseudo-code/source code for that particular algorithm? I was only able to find algorithms for Cooley-Tukey thus far.
Thanks!
If you don't need speed (i.e. you don't have tight performance constraints), then a plain DFT (a straight matrix multiply) will produce very similar results (differing only by rounding noise) using a very different algorithm.
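For reference, that "straight matrix multiply" DFT can be as short as the following host-side sketch. It is O(N^2), but it is a genuinely different algorithm from Cooley-Tukey, so it is useful for cross-checking an FFT's output (the function name is mine).

    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    // Naive DFT: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N).  O(N^2), but results
    // should match an FFT of the same input up to rounding error.
    std::vector<std::complex<double>>
    naive_dft(const std::vector<std::complex<double>>& x) {
        const std::size_t N = x.size();
        const double pi = 3.14159265358979323846;
        std::vector<std::complex<double>> X(N);
        for (std::size_t k = 0; k < N; ++k) {
            std::complex<double> sum(0.0, 0.0);
            for (std::size_t n = 0; n < N; ++n) {
                double angle = -2.0 * pi * double(k) * double(n) / double(N);
                sum += x[n] * std::complex<double>(std::cos(angle), std::sin(angle));
            }
            X[k] = sum;
        }
        return X;
    }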

Where is the Sieve of Eratosthenes used today?

I'm doing a research paper on the topic, and while I can find a lot of examples and discussion about how the algorithm works and should be implemented, I can't find anything on where it's actually used.
Is there any field in which the algorithm is used today? Or do people just implement it for "shits 'n giggles" (it's fairly simple, so that would make some sense)?
I know that large prime numbers are important in the field of encryption, but I doubt the sieve is used to find/generate those primes. Besides, the huge amount of memory it would need makes it inefficient for finding large primes anyway.
So is the algorithm, in any form, used anywhere today?
According to the Wikipedia article on the subject, that particular sieve is still a very efficient method for producing the full list of primes up to a few million. Also, the general idea of a sieve is used in several other, more powerful algorithms, such as the general number field sieve for factoring large integers.
You can view a prime sieve as an application of dynamic programming to enumerating and testing all small primes at once. So your question is really "what do we need prime numbers for?", and they are a fundamental part of number theory. As one example, encoding an integer into its prime factorization has all sorts of useful properties and higher-level utility, and by storing a little extra information in the sieve we can recover that factorization very quickly, as sketched below.
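Here is a sketch of that idea: the common variant of the sieve that records each number's smallest prime factor, so that factorizing any n up to the limit is just repeated table lookups (the names below are mine, not from a particular library).

    #include <cstdint>
    #include <vector>

    // Sieve of Eratosthenes that records each n's smallest prime factor (SPF).
    // spf[n] == n exactly when n is prime.
    std::vector<std::uint32_t> spf_sieve(std::uint32_t limit) {
        std::vector<std::uint32_t> spf(limit + 1, 0);
        for (std::uint32_t i = 2; i <= limit; ++i) {
            if (spf[i] != 0) continue;                 // composite, already marked
            for (std::uint64_t j = i; j <= limit; j += i)
                if (spf[j] == 0) spf[j] = i;           // i is the smallest factor of j
        }
        return spf;
    }

    // Factor n (2 <= n <= limit) by walking down its smallest prime factors.
    std::vector<std::uint32_t> factorize(std::uint32_t n,
                                         const std::vector<std::uint32_t>& spf) {
        std::vector<std::uint32_t> factors;
        while (n > 1) {
            factors.push_back(spf[n]);
            n /= spf[n];
        }
        return factors;   // e.g. 360 -> {2, 2, 2, 3, 3, 5}
    }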

Strassen's Algorithm proof

I have been reading about the Strassen Algorithm for matrix multiplication.
As mentioned in Introduction to Algorithms by Cormen, the algorithm is not intuitive. However, I am curious to know whether there exists any rigorous mathematical proof of the algorithm and what actually went into its design.
I tried searching on Google and Stack Overflow, but all the links either compare Strassen's approach to the standard matrix multiplication approach or just elaborate on the procedure the algorithm prescribes.
You should go to the source material. In this case, the original paper by Strassen:
Strassen, Volker, Gaussian Elimination is not Optimal, Numer. Math. 13, pp. 354-356, 1969.
http://link.springer.com/article/10.1007%2FBF02165411?LI=true
Even though I haven't read it myself, I would assume that there is a rigorous discussion and proof of the complexity of the algorithm.
It looks like Professor Strassen is still active (http://en.wikipedia.org/wiki/Volker_Strassen) and has a home page (http://www.math.uni-konstanz.de/~strassen/). If, after learning as much as you can about the algorithm, you are still interested in learning more, I don't think a carefully worded email to the professor would be out of the question.
Unfortunately, there does not seem to be a free version of the paper available online despite the fact that the work was completed at a public university (UC Berkeley) using federal funds (NSF grant), but that is a completely separate issue we shouldn't discuss here.
If you are a student, you will likely have access via your school, or at least your school could get you a copy without cost to you. Good luck.
The proof that Strassen's algorithm should exist is a simple dimension count (combined with a proof that the naive dimension count gives the correct answer). Consider the vector space of all bilinear maps $C^n\times C^n \rightarrow C^n$; this is a vector space of dimension $n^3$ (in the case of matrix multiplication, we have $n=m^2$, e.g. $n=4$ for the $2\times 2$ case). The set of bilinear maps of rank one, i.e. those computable in an algorithm using just one scalar multiplication, has dimension $3(n-1)+1$, and the set of bilinear maps of rank at most $r$ has dimension the minimum of $r[3(n-1)]+r$ and $n^3$ for most values of $n,r$ (and one can check that this is correct when $r=7$, $n=4$). Thus any bilinear map $C^4\times C^4\rightarrow C^4$ has, with probability one, rank at most $7$, and may always be approximated to arbitrary precision by a bilinear map of rank at most $7$.
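The dimension count only says that a rank-$7$ decomposition exists (or can be approximated); Strassen's paper exhibits one explicitly. As a concrete companion to the argument above, here is the standard way his seven products are written for the $2\times 2$ case, as a small plain-C++ routine (the function and type names are mine):

    #include <array>

    using Mat2 = std::array<std::array<double, 2>, 2>;

    // Strassen's 2x2 multiplication: 7 multiplications instead of 8.
    // Applied recursively to n/2 x n/2 blocks, this yields the O(n^log2(7)) bound.
    Mat2 strassen2x2(const Mat2& A, const Mat2& B) {
        double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
        double m2 = (A[1][0] + A[1][1]) *  B[0][0];
        double m3 =  A[0][0]            * (B[0][1] - B[1][1]);
        double m4 =  A[1][1]            * (B[1][0] - B[0][0]);
        double m5 = (A[0][0] + A[0][1]) *  B[1][1];
        double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
        double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);

        Mat2 C;
        C[0][0] = m1 + m4 - m5 + m7;
        C[0][1] = m3 + m5;
        C[1][0] = m2 + m4;
        C[1][1] = m1 - m2 + m3 + m6;
        return C;
    }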
