Suppose I have large buffers of uniformly random bytes (entropy source).
I want to use them to draw many samples (e.g. 10^7 at a time) from a fixed (rational) probability distribution over a finite set (e.g., 8 elements).
I need
a theoretical guarantee that the specified distribution is reproduced exactly.
to be reasonably efficient with the random bits. E.g., if the Shannon entropy H of my distribution (over 8 symbols) is around 2.3, I would like to use at most 3 bits from my stream on average to draw a sample. Even better would be to stay within, say, 20% of the Shannon limit.
to sample quickly. At the very least 100 Mbyte/sec on "one core of a standard processor".
reasonable RAM usage (not counting stored sampling results) below, say, 200MB
I do not care about the runtime of pre-computations that need to be done once per distribution.
There are very many algorithms and implementations to choose from, and I'm having trouble comparing them in terms of trading off entropy consumption, speed, and memory. I found an overview in this SO question. There are also many papers comparing algorithms (e.g. arXiv:1502.02539v6) and new algorithms being proposed (e.g. the "Fast Loaded Dice Roller", arXiv:2003.03830v2).
Knuth and Yao show that any optimal (in terms of entropy consumption) algorithm (that spits out one sample at a time) consumes between H and H+2 bits of entropy. By drawing multiple samples (i.e., sampling from the product distribution), one can get closer to the Shannon limit of using H bits per sample on average. This is sometimes called "batching".
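To make the batching argument concrete, here is a small sanity check (in Python, with a made-up 8-symbol distribution, not my real one): sampling k symbols at once is one draw from the product distribution, whose entropy is kH, so the Knuth-Yao bound works out to at most H + 2/k bits per symbol.

    from math import log2

    weights = [1, 1, 2, 4, 8, 16, 32, 64]        # made-up rational 8-symbol distribution
    probs = [w / sum(weights) for w in weights]
    H = -sum(p * log2(p) for p in probs)         # Shannon entropy per symbol

    for k in (1, 2, 4, 8, 16):
        bound = H + 2 / k                        # Knuth-Yao upper bound per symbol when batching k draws
        print(f"k={k:2d}: at most {bound:.3f} bits/symbol ({bound / H - 1:+.1%} over H = {H:.3f})")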
My first instinct would thus be to use, say, the available C implementation of the "Fast Loaded Dice Roller" after "packing" my symbols to get a distribution over a range of integers that fits into one (or a few) bytes. However, the descriptions of these algorithms don't seem to focus on "batching". I wonder whether other methods could be more efficient (in entropy consumption, speed, or memory) by making use of my large batch sizes.
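For concreteness, this is roughly how I imagine the packing step (a sketch with made-up weights, not benchmarked): build the k-fold product distribution as integer weights, hand that table to whichever sampler wins, and decode each drawn index back into k symbols.

    def product_weights(weights, k):
        """Integer weights of the k-fold product distribution (index is base-m, m = len(weights))."""
        out = [1]
        for _ in range(k):
            out = [a * b for a in out for b in weights]
        return out

    def unpack(index, m, k):
        """Decode a packed index back into its k individual symbol indices."""
        digits = []
        for _ in range(k):
            index, d = divmod(index, m)
            digits.append(d)
        return digits[::-1]

    weights = [1, 1, 2, 4, 8, 16, 32, 64]      # made-up example distribution
    k = 4                                       # 8**4 = 4096 packed outcomes
    batched = product_weights(weights, k)       # hand this table to FLDR (or any other sampler)
    print(unpack(1234, m=len(weights), k=k))    # the 4 symbols behind packed index 1234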
Related
I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize an algorithm which needs to sort each vector and calculate its average, standard deviation and skewness.
In the next step, the algorithm has to delete the maximal element and repeat the calculation of the statistical moments until some stopping criterion is met, independently for each vector.
I would like to ask someone more experienced what the best approach to parallelizing this algorithm is.
Is it possible to sort more than one vector at once?
Or is it better not to parallelize the sorting, but instead to run the whole algorithm as a single thread per vector?
200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vectors = 200,000 blocks; but you can't have 2,000 threads in a block (the hardware limit is 1,024), and that many would be excessive anyway, so each thread will have to handle several elements.
You might be able to use cub's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong, though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own - although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
Or is it better not to parallelize the sorting, but instead to run the whole algorithm as a single thread per vector?
This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.
I am writing a genetic algorithm. My population quickly develops a monoculture. I am using a small population (32 individuals) with a small number of discrete genes (24 genes per individual) and a single point cross-over mating approach. Combine that with a roulette wheel selection strategy and it is easy to see how all the genetic diversity is lost in just a few dozen generations.
What I would like to know is, what is the appropriate response? I do not have academic-level knowledge on GAs and only a few solutions come to mind:
Use a larger population. (slow)
Use runtime checks to prevent in-breeding. (slow)
Use more cross-over points. (not very effective)
Raise the number of mutations.
What are some appropriate responses to the situation?
I would look at a larger population; 32 individuals is a very small population. I usually run GAs with a population at least in the (number of chromosomes)^2 range (by experience) to get a good starting distribution of individuals.
A possible way to speed things up with a larger population is to spawn different threads (one per individual, possibly in batches) when running your fitness function (usually the most expensive part of a GA).
Assuming a population of 32 and a quad-core system, spawn threads in batches of 8 (2 threads per core will interleave nicely) and you should be able to run approximately 4x faster.
Therefore, if you have a time limit on how long to run your GA, this may be a solution.
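A rough sketch of that batching idea (the fitness function is just a placeholder, and I'm using a process pool here because CPU-bound work doesn't overlap well under plain threads in Python, but the structure is the same):

    from concurrent.futures import ProcessPoolExecutor
    import random

    def fitness(genome):
        # stand-in for an expensive evaluation
        return sum(genome)

    if __name__ == "__main__":
        population = [[random.randint(0, 1) for _ in range(24)] for _ in range(32)]
        # evaluate the whole population in parallel, a few genomes per worker
        with ProcessPoolExecutor(max_workers=8) as pool:
            scores = list(pool.map(fitness, population, chunksize=4))
        print(max(scores))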
You can add to that:
tournament selection instead of roulette wheel (see the sketch after this list)
an island scheme (multiple separated sub-populations, with migration)
restarts
incorporating ideas from estimation of distribution algorithms (EDA) (resampling the domain close to promising areas to introduce new individuals)
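For the first point, here is a minimal sketch of tournament selection (tournament size and the fitness values are placeholders); a small tournament keeps selection pressure moderate and preserves more diversity than roulette wheel on a 32-individual population:

    import random

    def tournament_select(population, fitnesses, k=3):
        """Return the fittest of k randomly chosen individuals."""
        contestants = random.sample(range(len(population)), k)
        winner = max(contestants, key=lambda i: fitnesses[i])
        return population[winner]

    population = [[random.randint(0, 1) for _ in range(24)] for _ in range(32)]
    fitnesses = [sum(ind) for ind in population]      # placeholder fitness
    parent_a = tournament_select(population, fitnesses)
    parent_b = tournament_select(population, fitnesses)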
Just a foreword: I'm not exactly clear on how a RNG actually works.
If I write a simple routine to randomly pick a number between 0 and 1, and run this n number of times, I would expect a certain, relatively random distribution of numbers that should approach 50/50, give or take with some variance - i.e. a delta of x percent skewed one way or the other.
By looking at this variance, am I going to be able to see any meaningful patterns across a population of different devices?
For example, if I have a large population of iPhones running this routine simultaneously, would they all see a similar variance compared to running them on different days or compared to running a large batch of Android or WP7 devices? Or will the variance truly be random and be all over the place regardless of device or time or any other factor that would affect the randomness of the distribution?
This depends entirely on several important properties of the PRNG. A PRNG (Pseudo-Random Number Generator) is an algorithm which can only simulate random numbers and which is initialized from an input state (the seed).
PRNGs are measured by their periodicity (how soon the sequence loops around), their distribution, and how easy it is to derive the next value from previous values or from a known seed. All of these properties are very important for PRNGs used for cryptographic purposes.
So in short, it varies completely with the algorithm in use on each of those devices. Provided the input state and the algorithm are the same, the output can be expected to be the same.
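For example (using Python's built-in generator purely as an illustration):

    import random

    a = random.Random(12345)
    b = random.Random(12345)
    # identical algorithm + identical seed => identical stream, on any device
    assert [a.random() for _ in range(5)] == [b.random() for _ in range(5)]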
If you want to test the quality of a PRNG, you can use the guidelines in FIPS-140-2, or use the DieHarder test suite.
Also refer to the Wikipedia page.
How would you mathematically model the distribution of repeated real life performance measurements - "Real life" meaning you are not just looping over the code in question, but it is just a short snippet within a large application running in a typical user scenario?
My experience shows that you usually have a peak around the average execution time that can be modeled adequately with a Gaussian distribution. In addition, there's a "long tail" containing outliers - often with a multiple of the average time. (The behavior is understandable considering the factors contributing to first execution penalty).
My goal is to model aggregate values that reasonably reflect this, and can be calculated from aggregate values (like for the Gaussian, calculate mu and sigma from N, sum of values and sum of squares). In other terms, number of repetitions is unlimited, but memory and calculation requirements should be minimized.
A normal Gaussian distribution can't model the long tail appropriately and will have the average biased strongly even by a very small percentage of outliers.
I am looking for ideas, especially if this has been attempted/analysed before. I've checked various distributions models, and I think I could work out something, but my statistics is rusty and I might end up with an overblown solution. Oh, a complete shrink-wrapped solution would be fine, too ;)
Other aspects / ideas: Sometimes you get "two humps" distributions, which would be acceptable in my scenario with a single mu/sigma covering both, but ideally would be identified separately.
Extrapolating this, another approach would be a "floating probability density calculation" that uses only a limited buffer and adjusts automatically to the range (due to the long tail, bins may not be spaced evenly) - haven't found anything, but with some assumptions about the distribution it should be possible in principle.
Why (since it was asked) -
For a complex process we need to make guarantees such as "only 0.1% of runs exceed a limit of 3 seconds, and the average processing time is 2.8 seconds". The performance of an isolated piece of code can be very different from a normal run-time environment involving varying levels of disk and network access, background services, scheduled events that occur within a day, etc.
This can be solved trivially by accumulating all data. However, to accumulate this data in production, the amount of data produced needs to be limited. For analysis of isolated pieces of code, a Gaussian distribution plus a first-run penalty is OK. That doesn't work anymore for the distributions found above.
[edit] I've already got very good answers (and finally - maybe - some time to work on this). I'm starting a bounty to look for more input / ideas.
Often when you have a random value that can only be positive, a log-normal distribution is a good way to model it. That is, you take the log of each measurement, and assume that is normally distributed.
If you want, you can consider that to have multiple humps, i.e. to be the sum of two normals having different mean. Those are a bit tricky to estimate the parameters of, because you may have to estimate, for each measurement, its probability of belonging to each hump. That may be more than you want to bother with.
Log-normal distributions are very convenient and well-behaved. For example, you don't deal with the arithmetic mean, you deal with the geometric mean, which is the same as the median.
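To tie this to the aggregate-only requirement in the question, a minimal sketch (the variable names and toy data are mine): keep N, the sum of logs and the sum of squared logs, and you can recover the geometric mean (median) and any tail percentile of the fitted log-normal.

    import math

    n = 0
    sum_log = 0.0
    sum_log_sq = 0.0

    def add(x):
        """Fold one measurement into the running aggregates."""
        global n, sum_log, sum_log_sq
        n += 1
        sum_log += math.log(x)
        sum_log_sq += math.log(x) ** 2

    for x in (0.9, 1.1, 1.0, 1.2, 8.5):       # toy timings with one long-tail outlier
        add(x)

    mu = sum_log / n
    sigma = math.sqrt(max(sum_log_sq / n - mu * mu, 0.0))
    geometric_mean = math.exp(mu)              # equals the median of the fitted log-normal
    p999 = math.exp(mu + 3.09 * sigma)         # roughly the 99.9th percentile (z ~ 3.09)
    print(geometric_mean, p999)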
BTW, in pharmacometric modeling, log-normal distributions are ubiquitous, modeling such things as blood volume, absorption and elimination rates, body mass, etc.
ADDED: If you want what you call a floating distribution, that's called an empirical or non-parametric distribution. To model that, typically you save the measurements in a sorted array. Then it's easy to pick off the percentiles. For example the median is the "middle number". If you have too many measurements to save, you can go to some kind of binning after you have enough measurements to get the general shape.
ADDED: There's an easy way to tell if a distribution is normal (or log-normal). Take the logs of the measurements and put them in a sorted array. Then generate a QQ plot (quantile-quantile). To do that, generate as many normal random numbers as you have samples, and sort them. Then just plot the points, where X is the normal distribution point, and Y is the log-sample point. The results should be a straight line. (A really simple way to generate a normal random number is to just add together 12 uniform random numbers in the range +/- 0.5.)
The problem you describe is called "Distribution Fitting" and has nothing to do with performance measurements per se, i.e. it is the generic problem of fitting a suitable distribution to any gathered/measured data sample.
The standard process is something like this:
1. Guess a candidate distribution.
2. Fit its parameters to the gathered data.
3. Run hypothesis tests to check how well the fitted distribution describes the data.
4. Repeat 1-3 if the fit is not good enough.
You can find an interesting article describing how this can be done with the open-source R software system here. I think the function fitdistr may be especially useful to you.
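If you would rather not use R, the same guess-fit-test loop looks roughly like this in Python with scipy (my own sketch, not from the article):

    import numpy as np
    from scipy import stats

    samples = np.random.lognormal(mean=1.0, sigma=0.4, size=5000)   # stand-in data

    # 1. guess a candidate, 2. fit its parameters, 3. test the fit;
    # go back to 1 with another candidate if the fit is rejected
    for candidate in (stats.lognorm, stats.gamma, stats.norm):
        params = candidate.fit(samples)
        statistic, p_value = stats.kstest(samples, candidate.name, args=params)
        print(f"{candidate.name}: KS p-value = {p_value:.3f}")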
In addition to the answers already given, consider empirical distributions. I have had good experience using empirical distributions for performance analysis of several distributed systems. The idea is very straightforward. You need to build a histogram of performance measurements. Measurements should be discretized with a given accuracy. Once you have the histogram, you can do several useful things:
calculate the probability of any given value (you are bound by accuracy only);
build PDF and CDF functions for the performance measurements;
generate a sequence of response times according to the distribution. This is very useful for performance modeling.
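A bare-bones sketch of the idea (the bin width and the stand-in data are arbitrary choices):

    import bisect
    import random

    measurements = [random.lognormvariate(1.0, 0.4) for _ in range(10000)]   # stand-in data
    bin_width = 0.05                                    # chosen accuracy

    counts = {}
    for x in measurements:
        b = int(x / bin_width)
        counts[b] = counts.get(b, 0) + 1

    bins = sorted(counts)
    cum = []
    running = 0
    for b in bins:
        running += counts[b]
        cum.append(running / len(measurements))         # empirical CDF, bin by bin

    def prob_at_most(x):
        """P(measurement <= x), up to the bin accuracy."""
        i = bisect.bisect_right(bins, int(x / bin_width)) - 1
        return cum[i] if i >= 0 else 0.0

    def generate():
        """Draw a synthetic response time from the empirical distribution."""
        u = random.random()
        i = bisect.bisect_left(cum, u)
        return (bins[i] + 0.5) * bin_width

    print(prob_at_most(3.0), generate())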
Try the gamma distribution: http://en.wikipedia.org/wiki/Gamma_distribution
From Wikipedia:
The gamma distribution is frequently a probability model for waiting times; for instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.
The standard for randomized arrival times in performance modelling is the exponential distribution for inter-arrival times, which corresponds to a Poisson process (the Poisson distribution then gives the number of arrivals in a fixed interval).
Not exactly answering your question, but relevant still: Mor Harchol-Balter did a very nice analysis of the size of jobs submitted to a scheduler, The effect of heavy-tailed job size distributions on computer systems design (1999). She found that the sizes of jobs submitted to her distributed task assignment system followed a power-law distribution, which meant that certain pieces of conventional wisdom she had assumed in the construction of her task assignment system - most importantly that the jobs should be well load balanced - had awful consequences for submitters of jobs. She has done good follow-up work on this issue.
The broader point is, you need to ask such questions as:
What happens if reasonable-seeming assumptions about the distribution of performance, such as that they take a normal distribution, break down?
Are the data sets I'm looking at really representative of the problem I'm trying to solve?
I have an information retrieval application that creates bit arrays on the order of 10s of million bits. The number of "set" bits in the array varies widely, from all clear to all set. Currently, I'm using a straight-forward bit array (java.util.BitSet), so each of my bit arrays takes several megabytes.
My plan is to look at the cardinality of the first N bits, then make a decision about what data structure to use for the remainder. Clearly some data structures are better for very sparse bit arrays, and others when roughly half the bits are set (when most bits are set, I can use negation to treat it as a sparse set of zeroes).
What structures might be good at each extreme?
Are there any in the middle?
Here are a few constraints or hints:
The bits are set only once, and in index order.
I need 100% accuracy, so something like a Bloom filter isn't good enough.
After the set is built, I need to be able to efficiently iterate over the "set" bits.
The bits are randomly distributed, so run-length–encoding algorithms aren't likely to be much better than a simple list of bit indexes.
I'm trying to optimize memory utilization, but speed still carries some weight.
Something with an open source Java implementation is helpful, but not strictly necessary. I'm more interested in the fundamentals.
Unless the data is truly random with a symmetric 1/0 distribution, this simply becomes a lossless data compression problem, very analogous to the CCITT Group 3 compression used for black-and-white (i.e. binary) FAX images. CCITT Group 3 uses a Huffman coding scheme. In the case of FAX it is a fixed set of Huffman codes, but you can generate a specific set of codes for a given data set to improve the compression ratio achieved. As long as you only need to access the bits sequentially, as you implied, this will be a pretty efficient approach. Random access would create some additional challenges, but you could probably generate a binary search tree index to various offset points in the array that would allow you to get close to the desired location and then walk in from there.
Note: The Huffman scheme still works well even if the data is random, as long as the 1/0 distribution is not perfectly even. That is, the less even the distribution, the better the compression ratio.
Finally, if the bits are truly random with an even distribution, then, well, according to Mr. Claude Shannon, you are not going to be able to compress it any significant amount using any scheme.
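To put a number on that, a small illustration (mine, not part of the original argument): the binary entropy H(p), in bits per bit, is the floor any lossless scheme is up against, so the achievable ratio depends entirely on how uneven the 1/0 split is.

    from math import log2

    def binary_entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -p * log2(p) - (1 - p) * log2(1 - p)

    n_bits = 10_000_000                        # "10s of millions of bits"
    for p in (0.5, 0.25, 0.1, 0.01, 0.001):   # fraction of set bits
        best = binary_entropy(p) * n_bits / 8 / 1024
        print(f"p={p}: best case ~{best:,.0f} KiB (raw: {n_bits / 8 / 1024:,.0f} KiB)")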
I would strongly consider using range encoding in place of Huffman coding. In general, range encoding can exploit asymmetry more effectively than Huffman coding, but this is especially so when the alphabet size is so small. In fact, when the "native alphabet" is simply 0s and 1s, the only way Huffman can get any compression at all is by combining those symbols -- which is exactly what range encoding will do, more effectively.
Maybe too late for you, but there is a very fast and memory efficient library for sparse bit arrays (lossless) and other data types based on tries. Look at Judy arrays
Thanks for the answers. This is what I'm going to try for dynamically choosing the right method:
I'll collect all of the first N hits in a conventional bit array, and choose one of three methods, based on the symmetry of this sample.
If the sample is highly asymmetric, I'll simply store the indexes to the set bits (or maybe the distance to the next bit) in a list.
If the sample is highly symmetric, I'll keep using a conventional bit array.
If the sample is moderately symmetric, I'll use a lossless compression method like the Huffman coding suggested by InSciTekJeff.
The boundaries between the asymmetric, moderate, and symmetric regions will depend on the time required by the various algorithms balanced against the space they need, where the relative value of time versus space would be an adjustable parameter. The space needed for Huffman coding is a function of the symmetry, and I'll profile that with testing. Also, I'll test all three methods to determine the time requirements of my implementation.
It's possible (and actually I'm hoping) that the middle compression method will always be better than the list or the bit array or both. Maybe I can encourage this by choosing a set of Huffman codes adapted for higher or lower symmetry. Then I can simplify the system and just use two methods.
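Sketching the logic in Python for brevity (the thresholds are placeholders until I have profiling data; the real implementation will be in Java):

    import random

    def choose_representation(sample_bits, low=0.05, high=0.45):
        """Pick a storage strategy from the fraction of set bits in the first N bits."""
        density = sum(sample_bits) / len(sample_bits)
        # treat "mostly ones" like "mostly zeroes": measure distance from a 50/50 split
        skew = min(density, 1.0 - density)
        if skew < low:
            return "index list"         # store positions (or gaps) of the rare bit value
        elif skew > high:
            return "plain bit array"    # roughly symmetric: keep BitSet-style storage
        else:
            return "huffman blocks"     # in between: compress fixed-size blocks

    sample = [1 if random.random() < 0.02 else 0 for _ in range(100_000)]
    print(choose_representation(sample))    # expected: "index list"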
One more compression thought:
If the bit array is not crazy long, you could try applying the Burrows-Wheeler transform before using any repetition encoding, such as Huffman. A naive implementation would take O(n^2) memory during (de)compression and O(n^2 log n) time to decompress - there are almost certainly shortcuts to be had, as well. But if there's any sequential structure to your data at all, this should really help the Huffman encoding out.
You could also apply that idea to one block at a time to keep the time/memory usage more practical. Using one block at time could allow you to always keep most of the data structure compressed if you're reading/writing sequentially.
Straightforward lossless compression is the way to go. To make it searchable, you will have to compress relatively small blocks and create an index into an array of the blocks. This index can contain the bit offset of the starting bit in each block.
Quick combinatoric proof that you can't really save much space:
Suppose you have an arbitrary subset of n/2 bits set to 1 out of n total bits. You have (n choose n/2) possibilities. Using Stirling's formula, this is roughly 2^n / sqrt(n) * sqrt(2/pi). If every possibility is equally likely, then there's no way to give more likely choices shorter representations. So we need log_2 (n choose n/2) bits, which is about n - (1/2)log(n) bits.
That's not a very good savings of memory. For example, if you're working with n=2^20 (1 meg), then you can only save about 10 bits. It's just not worth it.
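If you want to check that number exactly instead of via Stirling's approximation (my own addition):

    from math import lgamma, log

    def log2_binomial(n, k):
        return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

    n = 2 ** 20
    needed = log2_binomial(n, n // 2)       # bits needed for an arbitrary half-set array
    print(f"need {needed:.1f} bits vs {n} raw: saves about {n - needed:.1f} bits")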
Having said all that, it also seems very unlikely that any really useful data is truly random. In case there's any more structure to your data, there's probably a more optimistic answer.