Code is taking so much time to execute? - java-8

Random random = new Random();
random.ints()
.filter(e -> e<40 && e>1)
.limit(10)
.sorted()
.distinct()
.forEach(System.out::println);
i need distinct data but faster too. By removing this distinct execution be a little faster

As already mentioned in this answer, you should use the three-arg factory method rather than filter and limit. This avoids generating random values which you will drop afterwards. For some stream operations, it’s additionally beneficial to know the result stream size in advance (which non-intuitively doesn’t apply to limit).
Further, you should use ThreadLocalRandom, if you know that you are going to process the elements sequentially (otherwise, use new SplittableRandom(), if you are going to process a parallel stream).
ThreadLocalRandom.current().ints(10,2,40)
.sorted()
.distinct()
.forEach(System.out::println);
If that’s still to expensive, use
ThreadLocalRandom.current().ints(10,2,40)
.collect(BitSet::new, BitSet::set, BitSet::or)
.stream()
.forEach(System.out::println);
as a BitSet is intrinsically sorted and distinct. This is much more efficient than the current IntStream implementation, which will collect into an int[] array to sort the values and use a HashSet<Integer> to determine whether it has already seen values.

Your code is producing random ints in the range from Integer.MIN_VALUE to Integer.MAX_VALUE. Only a tiny portion of those values fits your required bounds. So a huge amount of random numbers is generated of which you discard > 99% .... This costs a lot of time.
Random has a 3-arguments version of ints() that can replace half your code. The arguments are the streamsize, and the upper and lower bounds of the numbers.
Your code would look like this:
Random random = new Random();
random.ints(10,2,40)
.sorted()
.distinct()
.forEach(System.out::println);
Just notice that you're not guaranteed to get 10 numbers, since distinct is used after the streamsize is set.

Related

Right way to permute [duplicate]

I've been using Random (java.util.Random) to shuffle a deck of 52 cards. There are 52! (8.0658175e+67) possibilities. Yet, I've found out that the seed for java.util.Random is a long, which is much smaller at 2^64 (1.8446744e+19).
From here, I'm suspicious whether java.util.Random is really that random; is it actually capable of generating all 52! possibilities?
If not, how can I reliably generate a better random sequence that can produce all 52! possibilities?
Selecting a random permutation requires simultaneously more and less randomness than what your question implies. Let me explain.
The bad news: need more randomness.
The fundamental flaw in your approach is that it's trying to choose between ~2226 possibilities using 64 bits of entropy (the random seed). To fairly choose between ~2226 possibilities you're going to have to find a way to generate 226 bits of entropy instead of 64.
There are several ways to generate random bits: dedicated hardware, CPU instructions, OS interfaces, online services. There is already an implicit assumption in your question that you can somehow generate 64 bits, so just do whatever you were going to do, only four times, and donate the excess bits to charity. :)
The good news: need less randomness.
Once you have those 226 random bits, the rest can be done deterministically and so the properties of java.util.Random can be made irrelevant. Here is how.
Let's say we generate all 52! permutations (bear with me) and sort them lexicographically.
To choose one of the permutations all we need is a single random integer between 0 and 52!-1. That integer is our 226 bits of entropy. We'll use it as an index into our sorted list of permutations. If the random index is uniformly distributed, not only are you guaranteed that all permutations can be chosen, they will be chosen equiprobably (which is a stronger guarantee than what the question is asking).
Now, you don't actually need to generate all those permutations. You can produce one directly, given its randomly chosen position in our hypothetical sorted list. This can be done in O(n2) time using the Lehmer[1] code (also see numbering permutations and factoriadic number system). The n here is the size of your deck, i.e. 52.
There is a C implementation in this StackOverflow answer. There are several integer variables there that would overflow for n=52, but luckily in Java you can use java.math.BigInteger. The rest of the computations can be transcribed almost as-is:
public static int[] shuffle(int n, BigInteger random_index) {
int[] perm = new int[n];
BigInteger[] fact = new BigInteger[n];
fact[0] = BigInteger.ONE;
for (int k = 1; k < n; ++k) {
fact[k] = fact[k - 1].multiply(BigInteger.valueOf(k));
}
// compute factorial code
for (int k = 0; k < n; ++k) {
BigInteger[] divmod = random_index.divideAndRemainder(fact[n - 1 - k]);
perm[k] = divmod[0].intValue();
random_index = divmod[1];
}
// readjust values to obtain the permutation
// start from the end and check if preceding values are lower
for (int k = n - 1; k > 0; --k) {
for (int j = k - 1; j >= 0; --j) {
if (perm[j] <= perm[k]) {
perm[k]++;
}
}
}
return perm;
}
public static void main (String[] args) {
System.out.printf("%s\n", Arrays.toString(
shuffle(52, new BigInteger(
"7890123456789012345678901234567890123456789012345678901234567890"))));
}
[1] Not to be confused with Lehrer. :)
Your analysis is correct: seeding a pseudo-random number generator with any specific seed must yield the same sequence after a shuffle, limiting the number of permutations that you could obtain to 264. This assertion is easy to verify experimentally by calling Collection.shuffle twice, passing a Random object initialized with the same seed, and observing that the two random shuffles are identical.
A solution to this, then, is to use a random number generator that allows for a larger seed. Java provides SecureRandom class that could be initialized with byte[] array of virtually unlimited size. You could then pass an instance of SecureRandom to Collections.shuffle to complete the task:
byte seed[] = new byte[...];
Random rnd = new SecureRandom(seed);
Collections.shuffle(deck, rnd);
In general, a pseudorandom number generator (PRNG) can't choose from among all permutations of a 52-item list if its maximum cycle length is less than 226 bits.
java.util.Random implements an algorithm with a modulus of 248 and a maximum cycle length of not more than that, so much less than 2226 (corresponding to the 226 bits I referred to). You will need to use another PRNG with a bigger cycle length, specifically one with a maximum cycle length of 52 factorial or greater.
See also "Shuffling" in my article on random number generators.
This consideration is independent of the nature of the PRNG; it applies equally to cryptographic and noncryptographic PRNGs (of course, noncryptographic PRNGs are inappropriate whenever information security is involved).
Although java.security.SecureRandom allows seeds of unlimited length to be passed in, the SecureRandom implementation could use an underlying PRNG (e.g., "SHA1PRNG" or "DRBG"). And it depends on that PRNG's maximum cycle length whether it's capable of choosing from among 52 factorial permutations.
Let me apologize in advance, because this is a little tough to understand...
First of all, you already know that java.util.Random is not completely random at all. It generates sequences in a perfectly predictable way from the seed. You are completely correct that, since the seed is only 64 bits long, it can only generate 2^64 different sequences. If you were to somehow generate 64 real random bits and use them to select a seed, you could not use that seed to randomly choose between all of the 52! possible sequences with equal probability.
However, this fact is of no consequence as long as you're not actually going to generate more than 2^64 sequences, as long as there is nothing 'special' or 'noticeably special' about the 2^64 sequences that it can generate.
Lets say you had a much better PRNG that used 1000-bit seeds. Imagine you had two ways to initialize it -- one way would initialize it using the whole seed, and one way would hash the seed down to 64 bits before initializing it.
If you didn't know which initializer was which, could you write any kind of test to distinguish them? Unless you were (un)lucky enough to end up initializing the bad one with the same 64 bits twice, then the answer is no. You could not distinguish between the two initializers without some detailed knowledge of some weakness in the specific PRNG implementation.
Alternatively, imagine that the Random class had an array of 2^64 sequences that were selected completely and random at some time in the distant past, and that the seed was just an index into this array.
So the fact that Random uses only 64 bits for its seed is actually not necessarily a problem statistically, as long as there is no significant chance that you will use the same seed twice.
Of course, for cryptographic purposes, a 64 bit seed is just not enough, because getting a system to use the same seed twice is computationally feasible.
EDIT:
I should add that, even though all of the above is correct, that the actual implementation of java.util.Random is not awesome. If you are writing a card game, maybe use the MessageDigest API to generate the SHA-256 hash of "MyGameName"+System.currentTimeMillis(), and use those bits to shuffle the deck. By the above argument, as long as your users are not really gambling, you don't have to worry that currentTimeMillis returns a long. If your users are really gambling, then use SecureRandom with no seed.
I'm going to take a bit of a different tack on this. You're right on your assumptions - your PRNG isn't going to be able to hit all 52! possibilities.
The question is: what's the scale of your card game?
If you're making a simple klondike-style game? Then you definitely don't need all 52! possibilities. Instead, look at it like this: a player will have 18 quintillion distinct games. Even accounting for the 'Birthday Problem', they'd have to play billions of hands before they'd run into the first duplicate game.
If you're making a monte-carlo simulation? Then you're probably okay. You might have to deal with artifacts due to the 'P' in PRNG, but you're probably not going to run into problems simply due to a low seed space (again, you're looking at quintillions of unique possibilities.) On the flip side, if you're working with large iteration count, then, yeah, your low seed space might be a deal-breaker.
If you're making a multiplayer card game, particularly if there's money on the line? Then you're going to need to do some googling on how the online poker sites handled the same problem you're asking about. Because while the low seed space issue isn't noticeable to the average player, it is exploitable if it's worth the time investment. (The poker sites all went through a phase where their PRNGs were 'hacked', letting someone see the hole cards of all the other players, simply by deducing the seed from exposed cards.) If this is the situation you're in, don't simply find a better PRNG - you'll need to treat it as seriously as a Crypto problem.
Short solution which is essentially the same of dasblinkenlight:
// Java 7
SecureRandom random = new SecureRandom();
// Java 8
SecureRandom random = SecureRandom.getInstanceStrong();
Collections.shuffle(deck, random);
You don't need to worry about the internal state. Long explanation why:
When you create a SecureRandom instance this way, it accesses an OS specific
true random number generator. This is either an entropy pool where values are
accessed which contain random bits (e.g. for a nanosecond timer the nanosecond
precision is essentially random) or an internal hardware number generator.
This input (!) which may still contain spurious traces are fed into a
cryptographically strong hash which removes those traces. That is the reason those CSPRNGs are used, not for creating those numbers themselves! The SecureRandom has a counter which traces how many bits were used (getBytes(), getLong() etc.) and refills the SecureRandom with entropy bits when necessary.
In short: Simply forget objections and use SecureRandom as true random number generator.
If you consider the number as just an array of bits (or bytes) then maybe you could use the (Secure)Random.nextBytes solutions suggested in this Stack Overflow question, and then map the array into a new BigInteger(byte[]).
A very simple algorithm is to apply SHA-256 to a sequence of integers incrementing from 0 upwards. (A salt can be appended if desired to "get a different sequence".) If we assume that the output of SHA-256 is "as good as" uniformly distributed integers between 0 and 2256 - 1 then we have enough entropy for the task.
To get a permutation from the output of SHA256 (when expressed as an integer) one simply needs to reduce it modulo 52, 51, 50... as in this pseudocode:
deck = [0..52]
shuffled = []
r = SHA256(i)
while deck.size > 0:
pick = r % deck.size
r = floor(r / deck.size)
shuffled.append(deck[pick])
delete deck[pick]
My Empirical research results are Java.Random is not totally truly random. If you try yourself by using Random class "nextGaussian()"-method and generate enough big sample population for numbers between -1 and 1, the graph is normal distbruted field know as Gaussian Model.
Finnish goverment owned gambling-bookmarker have a once per day whole year around every day drawn lottery-game where winning table shows that the Bookmarker gives winnings in normal distrbuted way. My Java Simulation with 5 million draws shows me that with nextInt() -methdod used number draw, winnings are normally distributed same kind of like the my Bookmarker deals the winnings in each draw.
My best picks are avoiding numbers 3 and 7 in each of ending ones and that's true that they are rarely in winning results. Couple of times won five out of five picks by avoiding 3 and 7 numbers in ones column in Integer between 1-70 (Keno).
Finnish Lottery drawn once per week Saturday evenings If you play System with 12 numbers out of 39, perhaps you get 5 or 6 right picks in your coupon by avoiding 3 and 7 values.
Finnish Lottery have numbers 1-40 to choose and it takes 4 coupon to cover all the nnumbers with 12 number system. The total cost is 240 euros and in long term it's too expensive for the regural gambler to play without going broke. Even if you share coupons to other customers available to buy still you have to be quite a lucky if you want to make profit.

Is there any probabilistic data structure that reduces the space complexity of a large number of counters?

Basically I need to keep track of a large number of counters. I can increment or decrement each counter by name. The simplest way to do so is to use a hash table, using counter_name as key and its corresponding count as the value for that key.
The counters don't need to be 100% accurate, approximate values for count are fine. So I'm wondering if there is any probabilistic data structure that can reduce the space complexity of N counters to lower than O(N), kinda similar to how HyperLogLog reduces the memory requirement of counting N items by giving only an approximate result. Any ideas?
In my opinion, the thing you are looking for is Count-min sketch.
Reading a stream of elements a1, a2, a3, ..., an where there can be a
lot of repeated elements, in any time it will give you the answer to
the following question: how many ai elements have you seen so far.
basically your unique elements can be bijected into your counters. Countmin sketch allows you to adjust parameters to trade your memory for the accuracy.
P.S. I described some other popular probabilistic data structures here.
Stefan Haustein's correct that the names are likely to take more space than the counters, and you may be able to prioritise certain names as he suggests, but failing that you can consider how best to store the names. If they're fairly short (e.g. 8 characters or less), you might consider using a closed hashing table that stores them directly in the buckets. If they're long, you could store them contiguously (NUL terminated) in a block of memory, and in the hash table store the offset into that block of their first character.
For the counter itself, you can save space by using a probabilistic approach as follows:
template <typename T, typename Q = unsigned>
class Approx_Counter
{
public:
Approx_Counter() : n_(0) { }
Approx_Counter& operator++()
{
if (n_ < 2 || rand() % (operator Q()) == 0)
++n_;
return *this;
}
operator Q() const { return n_ < 2 ? n_ : 1 << n_; }
private:
T n_;
};
Then you can use e.g. Approx_Counter<unsigned char, unsigned long>. Swap out rand() for a C++11 generator if you care.
The idea's simple:
when n_ is 0, ++ has definitely not be invoked
when n_ is 1, ++ has definitely been invoked exactly once
when n_ >= 2, it indicates ++ has probably been invoked about 2n_ times
To keep that last implication in line with the number of ++ invocations actually made, each invocation has a 1 in 2n_ chance of actually incrementing n_ again.
Just make sure your rand() or substitute returns values much larger than the largest counter value you want to track, otherwise you'll get rand() % (operator Q()) == 0 too often and increment inappropriately.
That said, having a smaller counter doesn't help much if you have pointers or offsets to it, so you'll want to squeeze the counter into the bucket too, another reason to prefer your own closed hashing implementation if you genuinely need to tighten up memory usage but want to stick with a hash table (a trie is another possibility).
The above is still O(N) in counter space, just with a smaller constant. For genuinely < O(N) options, you need to consider whether/how keys are related, such that incrementing a counter might reasonable impact multiple keys. You've given us no insights in your question to date.
The names probably take up more space than the counters.
How about having a fixed number of counters and only keep the ones with the highest counts, plus some kind of LRU mechanism to allow new counters to rise to the top? I guess it really depends on your use case...

How to run a method in parallel using Julia?

I was reading Parallel Computing docs of Julia, and having never done any parallel coding, I was left wanting a gentler intro. So, I thought of a (probably) simple problem that I couldn't figure out how to code in parallel Julia paradigm.
Let's say I have a matrix/dataframe df from some experiment. Its N rows are variables, and M columns are samples. I have a method pwCorr(..) that calculates pairwise correlation of rows. If I wanted an NxN matrix of all the pairwise correlations, I'd probably run a for-loop that'd iterate for N*N/2 (upper or lower triangle of the matrix) and fill in the values; however, this seems like a perfect thing to parallelize since each of the pwCorr() calls are independent of others. (Am I correct in thinking this way about what can be parallelized, and what cannot?)
To do this, I feel like I'd have to create a DArray that gets filled by a #parallel for loop. And if so, I'm not sure how this can be achieved in Julia. If that's not the right approach, I guess I don't even know where to begin.
This should work, first you need to propagate the top level variable (data) to all the workers:
for pid in workers()
remotecall(pid, x->(global data; data=x; nothing), data)
end
then perform the computation in chunks using the DArray constructor with some fancy indexing:
corrs = DArray((20,20)) do I
out=zeros(length(I[1]),length(I[2]))
for i=I[1], j=I[2]
if i<j
out[i-minimum(I[1])+1,j-minimum(I[2])+1]= 0.0
else
out[i-minimum(I[1])+1,j-minimum(I[2])+1] = cor(vec(data[i,:]), vec(data[j,:]))
end
end
out
end
In more detail, the DArray constructor takes a function which takes a tuple of index ranges and returns a chunk of the resulting matrix which corresponds to those index ranges. In the code above, I is the tuple of ranges with I[1] being the first range. You can see this more clearly with:
julia> DArray((10,10)) do I
println(I)
return zeros(length(I[1]),length(I[2]))
end
From worker 2: (1:10,1:5)
From worker 3: (1:10,6:10)
where you can see it split the array into two chunks on the second axis.
The trickiest part of the example was converting from these 'global' index ranges to local index ranges by subtracting off the minimum element and then adding back 1 for the 1 based indexing of Julia.
Hope that helps!

"Resetting" pseudo-random number generator seed multiple times?

Today, my friend had a thought that setting the seed of a pseudo-random number generator multiple times using the pseudo-random number generated to "make things more randomized".
An example in C#:
// Initiate one with a time-based seed
Random rand = new Random(milliseconds_since_unix_epoch());
// Then loop for a_number_of_times...
for (int i = 0; i < a_number_of_times; i++)
{
// ... to initiate with the next random number generated
rand = new Random(rand.Next());
}
// So is `rand` now really random?
assert(rand.Next() is really_random);
But I was thinking that this could probably increase the chance of getting a repeated seed being used for the pseudo-random number generator.
Will this
make things more randomized,
making it loop through a certain number of seeds used, or
does nothing to the randomness (i.e. neither increase nor decrease)?
Could any expert in pseudo-random number generators give some detailed explanations so that I can convince my friend? I would be happy to see answers explaining further detail in some pseudo-random number generator algorithm.
There are three basic levels of use for pseudorandom numbers. Each level subsumes the one below it.
Unexpected numbers with no particular correlation guarantees. Generators at this level typically have some hidden correlations that might matter to you, or might not.
Statistically-independent number with known non-correlation. These are generally required for numerical simulations.
Cryptographically secure numbers that cannot be guessed. These are always required when security is at issue.
Each of these is deterministic. A random number generator is an algorithm that has some internal state. Applying the algorithm once yields a new internal state and an output number. Seeding the generator means setting up an internal state; it's not always the case that the seed interface allows setting up every possible internal state. As a good rule of thumb, always assume that the default library random() routine operates at only the weakest level, level 1.
To answer your specific question, the algorithm in the question (1) cannot increase the randomness and (2) might decrease it. The expectation of randomness, thus, is strictly lower than seeding it once at the beginning. The reason comes from the possible existence of short iterative cycles. An iterative cycle for a function F is a pair of integers n and k where F^(n) (k) = k, where the exponent is the number of times F is applied. For example, F^(3) (x) = F(F(F(x))). If there's a short iterative cycle, the random numbers will repeat more often than they would otherwise. In the code presented, the iteration function is to seed the generator and then take the first output.
To answer a question you didn't quite ask, but which is relevant to getting an understanding of this, seeding with a millisecond counter makes your generator fail the test of level 3, unguessability. That's because the number of possible milliseconds is cryptographically small, which is a number known to be subject to exhaustive search. As of this writing, 2^50 should be considered cryptographically small. (For what counts as cryptographically large in any year, please find a reputable expert.) Now the number of milliseconds in a century is approximately 2^(41.5), so don't rely on that form of seeding for security purposes.
Your example won't increase the randomness because there is no increase in entropy. It is simply derived from the execution time of the program.
Instead of using something based of the current time, computers maintain an entropy pool, and build it up with data that is statistically random (or at least, unguessable). For example, the timing delay between network packets, or key-strokes, or hard-drive read times.
You should tap into that entropy pool if you want good random numbers. These are known as Cryptographically secure pseudorandom number generators.
In C#, see the Cryptography.RandomNumberGenerator Class for the right way to get a secure random number.
This will not make things more "random".
Our seed determines the random looking but completely determined sequence of numbers that rand.next() gives us.
Instead of making things more random, your code defines a mapping from your initial seed to some final seed, and, given the same initial seed, you will always end up with the same final seed.
Try playing with this code and you will see what I mean (also, here is a link to a version you can run in your browser):
int my_seed = 100; // change my seed to whatever you want
Random rand = new Random(my_seed);
for (int i = 0; i < a_number_of_times; i++)
{
rand = new Random(rand.Next());
}
// does this print the same number every run if we don't change the starting seed?
Console.WriteLine(rand.Next()); // yes, it does
The Random object with this final seed is just like any other Random object. It just took you more time then necessary to create it.

Is this linear search implementation actually useful?

In Matters Computational I found this interesting linear search implementation (it's actually my Java implementation ;-)):
public static int linearSearch(int[] a, int key) {
int high = a.length - 1;
int tmp = a[high];
// put a sentinel at the end of the array
a[high] = key;
int i = 0;
while (a[i] != key) {
i++;
}
// restore original value
a[high] = tmp;
if (i == high && key != tmp) {
return NOT_CONTAINED;
}
return i;
}
It basically uses a sentinel, which is the searched for value, so that you always find the value and don't have to check for array boundaries. The last element is stored in a temp variable, and then the sentinel is placed at the last position. When the value is found (remember, it is always found due to the sentinel), the original element is restored and the index is checked if it represents the last index and is unequal to the searched for value. If that's the case, -1 (NOT_CONTAINED) is returned, otherwise the index.
While I found this implementation really clever, I wonder if it is actually useful. For small arrays, it seems to be always slower, and for large arrays it only seems to be faster when the value is not found. Any ideas?
EDIT
The original implementation was written in C++, so that could make a difference.
It's not thread-safe, for example, you can lose your a[high] value through having a second thread start after the first has changed a[high] to key, so will record key to tmp, and finish after the first thread has restored a[high] to its original value. The second thread will restore a[high] to what it first saw, which was the first thread's key.
It's also not useful in java, since the JVM will include bounds checks on your array, so your while loop is checking that you're not going past the end of your array anyway.
Will you ever notice any speed increase from this? No
Will you notice a lack of readability? Yes
Will you notice an unnecessary array mutation that might cause concurrency issues? Yes
Premature optimization is the root of all evil.
Doesn't seem particularly useful. The "innovation" here is just to get rid of the iteration test by combining it with the match test. Modern processors spend 0 time on iteration checks these days (all the computation and branching gets done in parallel with the match test code).
In any case, binary search kicks the ass of this code on large arrays, and is comparable on small arrays. Linear search is so 1960s.
See also the 'finding a tiger in africa' joke.
Punchline = An experienced programmer places a tiger in cairo so that the search terminates.
A sentinel search goes back to Knuth. It value is that it reduces the number of tests in a loop from two ("does the key match? Am I at the end?") to just one.
Yes, its useful, in the sense that it should significantly reduce search times for modest size unordered arrays, by virtue of eliminating conditional branch mispredictions. This also reduces insertion times (code not shown by the OP) for such arrays, because you don't have to order the items.
If you have larger arrays of ordered items, a binary search will be faster, at the cost of larger insertion time to ensure the array is ordered.
For even larger sets, a hash table will be the fastest.
The real question is what is the distribution of sizes of your arrays?
Yes - it does because while loop doesn't have 2 comparisons as opposed to standard search.
It is twice as fast.It is given as optimization in Knuth Vol 3.

Resources