I have a data set that is made up of a binary sequence, e.g.
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, ...
The probabilities of 0 and 1 (the noise) differ, with 1 being less frequent. I want to know whether these 1s occur in groups or whether they are really just random. How can I tell?
If I feed the data into a randomness test, it will surely tell me that the sequence gravitates heavily toward 0. Would measuring the gaps between 1s be a good test? I am most familiar with Python and C.
Here, the word "random" means not only identically distributed (biased the same way), but also independent (that is, independent of any other choice). In general, randomness tests are more reliable on the first part of this definition ("identically distributed") than on the second ("independent").
In general, you can't tell from a sequence of bits alone whether the process generated them in an independent and identically distributed way, unless you know what that process is. Thus, although you can tell that a given sequence of bits has more zeros than ones, you can't tell whether those bits:
were truly generated independently of any other choice, or
form part of an extremely long periodic sequence that is only "locally random", or
were simply reused from another process, or
were produced in some other way,
without more information on the process.
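That said, the gap measurement you propose is a reasonable practical check: if the 1s were generated independently with a fixed probability p, the gaps between successive 1s should follow a geometric distribution. A minimal Python sketch (the simulated input and P(1) = 0.1 are made up for illustration):

    import random

    def gaps_between_ones(bits):
        """Gap lengths (number of 0s) between successive 1s."""
        positions = [i for i, b in enumerate(bits) if b == 1]
        return [j - i - 1 for i, j in zip(positions, positions[1:])]

    # Under independence with P(1) = p, a gap of length k has
    # probability p * (1 - p)**k, so the mean gap is (1 - p) / p.
    bits = [1 if random.random() < 0.1 else 0 for _ in range(100_000)]
    gaps = gaps_between_ones(bits)
    p = sum(bits) / len(bits)
    print(sum(gaps) / len(gaps), (1 - p) / p)

A fuller check would compare the whole gap histogram against the geometric distribution (e.g., with a chi-squared test); clustered 1s would show up as an excess of very short and very long gaps. Per the caveats above, though, passing such a test is evidence of independence, not proof.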
I understand that a seed is a number used to initialize a pseudo-random number generator. In PyTorch, the torch.get_rng_state documentation states: "Returns the random number generator state as a torch.ByteTensor." When I print it, I get a 1-D tensor of size 5048 whose values are shown below:
tensor([ 80, 78, 248, ..., 0, 0, 0], dtype=torch.uint8)
Why does a seed have 5048 values, and how is this different from the usual seed, which we can get using torch.initial_seed?
It sounds like you're thinking of the seed and the state as equivalent. For older pseudo-random number generators (PRNGs) that was true, but more modern PRNGs tend to work as described here. (The answer in the link was written with respect to the Mersenne Twister, but the concepts apply equally to other generators.)
Why is it a good idea not to have a 32- or 64-bit state space and report the state as the generator's output? Because if you do that, as soon as you see any value repeat, the entire sequence will repeat. PRNGs were designed to be "full cycle," i.e., to iterate through the maximum number of values possible before repeating. This paper showed that the birthday problem could quickly (in O(sqrt(cycle length)) values) identify such PRNGs as non-random. This meant, for instance, that with 32-bit integers you shouldn't use more than ~50000 values before a statistician could call you out with a better than 99% level of confidence. The solution, used by many modern PRNGs, is to have a larger state space and collapse it down to a 32- or 64-bit output. Since multiple states can produce the same output, duplicates will occur in the output stream without the entire stream being replicated. It looks like that's what PyTorch is doing.
Given the larger state space, why allow seeding with a single integer? Convenience. For instance, the Mersenne Twister has a 19,937-bit state space, but most people don't want to enter that much information to kick-start it. You can if you want to, but most people use the front end, which populates the full state space from a single integer input.
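A quick way to see the seed/state distinction in PyTorch itself: saving and restoring the full state tensor reproduces the output stream exactly, while the single-integer seed is just the convenience front end that populates that state.

    import torch

    torch.manual_seed(42)          # one integer populates the whole internal state
    state = torch.get_rng_state()  # the full state: a ByteTensor (~5048 bytes here)

    a = torch.rand(3)
    torch.set_rng_state(state)     # rewind the generator to the saved state
    b = torch.rand(3)
    print(torch.equal(a, b))       # True: identical state, identical stream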
I'm doing a case study of a random number portal. The portal displays a sequence of numbers (1 to 49) that changes every 4:25 (about 4 1/2 minutes) to a new sequence of numbers.
Examples:
Previous stream:
36, 1, 37, 6, 17, 48
Current stream:
45, 4, 49, 30, 41, 16
What will the next stream be?
Can we reverse-engineer the current output streams of numbers to get the next stream?
No. First of all, you specified a random portal, which, by the definition of "random," cannot be predicted from any preceding sequence of output.
If you mean a pseudo-random sequence, then reverse engineering is theoretically possible, but you must have enough knowledge of the RNG (random-number generator) to reduce the possible outputs to 1 out of the 49^6 possible sequences. (You didn't specify that numbers are unique within a stream of 6; if that's another oversight, then it's 49!/(49-6)!; if order is also unimportant, divide again by 6!.)
Now look at the amount of information you've presented here: 12 numbers in a particular sequence. The number of possible continuations consistent with that information is far more than 1.
If you can provide the characteristics of the RNG, and those characteristics are sufficiently restrictive, then perhaps it's possible to determine the future sequence. Until then, the answer remains a resounding NO.
UPDATE per OP's comment
If the application is, indeed, a TRNG (true random-number generator), then there's your answer: see my first paragraph.
If you're trying to implement a linear congruential RNG (e.g., the equation you posted), then simply check the myriad available search hits and pick one that looks good to you. Getting a set of six numbers is simply a matter of calling the generator six times.
Either way, there is still insufficient information to definitively obtain the parameters of even a generic linear congruential RNG. Do you have bounds on the values of a and c? Do you know the range of the X values and how they're converted to the range [1,49]?
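For reference, a linear congruential generator is only a few lines. A minimal Python sketch follows, using the Numerical Recipes constants as one published choice of a, c, and m, with a naive mapping onto the range [1, 49]:

    class LCG:
        """x_{n+1} = (a * x_n + c) % m; constants from Numerical Recipes."""
        def __init__(self, seed, a=1664525, c=1013904223, m=2**32):
            self.x, self.a, self.c, self.m = seed, a, c, m

        def next(self):
            self.x = (self.a * self.x + self.c) % self.m
            return self.x

    gen = LCG(seed=12345)
    # Six calls give a stream of six numbers; if numbers must be
    # unique within a stream, duplicates would have to be rejected.
    print([gen.next() % 49 + 1 for _ in range(6)])

Note that even if the portal used exactly this form, recovering a and c from outputs that have been reduced modulo 49 is considerably harder than recovering them from the raw X values.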
I'm working on an NLP sequence labelling problem. My data consists of variable length sequences (w_1, w_2, ..., w_k) with corresponding labels (l_1, l_2, ..., l_k) (in this case the task is named entity extraction).
I intend to solve the problem using recurrent neural networks. As the sequences are of variable length, I need to pad them (I want batch size > 1). I have the option of either pre-padding them with zeros or post-padding them with zeros, i.e., I make every sequence either (0, 0, ..., w_1, w_2, ..., w_k) or (w_1, w_2, ..., w_k, 0, 0, ..., 0) so that the length of each sequence is the same.
How does the choice between pre- and post padding impact results?
It seems like pre-padding is more common, but I can't find an explanation of why it would be better. Due to the nature of RNNs it feels like an arbitrary choice to me, since they share weights across time steps.
Commonly in RNNs, we take the final output or hidden state and use it to make a prediction (or do whatever task we are trying to do).
If we send a bunch of 0s to the RNN before taking the final output (i.e., post-padding as you describe), then the hidden state of the network at the final word in the sentence would likely get 'flushed out' to some extent by all the zero inputs that come after that word.
So intuitively, this might be why pre-padding is more popular/effective.
This paper (https://arxiv.org/pdf/1903.07288.pdf) studied the effect of padding types on LSTMs and CNNs. The authors found that post-padding achieved substantially lower accuracy (nearly half) than pre-padding in LSTMs, although there wasn't a significant difference for CNNs (post-padding was only slightly worse).
A simple, intuitive explanation for RNNs is that post-padding seems to add noise to what has been learned from the sequence through time, and there are no further timesteps for the RNN to recover from this noise. With pre-padding, however, the RNN is better able to adjust to the added noise of zeros at the beginning as it learns from the sequence through time.
I think more thorough experiments are needed in the community for more detailed, mechanistic explanations of how padding affects performance.
I always recommend using pre-padding over post-padding, even for CNNs, unless the problem specifically requires post-padding.
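For concreteness, both conventions are a one-argument switch if you happen to be using Keras's pad_sequences (other toolkits have similar utilities):

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    seqs = [[4, 7, 2], [9, 1], [3, 5, 8, 6]]

    pre = pad_sequences(seqs, padding='pre')    # zeros before each sequence (default)
    post = pad_sequences(seqs, padding='post')  # zeros after each sequence

    print(pre)   # [[0 4 7 2] [0 0 9 1] [3 5 8 6]]
    print(post)  # [[4 7 2 0] [9 1 0 0] [3 5 8 6]]

In PyTorch, torch.nn.utils.rnn.pad_sequence only post-pads, so there the usual way to keep the RNN from reading the zeros is masking or packed sequences (torch.nn.utils.rnn.pack_padded_sequence).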
I have a situation where I'm performing a calculation over a huge number of rows, and I can really increase the performance if I can eschew a conditional statement.
What I need is: for a given positive, zero, or negative integer, I want the result 1, 0, or -1 respectively.
So if I do col/ABS(col), I will get 1 for a positive number and -1 for a negative number, but of course if col equals 0 then I'll get a division-by-zero error. I can't get an error.
This seems simple enough, but I can't wrap my head around it.
Assuming either two's-complement 32-bit integers or one's complement with no negative zero to worry about, the following works well:
(x>>31) - (-x>>31);
Replace 31 with 63 for 64-bit integers, and so on.
col/max(1, abs(col))
Ugly, but it works. For integers, that is. For floating-point values, where there's no well-defined smallest positive value, you're stuck unless the language allows you to inspect the value as a bit sequence; then you can just do the same with the sign bit and the significand.
Whether this helps optimising anything is highly debatable though. It certainly makes things harder to read.
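A quick Python sanity check of both branch-free formulas (note the shift trick matches the C version only within the 32-bit range, since Python integers are arbitrary precision):

    def sign_shift(x):
        # Port of the C shift trick; valid for -2**31 < x < 2**31.
        return (x >> 31) - (-x >> 31)

    def sign_maxdiv(x):
        # col/max(1, abs(col)); exact integer division, never a zero divisor.
        return x // max(1, abs(x))

    for x in (-100, -1, 0, 1, 100):
        assert sign_shift(x) == sign_maxdiv(x) == (x > 0) - (x < 0)
    print("all sign formulas agree")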
I am doing this in an SSAS tabular model, which doesn't have a MAX function like the one in @bizclop's answer (which helps in many other applications, such as Excel).
I ended up doing the following which was inspired by the accepted answer:
ROUND([col] / (ABS([col]) + 1), 0)
This ended up reducing my query time quite significantly (by 40%) versus IF([col] <> 0, [col]/ABS([col]), 0).
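One subtlety if you prototype this formula outside DAX: DAX's ROUND rounds halves away from zero, while Python's built-in round() rounds halves to even, so a faithful Python check needs an explicit half-away-from-zero rounding (the case col = 1 produces exactly 0.5):

    import math

    def dax_round(v):
        # DAX-style ROUND(v, 0): halves round away from zero.
        return int(math.copysign(math.floor(abs(v) + 0.5), v))

    for col in (-7, -1, 0, 1, 7):
        assert dax_round(col / (abs(col) + 1)) == (col > 0) - (col < 0)
    print("ROUND([col] / (ABS([col]) + 1), 0) reproduces the sign")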
This question is NOT about how to use a language to generate a random number in an arbitrary interval. It is about generating either 0 or 1.
I understand that many random-generator algorithms manipulate a very basic random(0 or 1) function, take a seed from users, and use an algorithm to generate various random numbers as needed.
The question is: how does the CPU generate either 0 or 1? If I throw a coin, I can generate heads or tails; that's because I physically throw the coin and let nature decide. But how does a CPU do it? There must be an action that the CPU performs (like throwing a coin) to get either 0 or 1 randomly, right?
Could anyone tell me about it?
Thanks
(This has several facets and thus several algorithms. Keep in mind that there are many different forms of randomness used for different purposes, but I understand your question to be about actual randomness of the kind used for cryptography.)
The fundamental problem here is that computers are (mostly) deterministic machines. Given the same input in the same state they always yield the same result. However, there are a few ways of actually gathering entropy:
User input. Since users bring outside input into the system, you can derive some bits from it, similar to how you could use radioactive decay or line noise.
Network activity. Again, an outside source of events.
Interrupts generally (which more or less include the first two).
As alluded to in the first item, noise from peripherals, such as audio input or a webcam, can be used.
There is also dedicated hardware that can generate a few hundred MiB of randomness per second. Usually it gives you random numbers directly instead of exposing its internal entropy, though.
How exactly you derive bits from those sources is up to you: you could use the time between events, the actual content of the events, etc. Generally, eliminating bias from entropy sources isn't easy or trivial, and a lot of thought and algorithmic work goes into it (in the case of the aforementioned special hardware, this is all done in hardware, and the code using it doesn't need to care).
Once you have a pool of actually random bits, you can just use them as random numbers (/dev/random on Linux does that). But this has downsides, since there is usually little actual entropy and possibly a much higher demand for random numbers. So you can use algorithms to "stretch" that initial randomness in a manner that makes it still impossible, or at least very difficult, to predict anything about the following numbers (/dev/urandom on Linux, or both /dev/random and /dev/urandom on FreeBSD, do that). Fortuna and Yarrow are so-called cryptographically secure pseudo-random number generators (CSPRNGs), designed with exactly that in mind. You still have a very good guarantee about the quality of the random numbers you generate, but you can produce many more of them before your entropy pool runs out.
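From Python, for example, the portable way to tap the operating system's entropy-backed generator (/dev/urandom on Linux, or the platform equivalent) is os.urandom or the secrets module:

    import os
    import secrets

    raw = os.urandom(16)       # 16 bytes from the OS CSPRNG
    bit = secrets.randbits(1)  # a single cryptographically strong 0-or-1
    print(raw.hex(), bit)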
In any case, the CPU itself cannot give you a random 0 or 1. There's a lot more involved and this usually includes the complete computer system or special hardware built for that purpose.
There is also a second class of computational randomness: plain vanilla pseudo-random number generators (PRNGs). What I said earlier about determinism: this is the embodiment of it. Given the same so-called seed, a PRNG will yield the exact same sequence of numbers every time¹. While this sounds idiotic, it has practical benefits.
Suppose you run a simulation involving lots of random numbers, maybe to simulate interactions between molecules or atoms that involve certain probabilities and unpredictable behaviour. In science you want results anyone can independently verify, given the same setup and procedure (or, with computing, the same algorithms). If you used actual randomness, the only option you would have is to save every single random number used, to make sure others can replicate the results independently.
But with a PRNG all you need to save is the seed and remember what algorithm you used. Others can then get the exact same sequence of pseudo-random numbers independently. Very nice property to have :-)
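In Python, for instance, that reproducibility is just a matter of constructing generators with the same seed (2024 here is an arbitrary example seed):

    import random

    run1 = random.Random(2024)  # the original experiment, seed recorded
    run2 = random.Random(2024)  # an independent replication, same seed

    print([run1.random() for _ in range(3)])
    print([run2.random() for _ in range(3)])  # identical sequence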
Footnotes
¹ This even includes the CSPRNGs mentioned above, but they are designed to be used in a special way that includes regular re-seeding with entropy to overcome that problem.
A CPU can only generate a uniform random number, U(0,1), which ranges from 0 to 1. Mathematically, it is a random variable U on the interval [0,1]. Example draws of a U(0,1) random number would be 0.28100002, 0.34522, 0.7921, etc. Every value between 0 and 1 is equally likely; i.e., the values are equiprobable.
You can generate binary random variates that are either 0 or 1 by mapping a random draw of U(0,1) to 0 if U(0,1) <= 0.5 and to 1 if U(0,1) > 0.5, since in theory an equal number of draws of U(0,1) will fall below 0.5 and above 0.5.
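A sketch of that thresholding in Python, including a biased variant for an arbitrary P(1) = p:

    import random

    def random_bit():
        # Fair bit: threshold a U(0,1) draw at 0.5.
        return 1 if random.random() > 0.5 else 0

    def biased_bit(p):
        # Returns 1 with probability p, 0 otherwise.
        return 1 if random.random() < p else 0

    print([random_bit() for _ in range(10)])
    print([biased_bit(0.1) for _ in range(10)])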