How does Polkadot's VRF achieve randomness to shuffle validators?

In contrast to Ethereum, which uses RANDAO (possibly enhanced with a VDF), Polkadot uses a verifiable random function (VRF) to shuffle validators and select potential block proposers for certain slots. Where does the randomness come from, i.e. how does it work specifically?

A verifiable random function is a function that, in pseudocode, can be expressed like so:
(RESULT, PROOF) = VRF(SECRET, INPUT)
That is, for some secret and some input (which can be public), the result is a tuple of RESULT and PROOF, where PROOF can be used by outside observers to verify legitimacy of the VRF RESULT.
In other words, making a "VRF roll" results in a random number and proof that you got that random number, and didn't just pick it.
Every slot (approximately every 6 seconds), every validator runs the VRF. The SECRET is their VRF key: a special key generated by the validator, kept secret, and used only for this purpose. The INPUT is either a specific value from the genesis block, if fewer than 2 epochs exist on the chain, or a hash of all the VRF results from the past 2 epochs.
Once a validator has executed the VRF, the RESULT is compared to a THRESHOLD value which is defined by the protocol. If the RESULT is less than THRESHOLD, the validator is a valid block-proposer candidate for that slot. Otherwise, the validator skips that slot.
This means it is possible for multiple validators to be block-production candidates for the same slot. In that case, the block that gets picked up by other nodes is the one that prevails, as long as it is on the chain containing the most recently finalized block according to the GRANDPA finality gadget. A slot with no block producer is also possible, in which case the AURA consensus takes over. AURA is basically a fallback that picks a random validator for each block; it runs in parallel to BABE and only matters when a slot has no block producers, otherwise it is ignored.
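To make the mechanics concrete, here is a rough Python sketch of the roll-and-compare flow described above. HMAC-SHA256 stands in for a real VRF (it is not one: this "proof" can only be checked by someone who knows the secret), and the threshold and input values are purely illustrative; BABE's actual implementation uses sr25519 VRFs and a protocol-defined threshold formula.

import hashlib, hmac

# Illustrative only: roughly a 1-in-20 chance of being eligible for any given slot.
THRESHOLD = 2**256 // 20

def vrf_roll(secret_key: bytes, randomness: bytes, slot: int):
    # RESULT and PROOF from the pseudocode above. HMAC is NOT a real VRF;
    # it just gives a deterministic, key-dependent "roll" to compare against the threshold.
    msg = randomness + slot.to_bytes(8, "big")
    proof = hmac.new(secret_key, msg, hashlib.sha256).digest()
    result = int.from_bytes(hashlib.sha256(proof).digest(), "big")
    return result, proof

# "randomness" would be the genesis value or the hash of the last 2 epochs' VRF outputs.
result, proof = vrf_roll(b"validator-vrf-secret", b"epoch-randomness", slot=1234)
print("eligible to author this slot" if result < THRESHOLD else "skip this slot")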

Related

Solidity - Generate unpredictable random number that does not depend on input

I know that the "how to generate random number" in solidity is a very common question. However, after reading the great majority of answers I did not find one to fit my case.
A short description of what I want to do is: I have a list of objects that each have a unique id, a number. I need to produce a list that contains 25% of those objects, randomly selected each time the function is called. The person calling the function cannot be depended on to provide input that might somehow predictably influence the resulting list.
The only answer I found that gives a secure random number was Here. However, it depends on input coming from the participants and it is meant to address a gambling scenario. I cannot use it in my implementation.
All other cases mention that the number generated is going to be predictable, and even some of those depend on a singular input to produce a single random number. Once again, does not help me.
Summarising, I need a function that will give me multiple, non-predictable, random numbers.
Thanks for any help.
Here is an option:
function rand()
    public
    view
    returns (uint256)
{
    // Mix several block variables and the caller address into one hash.
    // Note: `now` is an alias for block.timestamp (deprecated in Solidity >= 0.7).
    uint256 seed = uint256(keccak256(abi.encodePacked(
        block.timestamp + block.difficulty +
        ((uint256(keccak256(abi.encodePacked(block.coinbase)))) / (now)) +
        block.gaslimit +
        ((uint256(keccak256(abi.encodePacked(msg.sender)))) / (now)) +
        block.number
    )));
    // Equivalent to seed % 1000: keep only the last three decimal digits.
    return (seed - ((seed / 1000) * 1000));
}
It generates a random number between 0 and 999, and it is hard for an ordinary user to predict (a similar pattern has been used by some famous Dapps like Fomo3D), though keep in mind that miners and contracts calling in the same block can still influence or foresee these block variables.
Smart contracts are deterministic, so basically every function is predictable: if we know the input, we can (and should be able to) know the output. And you cannot get a random number without any input; almost every language generates a "pseudo-random number" by seeding from the clock. This means you will not get a truly random number on a blockchain using any simple method.
There are many interesting approaches to generating random numbers in smart contracts, e.g. using a DAO, an oracle, etc., but they all come with trade-offs.
So in conclusion, there is no method that does exactly what you are looking for. You have to sacrifice something.
:(
100% randomness is definitely impossible on Ethereum. The reason is that when distributed nodes rebuild the blockchain from scratch, they reconstruct the state by running every single transaction ever created on the chain, and all of them have to arrive at exactly the same final state. To make that possible, randomness is completely forbidden in the Ethereum Virtual Machine; otherwise each execution of the exact same code could yield a different result, which would make it impossible to reach a common final state among all participants of the network.
That being said, there are projects like RANDAO that aim to create trustworthy pseudorandomness on the blockchain.
In any case, there are approaches to achieve pseudorandomness, two of the most important being commit-reveal techniques and using an oracle (or a combination of both).
As an example off the top of my head: you could use Oraclize to periodically call a trusted external JSON API that returns pseudorandom numbers, and verify on the contract that the call has really been performed.
Of course the downside of these methods is that you and/or your users will have to spend more gas executing the smart contracts, but it's in my opinion a fair price for the huge benefits in security.
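To illustrate the commit-reveal idea mentioned above, here is a minimal off-chain Python sketch; an on-chain version would store the commitments and reveals in contract storage across two phases, and all names here are made up for illustration.

import hashlib, secrets

# Phase 1: each participant publishes commit = H(secret || salt).
def commit(secret: bytes, salt: bytes) -> str:
    return hashlib.sha256(secret + salt).hexdigest()

# Phase 2: participants reveal (secret, salt); anyone can check the commitment.
def verify(commitment: str, secret: bytes, salt: bytes) -> bool:
    return commit(secret, salt) == commitment

# Once all secrets are revealed, hash them together into one shared random value.
def combine(revealed_secrets) -> int:
    h = hashlib.sha256()
    for s in revealed_secrets:
        h.update(s)
    return int.from_bytes(h.digest(), "big") % 1000   # e.g. a number in 0-999

s1, salt1 = secrets.token_bytes(32), secrets.token_bytes(16)
s2, salt2 = secrets.token_bytes(32), secrets.token_bytes(16)
c1, c2 = commit(s1, salt1), commit(s2, salt2)
assert verify(c1, s1, salt1) and verify(c2, s2, salt2)
print(combine([s1, s2]))

Note that in an on-chain setting you would also need deposits and reveal deadlines, so that a participant cannot bias the result simply by refusing to reveal.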

How does the function add_disk_randomness work?

I am trying to understand the add_disk_randomness function from the linux kernel. I read a few papers, but they don't describe it very well. This code is from /drivers/char/random.c:
add_timer_randomness(disk->random, 0x100 + disk_devt(disk));
disk_devt(disk) holds the variable disk->devt. I understand it as a unique block device number. But what are the major and minor device numbers?
Then the block device number and the hex 0x100 are added together.
Then we also collect the time value disk->random. Is this the seek time for each block?
These two values will be passed to the function add_timer_randomness. It would be nice to get an example with values.
The first parameter of add_timer_randomness is a pointer to struct timer_rand_state. You can confirm this by checking struct gendisk.
timer_rand_state from random.c is reproduced below
/* There is one of these per entropy source */
struct timer_rand_state {
    cycles_t last_time;
    long last_delta, last_delta2;
};
This struct stores the timestamp of the last input event as well as previous "deltas". add_timer_randomness first gets the current time (measured in jiffies), then reads last_time (also in jiffies), then overwrites last_time with the first value.
The first, second, and third order "deltas" are tracked as a means of estimating entropy. The main source of entropy from hard disk events is the timing of those events. More data is hashed into the entropy pool, but it doesn't contribute to entropy estimates. (It is important not to overestimate how unpredictable the data you hash in is; otherwise your entropy pool, and therefore your RNG output too, may be predictable. Underestimating entropy, on the other hand, cannot make RNG output more predictable. It is always better to use a pessimistic estimator in this respect. That is why data that doesn't contribute to entropy estimates is still hashed into the entropy pool.)
Delta is the time between two events (the difference between timestamps). The second order delta is the difference between two consecutive deltas, and the third order delta is the difference between two consecutive second order deltas. The timer_rand_state pointer is the memory location that tracks the previous timestamp and deltas; delta3 does not need to be stored.
The entropy estimate from this timing data is based on the logarithm of the smallest absolute value among deltas one, two, and three. (Not exactly the logarithm: it's always an integer, it's always rounded down by one bit, and if the value you're taking the almost-logarithm of is zero, the result is also zero.)
Say you have a device used as an entropy source that generates a new event every 50ms. The delta will always be 50ms, so the second order delta is always zero. Since one of the three deltas is zero, this prevents this device's timings from being relied on as a significant entropy source. The entropy estimator successfully fails to overestimate input entropy, so even if this device is used as an entropy source it won't "poison" the entropy pool with predictability.
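To make the delta bookkeeping concrete, here is a rough Python rendering of the estimator as described above. It is simplified and not the kernel's exact code; the shift-by-one and the cap mirror the C implementation only loosely.

class TimerRandState:
    # Mirrors struct timer_rand_state: previous timestamp and deltas.
    def __init__(self):
        self.last_time = 0
        self.last_delta = 0
        self.last_delta2 = 0

def estimate_entropy_bits(state, now):
    """Update the stored deltas for one event and return an entropy estimate in bits."""
    delta = now - state.last_time
    state.last_time = now
    delta2 = delta - state.last_delta
    state.last_delta = delta
    delta3 = delta2 - state.last_delta2
    state.last_delta2 = delta2
    # Use the smallest absolute difference, so any single regular pattern
    # (e.g. a constant interval) drives the estimate to zero.
    smallest = min(abs(delta), abs(delta2), abs(delta3))
    # "Almost-logarithm": highest set bit after dropping one bit, capped at 11 bits.
    return min((smallest >> 1).bit_length(), 11)

# A device that fires exactly every 50 ticks contributes (almost) no estimated entropy:
s = TimerRandState()
print([estimate_entropy_bits(s, t) for t in range(50, 500, 50)])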
The entropy estimate isn't based on any formal mathematics. We can't construct an accurate model of the entropy source because we don't know what it is. We don't know exactly what hardware a user's computer will have or exactly how it will behave in an unknown environment. We just want to know that if we add one to the (estimated) entropy counter, then we've hashed at least one bit of entropy worth of unpredictable data into the entropy pool. Extra data besides the timings is hashed into the pool without increasing the entropy counter, so we hope that if the timer-based entropy estimator sometimes overestimates, there may be some unpredictability in the non-timer-based data that we didn't account for. (And if that's the case, your RNG is still safe.)
I'm sure that sounds unconvincing, but I don't know how to help that. I tried my best to explain the relevant parts of the random.c code. Even if I could mind meld and provide some intuition for how the process works it probably would still be unsatisfying.

Algorithm for concurrent queue (single consumer, multiple producers) based on shared dictionary data structure

I am looking for a queue algorithm that fulfills the following properties:
Processes communicate using only a shared dictionary (key-value-store)
Does not use any atomic operations other than load and store (no CAS, for example)
Supports multiple producers
Supports a single consumer
Producers can die at any time and queue must remain operational
The consumer can also die at any time and be restarted later, but there will never be more than one consumer-process running at a time
This is meant as a general question about a suitable algorithm, since I'd like to use it in a couple of different scenarios. But to help visualize the requirements, here is an example use-case:
I have a website with two pages: producer.html and consumer.html
producer.html can be opened in multiple tabs simultaneously
Each producer.html adds events to the queue
One copy of consumer.html is open and consumes these events (to aggregate and stream them to a webserver, for example)
If the multiple producer-tabs are opened by the user rather than the page, these tabs do not have references to each other available, so the usual communication methods (postMessage or calling directly into the other tab's JS code) are out. One of the ways they can still communicate with each other is via LocalStorage as suggested here: Javascript; communication between tabs/windows with same origin. But LocalStorage is not "thread-safe" as detailed here.
Note: There may be other ways to implement cross-tab communication in the browser (Flash, ...), but these are NOT the aim of this question as they won't translate to my other use-cases. This is really just an example use-case for the general queue algorithm that I am trying to find.
A couple more parameters:
The number of producers will never be very large (10s or 100s maybe), so the scaling of the number of reads and writes needed with respect to the number of producers is not really a concern.
I don't know before hand how many producers I might have and there is no immediately obvious way to assign a number or index to them. (Many mutex algorithms (Lamport's Bakery, Eisenberg&McGuire, Szymański's, ...) maintain an array of state for each process, which wouldn't necessarily be a natural approach here, although I do not want to exclude these approaches ex ante, if they can be implemented using the shared dictionary in some way...)
The algorithm should be 100% reliable. So, I'd like to avoid things like the delay in Lamport's first Fast Mutex algorithm (page 2 in the PDF) since I don't have any kind of real-time guarantees.
It would be very helpful if the queue was FIFO, but it's not strictly required.
The algorithm should not be encumbered by any patents, etc.
Update:
The Two-Lock Concurrent Queue Algorithm by Michael and Scott looks like it could work, but I would need two things to implement it:
A locking mechanism using the shared dictionary that can survive the crash of a lock-holder
A reliable way to allocate a new node (if I move the allocation into the locked section, I could just generate new random keys until I find one that's not in use yet, but there might be a better way?)
Update 2:
It seems, I wasn't being specific enough about the dictionary:
It's really nothing more than a trivial key-value-store. It provides the functions get(key) to read the value of a key, put(key, value) to change the value of a key, and delete(key) to remove a key. In some of my use-cases, I can also iterate over keys, but if possible, I'd like to avoid it for generality. Keys are arbitrary and the producers and consumers can create or calculate them as needed. The dictionary does not provide any facilities for automatically generating unique keys.
Examples are HTML LocalStorage, Google AppEngine's Datastore, a Java Map, a Python dictionary, or even a file-system with only a single directory (where the keys would be the file-names and the values the content of the files).
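For concreteness, here is the tiny Python rendering of the interface I have in mind (a plain dict is one of the examples above); the algorithm is only allowed to use these three operations:

class SharedDictionary:
    """The only primitives the queue algorithm may rely on: get, put, delete."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)      # returns None if the key is absent
    def put(self, key, value):
        self._data[key] = value
    def delete(self, key):
        self._data.pop(key, None)

In the real use-cases this wrapper would sit over LocalStorage, a Datastore, the file-system, etc., rather than an in-memory dict.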
After quite a bit of further reading and sleeping on things for a night, I came up with one way that should be able to accomplish what I need, but it might not be the most elegant:
The paper Wait-Free Algorithms for Fast, Long-Lived Renaming by Moir and Anderson generalizes Lamport's Fast Mutex Algorithm #2 (page 6 here) into the following building block (Figure 2):
When n processes enter this section of code, at most one of them will stop, at most n-1 will move right and at most n-1 will move down.
In Lamport's algorithm, stopping means the process acquired the lock, whereas moving right or down simply sends the process back to the beginning of this section of code. To release the lock, a process simply sets Y back to false. (Not quite correct, actually... See "Update" below...)
The big problem with this is that if any of the processes ever die while holding the lock (i.e. before releasing it), the block will simply stay locked forever.
Another problem is that every process needs to be assigned a unique process ID p.
The locked-forever problem can be fixed by borrowing an idea from Moir and Anderson, namely to send processes that end up moving right or down into a different building block rather than back to this one, leading to a structure like this (Figure 3 in the paper):
Except that in this case, I won't be using this grid to assign process IDs as M&A did (although I could probably solve the problem of the unique values for p with this). Instead, every box in the grid will correspond to a very simple queue. If a process stops on a box, it acquired the tail-lock for the corresponding queue (e.g. as per the algorithm by Michael and Scott) and proceeds to enqueue a new element to that queue. Upon completion, it sets the Y value of the box back to false to allow other processes to use this queue. This way, if there is high contention or if processes die before releasing locks, new queues will be created dynamically as needed.
The consumer-process doesn't need to worry about locking the heads of the queues when dequeuing elements, since it's the only process to ever do so. So, it simply traverses the tree of boxes to find all queues and trivially helps itself to their contained elements. One thing to note is that while each individual queue will be FIFO, there is no synchronization between the queues, so the combined queue will not necessarily be FIFO.
If we now change the boolean Y to a time-stamp (or null/0 to indicate false), the consumer can also expire locks after some safe timeout to re-activate dead queues.
A note about implementation using the dictionary:
The shared variables X and Y can be entries in the dictionary with key-names X_123 and Y_123, where 123 is the number of the box.
p can simply be any unique random string and will be stored as the value of key X_123.
The boolean or time-stamp is also simply stored as the value of key Y_123. The producer-processes interpret a missing entry for Y_123 as false or null/0.
The box-numbers 123 need to be calculated from the move-pattern. One way to do this would be to start with 1 in the top-left corner. If the process stops in that box, we're done. If not, the current number (starting with 1) is shifted left by 1 (i.e. multiplied by 2) and, if the process moved down, also incremented by 1. Smaller (and fewer) numbers can be calculated with a different numbering scheme (I still need to work it out), but this one should work.
The queues then consist of one entry with key H_123 that holds the index of the current head of the queue in its value and one entry with key T_123 that holds the index of the tail. Both default to 0 if they don't exist.
To enqueue an item into queue 123, the tail index is read from T_123 (let's say it yields 48) and an entry with key Q_123_48 is put into the dictionary with its value containing the enqueued item. After, T_123 is incremented by 1.
After the item is enqueued, the Y_123 entry is set back to false or null/0 (not deleted!)
To dequeue an item, the head index is read from H_123 (let's say it yields 39) and compared to the tail index T_123. If it is smaller, an item is available at Q_123_39, which is then read and deleted from the dictionary. After, H_123 is incremented by 1.
To traverse the box-tree, the consumer starts with the box in the top left corner. For each box (e.g. 123), if a key Y_123 exists in the dictionary (even if it contains values null/0 or false), the consumer dequeues items from the corresponding queue, and then recursively moves right and down to the adjacent boxes. If no key Y_123 exists, this box hasn't been used by any processes yet and doesn't need to be considered (and neither do the boxes below or to its right).
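Putting the key-naming conventions above into code, a rough Python sketch of the per-box queue operations and the consumer's traversal might look like this (the box-claiming/locking step from the grid is omitted, and the child box numbers follow the doubling scheme described above):

def enqueue(d, box, item):
    # Producer side; the caller is assumed to have already claimed this box's tail lock.
    tail = d.get(f"T_{box}") or 0
    d.put(f"Q_{box}_{tail}", item)
    d.put(f"T_{box}", tail + 1)
    d.put(f"Y_{box}", 0)                 # release: set Y back to null/0, don't delete it

def dequeue(d, box):
    # Consumer side; safe without a lock because there is only ever one consumer.
    head = d.get(f"H_{box}") or 0
    tail = d.get(f"T_{box}") or 0
    if head >= tail:
        return None
    item = d.get(f"Q_{box}_{head}")
    d.delete(f"Q_{box}_{head}")
    d.put(f"H_{box}", head + 1)
    return item

def drain(d, box=1):
    # Visit a box only if Y_<box> exists; then recurse right (2*box) and down (2*box + 1).
    if d.get(f"Y_{box}") is None:
        return []
    items = []
    while (item := dequeue(d, box)) is not None:
        items.append(item)
    return items + drain(d, 2 * box) + drain(d, 2 * box + 1)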
I haven't actually implemented this yet, but I'll do that next. I just wanted to post this already to see if it could inspire other approaches or if anyone can see anything wrong with this idea.
Update:
I just noticed one complication: It is possible that if two processes are trying to acquire the lock for a queue simultaneously, both will fail and move on to the next block. This will leave that queue locked forever as no-one will be left to set Y back to false or null/0.
This is the reason why the "Long-Lived Renaming" algorithm by M&A as well as Lamport's algorithm #2 use an array of Y-values in which every process has its own entry that it resets also if it moves on to another block. Y is then only considered false if all entries are false.
Since I don't know before-hand how many processes I will have, I could implement this only if the dictionary had some way of enumerating keys (the keys would then be Y_123_456 where 456 is the value of p for each process).
But, with rare contention and the above described timeout-mechanism for reactivating dead queues, the issue might lead to only a little bit of memory inefficiency, rather than a major problem.
Update 2:
A better way to label the boxes would be this pattern:
If we call the total number of moves n (counting the move into the top left box also, i.e. n ≥ 1) and the number of moves to the right r, then the box-number can be calculated using
box = (n × (n - 1))/2 + r
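For instance, a quick Python check of this labelling (box numbers are zero-based, with the top-left corner as box 0):

def box_number(n, r):
    # n = total number of moves (>= 1, counting the move into the top-left box),
    # r = number of moves that went right; boxes on one anti-diagonal get consecutive numbers.
    return n * (n - 1) // 2 + r

# n=1: box 0 (top-left); n=2: boxes 1 and 2; n=3: boxes 3, 4, 5; ...
assert [box_number(2, r) for r in range(2)] == [1, 2]
assert [box_number(3, r) for r in range(3)] == [3, 4, 5]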
Just use an RDBMS. It's pretty simple in MS SQL; for PostgreSQL you'd have to use the RETURNING keyword, and for MySQL you'd probably have to use triggers.
CREATE TABLE Q ([Key] BIGINT IDENTITY(1,1) PRIMARY KEY, [Message] NVARCHAR(4000));

-- Enqueue: insert the message and return the generated key.
INSERT INTO Q ([Message]) OUTPUT inserted.* VALUES (@message);

-- Dequeue: remove one row, skipping rows locked by other transactions, and return it.
DELETE TOP(1) FROM Q WITH (READPAST) OUTPUT deleted.*;
If you were really hoping for an algorithmic solution, just use a ring buffer.
const int MAX_Q_SIZE = 20000000;
static string[] Q = new string[MAX_Q_SIZE];
static long ProducerID = 0;
static long ConsumerID = 0;

public static long Produce(string message) {
    // Atomically claim the next write slot and store the message there.
    long key = Interlocked.Increment(ref ProducerID);
    int idx = (int)(key % MAX_Q_SIZE);
    Q[idx] = message;
    return key;
}

public static string Consume() {
    // Atomically claim the next read slot and return whatever is stored there.
    long key = Interlocked.Increment(ref ConsumerID);
    int idx = (int)(key % MAX_Q_SIZE);
    string message = Q[idx];
    return message;
}

Comparing secret data without giving away source

Issue:
Company A has secret data they don't want to give away to company B.
Company B has secret data they don't want to give away to company A.
The secret data is IP addresses on both sides.
But the two companies want to know the number of overlapping IPs they have (IP addresses that both companies have in the database).
Without using a third party I can't think of a way to solve this issue without one party compromising their secret data set. Is there any type of hashing algo written to solve this problem?
First I'll describe a simple but not very secure idea. Then I'll describe a way that I think it can be easily made much more secure. The basic idea is to have each company send an encoding of a one-way function to the other company.
Sending Programs
As a warm-up, let's first suppose that one company (let's say A) develops an ordinary computer program in some language and sends it to B; B will then run it, supplying its own list of email addresses as input, and the program will report how many of them are also used by A. At this point, B knows how many email addresses it shares with A. Then the process can be repeated, but with the roles of A and B reversed.
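As a toy illustration of this warm-up in Python (obviously not secure, since B can read A's data right out of the program, salted hashes notwithstanding; the addresses and salt below are made up):

import hashlib

SALT = b"example-salt"                                       # chosen by A; illustrative
A_HASHES = {hashlib.sha256(SALT + a.encode()).hexdigest()
            for a in ["alice@a.example", "bob@a.example"]}    # A's addresses, baked in

def count_shared(b_addresses):
    """The 'program' A sends to B: reports how many of B's addresses A also has."""
    return sum(hashlib.sha256(SALT + b.encode()).hexdigest() in A_HASHES
               for b in b_addresses)

print(count_shared(["bob@a.example", "carol@b.example"]))     # -> 1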
Sending SAT Instances
Implementing this program straightforwardly in a normal programming language would yield a program that is almost trivially easy to reverse-engineer. To mitigate this, first, instead of having the program report the count directly, let's reformulate the problem as a decision problem: Does the other company have at least k of the emails in the input? (This involves choosing some value k to test for; of course, if both parties agree then the whole procedure can be performed for many different values of k. (But see the last section for possible ramifications.)) Now the program can be represented instead as a SAT instance that takes as input (some bitstring encoding of) a list of email addresses, and outputs a single bit that indicates whether k or more of them also belong to the company that created the instance.
It's computationally easy to supply inputs to a SAT instance and read off the output bit, but when the instance is large, it's (in principle) very difficult to go in "the other direction" -- that is, to find a satisfying assignment of inputs, i.e., a list of email addresses that will drive the output bit to 1: SAT being an NP-hard problem, all known exact techniques take time exponential in the problem size.
Making it Harder with Hashing
[EDIT: Actually there are many more than (n choose k) possible hashes to be ORed together, since any valid subsequence (with gaps allowed) in the list of email addresses that contains at least k shared ones needs to turn the output bit on. If each email address takes at most b bits, then there are much more than 2^((n-k)b)*(n choose k) possibilities. It's probably only feasible to sample a small fraction of them, and I don't know if unsampled ones can be somehow turned into "don't-cares"...]
The SAT instance I propose here would certainly be very large, as it would have to be a disjunction (OR) of all (n choose k) possible allowed bitstrings. (Let's assume that email addresses are required to be listed in some particular order, to wipe off an n-factorial factor.) However it has a very regular structure that might make it amenable to analysis that could dramatically reduce the time required to solve it. To get around this, all we need to do is to require the receiver to hash the original input and supply this hash value as input instead. The resulting SAT instance will still look like the disjunction (OR) of (n choose k) possible valid bitstrings (which now represent hashes of lists of strings, rather than raw lists of strings) -- but, by choosing a hash size large enough and applying some logic minimisation to the resulting instance, I'm confident that any remaining telltale patterns can be removed. (If anyone with more knowledge in the area can confirm or deny this, please edit or comment.)
Possible Attacks
One weakness of this approach is that nothing stops the receiver from "running" (supplying inputs to) the SAT instance many times. So, choosing k too low allows the receiver to easily isolate the email addresses shared with the sender by rerunning the SAT instance many times using different k-combinations of their own addresses, and dummy values (e.g. invalid email addresses) for the remaining input bits. E.g. if k=2, then the receiver can simply try running all n^2 pairs of its own email addresses and invalid email addresses for the rest until a pair is found that turns the output bit on; either of these email addresses can then be paired with all remaining email addresses to detect them in linear time.
You should be able to use homomorphic encryption to carry out the computation. I imagine creating something like bitmasks on both sides, performing the encryption, then performing an XOR of the result. I think this source points to some information on which encryption schemes support XOR.

What is a seed in relation to a random number generation algorithm and why is computer time used to create this seed more often than not?

I read that seeds are used to initialize random number generators. But it seems like the randomness of the seed doesn't matter much for getting good randomness from the generator. So I want to understand: what is a seed, actually? Why is it called that? And lastly, why is the computer's time so often used to generate such seeds?
A pseudo-random number generator produces a sequence of numbers. It isn't truly random; it is generally a mathematical calculation that produces output matching some desirable distribution and showing no obvious patterns. In order to produce such a sequence, the generator must store state so it can generate the next number in the sequence. The state is updated each time using some part of the output from the previous step.
Seeding explicitly initialises this state. A 'seed' is a starting point, from which something grows. In this case, a sequence of numbers.
This can be used either to always generate the same sequence (by using a known constant seed), which is useful for having deterministic behaviour. This is good for debugging, for some network applications, cryptography, etc.
Or, in situations where you want the behaviour to be unpredictable (always different each time you run a program, a card game perhaps), you can seed with a number likely to be continually changing, such as time.
The 'randomness' of the sequence does not depend on the seed chosen, though it does depend on not reseeding the sequence.
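A small Python example of both uses of a seed (a constant seed for reproducibility, a clock-based seed for run-to-run variation):

import random, time

# A fixed seed reproduces the exact same sequence on every run (handy for debugging).
rng = random.Random(42)
print([rng.randint(0, 9) for _ in range(5)])

# Seeding from the clock gives a different starting state (and sequence) on nearly every run.
rng = random.Random(time.time_ns())
print([rng.randint(0, 9) for _ in range(5)])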

Resources