I'm writing a REST API that returns products in JSON from a Postgres database.
I have written integration tests to test which products are returned and this works fine. A requirement has just been introduced to randomly order the products returned.
I've changed my tests to not rely on the order the results come back in. My problem is testing the new random requirement.
I plan on implementing this in the database with Postgres' RANDOM() function. If I were doing this "in code" I could stub the random number generator to always return the same value, but I'm not sure what to do in the database.
How can I test that my new random requirement is working?
I've found a way of doing what I need.
You can set the seed value for Postgres using SETSEED().
If you set the seed before you execute the query that uses RANDOM(), in the same session, the results will come back in the same order every time.
SELECT SETSEED(0.5);
SELECT id, title FROM products ORDER BY RANDOM() LIMIT 2;
The seed only affects the current session, and the generator state moves on after each query, so call SETSEED() again before every query you want to be repeatable.
To test that the data comes back in a random order, we can change the seed value and check that the order changes.
I don't want to test if Postgres' RANDOM() works, but that my code that uses it does.
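To show what that looks like in a test, here is a minimal sketch in Python with psycopg2 (the table and query are from the question; the connection string and helper names are hypothetical):

import psycopg2  # assumed driver; any DB-API client works the same way

def fetch_in_random_order(conn, seed):
    # Seed Postgres' RNG, then run the randomized query in the same session.
    with conn.cursor() as cur:
        cur.execute("SELECT SETSEED(%s);", (seed,))
        cur.execute("SELECT id, title FROM products ORDER BY RANDOM();")
        return cur.fetchall()

def test_random_ordering():
    conn = psycopg2.connect("dbname=test")  # hypothetical test database
    try:
        # Same seed twice -> same order: the query is repeatable under test.
        assert fetch_in_random_order(conn, 0.5) == fetch_in_random_order(conn, 0.5)
        # Different seeds -> (with many rows) a different order, same rows.
        a, b = fetch_in_random_order(conn, 0.5), fetch_in_random_order(conn, 0.9)
        assert sorted(a) == sorted(b) and a != b
    finally:
        conn.close()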
That will depend on your definition of randomness. As a first try, you could issue the same request twice and make sure that the same result set is returned, but in a different order. This of course assumes that your test data does not page; if it does, your test becomes more difficult, as you would probably have to retrieve all pages in order to verify anything.
On second thoughts, paging would probably complicate the whole request, as it would require having the same randomness across several pages.
IMHO, if you want to test randomness, you should find a query that returns few results - 2 would be the ideal number.
You then run the query a large number of times and count the occurrences of the different ordering possibilities. The counts will not all reach the same value, but the frequencies should converge toward 1/n, where n is the number of possible orderings. In fact, you do not want to test the quality of the random generator; all you need is to be sure that you use it correctly. So it is enough to test that you get each possibility at least once over a suitable number of runs.
I would use 100 runs if n <= 10 and n^2 runs if n > 10. For n = 10 and 100 runs, the probability of a given ordering never showing up is less than 3e-5. So run the test once, and run it again if it fails; that should be enough. Of course, if you want to reduce the risk of false detection, simply increase the number of runs... but the tests will be longer...
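As a sketch of that coverage test (run_query is a hypothetical helper that issues the request and returns the two rows in the order the API served them):

from collections import Counter

def test_every_ordering_appears(run_query, runs=100):
    # Coverage test only: with 2 rows and 100 runs, the chance of a given
    # ordering never appearing is 0.5**100 - effectively zero.
    counts = Counter(tuple(map(tuple, run_query())) for _ in range(runs))
    # Two rows have exactly 2 possible orderings; both should show up.
    assert len(counts) == 2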
I know that "how to generate a random number" in Solidity is a very common question. However, after reading the great majority of answers, I did not find one that fits my case.
A short description of what I want to do: I have a list of objects that each have a unique id, a number. I need to produce a list that contains 25% of those objects, randomly selected each time the function is called. The person calling the function cannot be depended on to provide input that will predictably influence the resulting list.
The only answer I found that gives a secure random number is here. However, it depends on input coming from the participants and is meant to address a gambling scenario. I cannot use it in my implementation.
All other cases mention that the number generated will be predictable, and some of those even depend on a singular input to produce a single random number. Once again, that does not help me.
Summarising, I need a function that will give me multiple, non-predictable, random numbers.
Thanks for any help.
Here is an option:
function rand()
    public
    view
    returns (uint256)
{
    // Mix several block- and sender-derived values into one hash-based seed.
    uint256 seed = uint256(keccak256(abi.encodePacked(
        block.timestamp + block.difficulty +
        ((uint256(keccak256(abi.encodePacked(block.coinbase)))) / (now)) +
        block.gaslimit +
        ((uint256(keccak256(abi.encodePacked(msg.sender)))) / (now)) +
        block.number
    )));
    // Equivalent to seed % 1000, i.e. a value in the range 0-999.
    return (seed - ((seed / 1000) * 1000));
}
It generates a random number between 0 and 999, and it is basically impossible to predict (it has been used by some famous DApps like Fomo3D).
Smart contracts are deterministic, so basically every function is predictable: if we know the input, we can (and must be able to) know the output. And you cannot get a random number without any input - almost every language generates "pseudo random numbers" using the clock. This means you will not get a random number on the blockchain using a simple method.
There are many interesting methods to generate random numbers using a smart contract - using a DAO, an oracle, etc. - but they all have some trade-offs.
So, in conclusion, there is no method that does what you are looking for. You need to sacrifice something.
:(
100% randomness is definitely impossible on Ethereum. The reason is that when distributed nodes build the blockchain from scratch, they reconstruct the state by running every single transaction ever created on the blockchain, and all of them have to arrive at exactly the same final state. To make that possible, randomness is totally forbidden in the Ethereum Virtual Machine, since otherwise each execution of the exact same code could yield a different result, which would make it impossible to reach a common final state among all participants of the network.
That being said, there are projects like RanDAO that aim to create trustworthy pseudorandomness on the blockchain.
In any case, there are approaches to achieve pseudorandomness; two of the most important are commit-reveal techniques and using an oracle (or a combination of both).
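To illustrate the commit-reveal side off-chain, here is a minimal sketch in Python (the two-participant setup and SHA-256 commitments are illustrative assumptions, not a production scheme):

import hashlib, secrets

def commit(value, salt):
    # Phase 1: each participant publishes only the hash of a secret.
    return hashlib.sha256(salt + value).hexdigest()

def reveal_ok(commitment, value, salt):
    # Phase 2: a reveal is accepted only if it matches the commitment.
    return commit(value, salt) == commitment

v1, s1 = secrets.token_bytes(32), secrets.token_bytes(16)
v2, s2 = secrets.token_bytes(32), secrets.token_bytes(16)
c1, c2 = commit(v1, s1), commit(v2, s2)

# After all commitments are in, secrets are revealed, verified, and mixed,
# so no single participant controls (or can predict) the outcome.
assert reveal_ok(c1, v1, s1) and reveal_ok(c2, v2, s2)
random_value = int.from_bytes(hashlib.sha256(v1 + v2).digest(), "big") % 1000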
As an example that just occurred to me: you could use Oraclize to call, from time to time, a trusted external JSON API that returns pseudorandom numbers, and verify on the contract that the call has truly been performed.
Of course the downside of these methods is that you and/or your users will have to spend more gas executing the smart contracts, but it's in my opinion a fair price for the huge benefits in security.
I created a counter that counts up from 0 to 9999 and then resets. I use the output of this counter as a value to make unique entries. However, the application needs to find its last created number each time it is restarted. Therefore I am looking for a method which avoids any sort of object storage and relies solely on random number generation.
Something like:
import java.util.Random;

int randomTimeBasedGenerator() {
    Random r = new Random(System.currentTimeMillis());
    // nextInt() % 9999 could be negative; nextInt(10000) is uniform in [0, 9999]
    int num = r.nextInt(10000);
    return num;
}
But what guarantee do I have that this method generates unique numbers? And, if not, how long would it remain unique? Are there any study papers I can look into for this sort of scenario?
Random number generation would be an elegant solution for my situation, if I can at least guarantee it won't repeat within a couple of weeks or months. But random number generation would be useless in my case if no such guarantee exists.
You have no guarantee that the return values of a random number generator remain unique. Random number generators produce sequences that look random, not numbers that are unique. Random numbers will always repeat themselves, sooner or later.
As suggested by @Thilo, UUIDs are unique numbers. But an even better approach in your case might be to set up a lightweight database (SQLite will do) and add a record to a table with incremental IDs. It is not possible to keep track of a process without storing values somewhere.
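To put a number on "sooner or later" for a 10,000-value space like yours, here is a small simulation sketch in Python:

import random

def draws_until_first_repeat(space=10_000):
    # Draw uniformly from `space` values until one repeats.
    seen = set()
    while True:
        n = random.randrange(space)
        if n in seen:
            return len(seen) + 1
        seen.add(n)

trials = [draws_until_first_repeat() for _ in range(1_000)]
# The birthday bound predicts a first repeat after roughly
# sqrt(pi/2 * 10000), i.e. about 125 draws on average, not 10,000.
print(sum(trials) / len(trials))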
I'm attempting to estimate the total amount of results for app engine queries that will return large amounts of results.
In order to do this, I assigned a random floating point number between 0 and 1 to every entity. Then I executed the query for which I wanted to estimate the total results with the following 3 settings:
* I ordered by the random numbers that I had assigned in ascending order
* I set the offset to 1000
* I fetched only one entity
I then plugged the entity's random value that I had assigned for this purpose into the following equation to estimate the total results (since I used 1000 as the offset above, the value of OFFSET would be 1000 in this case):
estimated_total = OFFSET / RANDOM
The idea is that since each entity has a random number assigned to it, and I am sorting by that random number, the entity's random number assignment should be proportionate to the beginning and end of the results with respect to its offset (in this case, 1000).
The problem I am having is that the results I am getting are giving me low estimates, and the lower the offset, the lower the estimate. I had anticipated that the lower the offset, the less accurate the estimate would be, but I thought that the margin of error would fall both above and below the actual number of results.
Below is a chart demonstrating what I am talking about. As you can see, the predictions get more consistent (accurate) as the offset increases from 1000 to 5000. But then the predictions closely follow a fourth-degree polynomial (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).
Am I making a mistake here, or does the standard python random number generator not distribute numbers evenly enough for this purpose?
Thanks!
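For what it's worth, the estimator can be sanity-checked offline. Here is a quick simulation sketch in Python of the scheme described above (60,000 stands in for the actual result count):

import random

ACTUAL_TOTAL = 60_000
OFFSET = 1_000

# Give every "entity" a uniform random value and sort ascending,
# mirroring the ORDER BY random_value query described above.
values = sorted(random.random() for _ in range(ACTUAL_TOTAL))

# offset=1000, fetch one entity -> the entity at index 1000.
r = values[OFFSET]

# The estimate from the question: OFFSET / RANDOM.
print(OFFSET / r)  # lands close to 60,000 on average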
Edit:
It turns out that this problem is due to my mistake. In another part of the program, I was grabbing entities from the beginning of the series, doing an operation, then re-assigning the random number. This resulted in a denser distribution of random numbers towards the end.
I did a little more digging into this concept, fixed the problem, and tried it again on a different query (so the number of results differs from above). I found that this idea can be used to estimate the total results for a query. One thing of note is that the "error" is very similar for offsets that are close together. When I made a scatter chart in Excel, I expected the accuracy of the predictions at each offset to "cloud": offsets at the very beginning would produce a larger, less dense cloud that would converge to a very tiny, dense cloud around the actual value as the offsets got larger. This is not what happened, as you can see below in the chart of how far off the predictions were at each offset. Where I thought there would be a cloud of dots, there is a line instead.
This is a chart of the maximum error after each offset. For example, the maximum error for any offset after 10000 was less than 1%.
When using GAE it makes a lot more sense not to try to do large amounts of work on reads - it's built and optimized for very fast request turnarounds. In this case it's actually more efficient to maintain a count of your results as and when you create the entities.
If you have a standard query, this is fairly easy - just use a sharded counter when creating the entities. You can seed this using a map reduce job to get the initial count.
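To sketch the sharded-counter idea (datastore-agnostic; the dict stands in for the counter entities, and 20 shards is an arbitrary choice):

import random

NUM_SHARDS = 20
shards = {i: 0 for i in range(NUM_SHARDS)}  # stand-in for counter entities

def increment():
    # Pick a random shard so concurrent writers rarely contend on one row.
    shard = random.randrange(NUM_SHARDS)
    shards[shard] += 1  # in GAE this would be a transactional update

def total():
    # Reads sum all shards; cheap compared to contended writes.
    return sum(shards.values())

for _ in range(1234):
    increment()
print(total())  # 1234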
If you have queries that might be dynamic, this is more difficult. If you know the range of possible queries that you might perform, you'd want to create a counter for each query that might run.
If the range of possible queries is infinite, you might want to think of aggregating counters or using them in more creative ways.
If you tell us the query you're trying to run, there might be someone who has a better idea.
Some quick thoughts:
Have you tried the Datastore Statistics API? It may provide fast and accurate results if you don't update your entity set very frequently.
http://code.google.com/appengine/docs/python/datastore/stats.html
[EDIT1.]
I did some math, and I think the estimation method you proposed here can be rephrased as an "order statistic" problem.
http://en.wikipedia.org/wiki/Order_statistic#The_order_statistics_of_the_uniform_distribution
For example:
If the actual number of entities is 60000, the question is equivalent to: what is the probability that your 1000th [2000th, 3000th, ...] sample falls in an interval [l, u] such that the total entity count estimated from it has an acceptable error relative to 60000?
If the acceptable error is 5%, the interval [l, u] will be [0.015873015873015872, 0.017543859649122806] (that is, [1000/63000, 1000/57000]).
I don't think that probability will be very large.
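That probability can actually be computed: the k-th smallest of N uniform(0, 1) samples follows a Beta(k, N - k + 1) distribution. A quick check with SciPy, using the numbers above (N = 60000, k = 1000):

from scipy.stats import beta

N, k = 60_000, 1_000                   # actual entities, rank of the sample
l, u = k / (N * 1.05), k / (N * 0.95)  # the 5% error band from above

# The k-th order statistic of N uniforms is Beta(k, N - k + 1).
dist = beta(k, N - k + 1)
print(dist.cdf(u) - dist.cdf(l))  # P(the estimate lands within 5%)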
This doesn't directly deal with the calculation aspect of your question, but would using the count() method of a query object work for you? Or have you tried that out and found it unsuitable? As per the docs, it's only slightly faster than retrieving all of the data, but on the plus side it would give you the actual number of results.
http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count
When a user adds a new item in my system, I want to produce a unique non-incrementing pseudo-random 7-digit code for that item. The number of items created will only number in the thousands (<10,000).
Because it needs to be unique and no two items will have the same information, I could use a hash, but it needs to be a code they can share with other people - hence the 7 digits.
My original thought was just to loop the generation of a random number, check that it wasn't already used, and if it was, rinse and repeat. I think this is a reasonable if distasteful solution given the low likelihood of collisions.
Responses to this question suggest generating a list of all unused numbers and shuffling them. I could probably keep a list like this in a database, but we're talking 10,000,000 entries for something relatively infrequent.
Does anyone have a better way?
Pick a 7-digit prime number A, and a big prime number B, and
int nth_unique_7_digit_code(int n) {
    // 64-bit math avoids overflow when multiplying by the big prime B.
    return (int)(((long long)n * B) % A);
}
The count of all unique codes generated by this will be A.
If you want to be more "secure", do pow(some_prime_number, n) % A, i.e.
static long long current_code = B;

int get_next_unique_code() {
    // Multiplicative generator modulo A; the result always fits in an int.
    current_code = (B * current_code) % A;
    return (int)current_code;
}
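A quick way to convince yourself the first method is collision-free: since A is prime and B is not a multiple of A, the map n -> (n * B) % A is a permutation of 0..A-1. A small check in Python, with stand-in primes:

A = 1_000_003    # a 7-digit prime
B = 999_999_937  # a big prime

codes = {(n * B) % A for n in range(100_000)}
assert len(codes) == 100_000  # no repeats among the first 100,000 codes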
You could use an incrementing ID and then XOR it on some fixed key.
const int XORCode = 12345;

// XOR with a fixed key is its own inverse, so the same operation
// disguises the incrementing ID and recovers it again.
private int Encode(int id)
{
    return id ^ XORCode;
}

private int Decode(int code)
{
    return code ^ XORCode;
}
Honestly, if you want to generate only a couple of thousand 7-digit codes, while 10 million different codes will be available, I think just generating a random one and checking for a collision is good enough.
The chance of a collision on the first hit will be, in the worst case scenario, about 1 in a thousand, and the computational effort to just generate a new 7-digit code and check for a collision again will be much smaller than keeping a dictionary, or similar solutions.
Using a GUID instead of a 7-digit code as harryovers suggested will also certainly work, but of course a GUID will be slightly harder to remember for your users.
I would suggest using a GUID instead of a 7-digit code, as it will be more unique and you don't have to worry about generating them; .NET will do this for you.
All solutions for a "unique" ID must have a database somewhere: Either one which contains the used IDs or one with the free IDs. As you noticed, the database with free IDs will be pretty big so most often, people use a "used IDs" database and check for collisions.
That said, some databases offer a "random ID" generator/sequence which already returns IDs in a range in random order.
This works by using a random number generator which can produce all numbers in a range without repeating itself, plus the feature that you can save its state somewhere. So what you do is run the generator once, use the ID, and save the new state. For the next run, you load the state and reset the generator to the last state to get the next random ID.
I assume you'll have a table of the generated ones. In that case, I don't see a problem with picking random numbers and checking them against the database, but I wouldn't do it individually. Generating them is cheap, doing the DB query is expensive relative to that. I'd generate 100 or 1,000 at a time and then ask the DB which of those exists. Bet you won't have to do it twice most of the time.
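A sketch of that batching idea in Python (fetch_existing stands in for the single IN (...) query against the codes table):

import random

def claim_new_codes(fetch_existing, batch_size=1000):
    # Cheap: generate candidates locally. Expensive: one DB round trip.
    candidates = {random.randrange(10_000_000) for _ in range(batch_size)}
    taken = fetch_existing(candidates)  # e.g. SELECT code ... WHERE code IN (...)
    return candidates - taken

# Usage with a stand-in "table" of already-used codes:
used = {1234567, 7654321}
fresh = claim_new_codes(lambda cands: cands & used)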
You have <10,000 items, so you need only 4 digits to store a unique number for all items.
Since you have 7 digits, you have 3 digits extra.
If you combine a unique 4-digit sequence number with a 3-digit random number, the result will be unique and look random. You increment the sequence number with every new ID you generate.
You can just append them in any order, or mix them.
seq = abcd,
rnd = ABC
You can create the following IDs:
abcdABC
ABCabcd
aAbBcCd
If you use only one mixing algorithm, you will have unique numbers that look random.
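A minimal sketch of the simplest variant in Python (plain concatenation, with an in-memory stand-in for the persisted sequence):

import random

_seq = 0  # stand-in for a persisted 4-digit sequence counter

def next_id():
    # 4 unique sequence digits + 3 random digits = 7 digits total.
    global _seq
    seq, _seq = _seq, _seq + 1
    return f"{seq:04d}{random.randrange(1000):03d}"

print(next_id(), next_id())  # e.g. 0000417 0001903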
I would try an LFSR (linear feedback shift register). The code is really simple - you can find examples everywhere, e.g. Wikipedia - and even though it's not cryptographically secure, it looks very random. The implementation will also be very fast, since it mainly uses shift operations.
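For example, a 16-bit Galois LFSR in Python (taps as in the Wikipedia example; a maximal-length register steps through all 65535 non-zero states before repeating):

def lfsr16(seed=0xACE1):
    # Galois LFSR, taps 16, 14, 13, 11 (feedback mask 0xB400).
    state = seed
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400
        yield state

gen = lfsr16()
print([next(gen) for _ in range(5)])  # fast, shift-based, random-looking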
With only thousands of items in the database, your original idea seems sound. Checking for the existence of a value in a sorted (indexed) list of a few tens of thousands of items only requires a few data fetches and comparisons.
Pre-generating the list doesn't sound like a good idea, because you will either store way more numbers than are necessary, or you will have to deal with running out of them.
The probability of collisions is very low.
For instance: you have 10^4 users and 10^7 possible IDs.
The probability that you pick a used ID 10 times in a row is then 10^-30.
A chance that low will effectively never come up in anyone's lifetime.
Well, you could ask the user to pick their own 7-digit number and validate it against the population of existing numbers (which you would have stored as they were used up). But I suspect you would be filtering out a lot of responses like 1234567, 7654321, 9999999, and 7777777, and you might need a few regexes to achieve the filtering; plus you'd have to warn the user against such sequences in order not to create a bad, repetitive user input experience.
As much as I like using GUIDs as the unique identifiers in my system, it is not very user-friendly for fields like an order number where a customer may have to repeat that to a customer service representative.
What's a good algorithm to use to generate order number so that it is:
Unique
Not sequential (purely for optics)
Numeric values only (so it can be easily read to a CSR over the phone or keyed in)
< 10 digits
Can be generated in the middle tier without doing a round trip to the database.
UPDATE (12/05/2009)
After carefully reviewing each of the answers posted, we decided to randomize a 9-digit number in the middle tier to be saved in the DB. In the case of a collision, we'll regenerate a new number.
If the middle tier cannot check which order numbers already exist in the database, the best it can do is the equivalent of generating a random number. However, if you generate a random number constrained to be less than 1 billion, you should start worrying about accidental collisions at around sqrt(1 billion); i.e., after a few tens of thousands of entries generated this way, the risk of collisions is material. What if the order number were sequential but disguised, i.e., the next multiple of some large prime number modulo 1 billion - would that meet your requirements?
<Moan>OK, sounds like a classic case of premature optimisation. You imagine a performance problem (oh my god, I have to access the - horror - database to get an order number! My, that might be slow) and end up with a convoluted mess of pseudo-random generators and a ton of duplicate-handling code.</Moan>
One simple practical answer is to run a sequence per customer, the real order number being a composite of customer number and per-customer sequence number. You can easily retrieve the last sequence used when retrieving other information about your customer.
One simple option is to use the date and time, e.g. 0912012359, and if two orders are received in the same minute, simply increment the second order's number by a minute (it doesn't matter if the time is off; it's just an order number).
If you don't want the date to be visible, then calculate it as the number of minutes since a fixed point in time, e.g. when you started taking orders or some other arbitrary date. Again, with the same duplicate check/increment.
Your competitors will glean nothing from this, and it's easy to implement.
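A sketch of that scheme in Python (the epoch constant is a hypothetical "when we started taking orders" date; the set stands in for the duplicate check against stored orders):

import time

ORDERS_EPOCH = 1_230_768_000  # e.g. 2009-01-01 UTC, an arbitrary fixed point
_issued = set()               # stand-in for checking existing order numbers

def next_order_number():
    # Minutes since the fixed point; bump by one on a same-minute duplicate.
    n = int(time.time() - ORDERS_EPOCH) // 60
    while n in _issued:
        n += 1
    _issued.add(n)
    return n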
Maybe you could try generating some unique text using a Markov chain - see here for an example implementation in Python. Maybe use sequential numbers (rather than random ones) to generate the chain, so that (hopefully) each order number is unique.
Just a warning, though - see here for what can possibly happen if you aren't careful with your settings.
One solution would be to take a hash of some field of the order. This will not guarantee that it is distinct from the order numbers of all the other orders, but the likelihood of a collision is very low. Without "doing a round trip to the database", I would imagine it would be challenging to make sure that the order number is unique.
In case you are not familiar with hash functions, the wikipedia page is pretty good.
You could base64-encode a guid. This will meet all your criteria except the "numeric values only" requirement.
Really, though, the correct thing to do here is let the database generate the order number. That may mean creating an order template record that doesn't actually have an order number until the user saves it, or it might be adding the ability to create empty (but perhaps uncommitted) orders.
Use primitive polynomials as a finite field generator.
Your 10 digit requirement is a huge limitation. Consider a two stage approach.
Use a GUID
Prefix the GUID with a 10 digit (or 5 or 4 digit) hash of the GUID.
You will have multiple hits on the hash value. But not that many. The customer service people will very easily be able to figure out which order is in question based on additional information from the customer.
The straightforward answer to most of your bullet points:
Make the first six digits a sequentially-increasing field, and append three digits of hash to the end. Or seven and two, or eight and one, depending on how many orders you envision having to support.
However, you'll still have to call a function on the back-end to reserve a new order number; otherwise, it's impossible to guarantee a non-collision, since there are so few digits.
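A sketch of that split in Python (the salt and the three-digit modulus are illustrative choices; the sequence would come from your back-end reservation call):

import hashlib

def order_number(seq, salt="orders-v1"):
    # Six sequential digits guarantee uniqueness; the three-digit suffix
    # is derived by hashing the sequence, as described above.
    digest = hashlib.sha256(f"{salt}:{seq}".encode()).digest()
    suffix = int.from_bytes(digest[:4], "big") % 1000
    return f"{seq:06d}{suffix:03d}"

print(order_number(1), order_number(2))  # e.g. 000001xxx 000002yyy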
We do TTT-CCCCCC-1A-N1.
T = Circuit type (D1E=DS1 EEL, D1U=DS1 UNE, etc.)
C = 6 Digit Customer ID
1 = The customer's first location
A = The first circuit (A=1, B=2, etc) at this location
N = Order type (N=New, X=Disconnect, etc)
1 = The first order of this kind for this circuit