I have a large dataset of size N, and want to get a (uniformly) random sample of size n. This question offers two possible solutions:
SELECT foo FROM mytable WHERE RAND() < n/N
→ This is fast, but doesn't give me exactly n rows (only approximately).
SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n
→ This requires sorting all N rows, which seems unnecessary and wasteful (especially if n << N).
Is there a solution that combines the advantages of both? I imagine I could use the first solution to select 2n rows, then sort this smaller dataset, but it's sort of ugly and not guaranteed to work, so I'm wondering whether there's a better option.
I compared the two queries' execution times using BigQuery standard SQL with the natality sample dataset (137,826,763 rows), getting a sample of size n for the source_year column. The queries were executed without using cached results.
Query1:
SELECT source_year
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < n/137826763
Query2:
SELECT source_year, rand() AS r
FROM `bigquery-public-data.samples.natality`
ORDER BY r
LIMIT n
Result:
n Query1 Query2
1000 ~2.5s ~2.5s
10000 ~3s ~3s
100000 ~3s ~4s
1000000 ~4.5s ~15s
For n <= 10^5 the difference is ~1s, and for n >= 10^6 the execution times differ significantly. The cause seems to be that when LIMIT is added to the query, the ORDER BY runs on multiple workers. See the original answer provided by Mikhail Berlyant.
I thought your proposal to combine both queries could be a possible solution. Therefore I compared the execution time for the combined query:
New Query:
SELECT source_year, rand() AS r
FROM (
SELECT source_year
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < 2*n/137826763)
ORDER BY r
LIMIT n
Result:
n Query1 New Query
1000 ~2.5s ~3s
10000 ~3s ~3s
100000 ~3s ~3s
1000000 ~4.5s ~6s
The execution times in this case differ by <=1.5s for n <= 10^6. It is a good idea to select n+some_rows rows in the subquery instead of 2n rows, where some_rows is a constant large enough to ensure more than n rows are kept.
Regarding what you said about "not guaranteed to work", I understand that you are worried the new query doesn't retrieve exactly n rows. In this case, if some_rows is large enough, the subquery will keep more than n rows with overwhelming probability, so the outer query will return exactly n rows.
To summarize, the combined query is not as fast as Query1, but it returns exactly n rows and is faster than Query2, so it could be a solution for uniformly random samples. I want to point out that if ORDER BY is not specified, the BigQuery output is non-deterministic, which means you might receive a different result each time you execute the query. If you execute the following query several times without using cached results, you will get different results.
SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5
Therefore, depending on how random you need the samples to be, this may be a better solution.
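Outside SQL, the two-stage idea is easy to simulate. A minimal Python sketch (the population size, n, and the oversampling factor are illustrative, not taken from the benchmark):

```python
import random

def two_stage_sample(population, n, oversample=2.0, rng=random):
    """Stage 1: Bernoulli filter keeping ~oversample*n rows on average.
    Stage 2: order the survivors by a fresh random key and keep exactly n."""
    p = oversample * n / len(population)
    survivors = [x for x in population if rng.random() < p]  # WHERE RAND() < p
    survivors.sort(key=lambda _: rng.random())               # ORDER BY rand()
    return survivors[:n]                                     # LIMIT n

random.seed(0)
sample = two_stage_sample(range(1_000_000), 1000)
print(len(sample))  # exactly 1000 whenever stage 1 keeps at least n rows
```

As in the SQL version, the result has exactly n rows provided the first stage keeps at least n survivors, which is all but certain when the oversampling margin is generous.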
Related
I found here that I can select random nodes from Neo4j using the following queries:
MATCH (a:Person) RETURN a ORDER BY rand() limit 10
MATCH (a:Person) with a, rand() as rnd RETURN a ORDER BY rnd limit 10
Both queries seem to do the same thing, but when I try to match random nodes that are in a relationship with a given node, I get different results:
The following query always returns the same nodes (they are not randomly selected):
MATCH (p:Person{user_id: '1'})-[r:REVIEW]->(m:Movie)
return m order by rand() limit 10
...but when I use rand() in a WITH clause, I do get random nodes:
MATCH (p:Person{user_id: '1'})-[r:REVIEW]->(m:Movie)
with m, rand() as rnd
return m order by rnd limit 10
Any idea why rand() behaves differently in a WITH clause in the second query but not in the first?
It's important to understand that using rand() in the ORDER BY like this isn't doing what you think it's doing. It's not picking a random number per row, it's ordering by a single number.
It's similar to a query like:
MATCH (p:Person)
RETURN p
ORDER BY 5
Feel free to switch up the number. In any case, it doesn't change the ordering, because sorting every row by the same constant leaves the order unchanged.
But when you project out a random number in a WITH clause per row, then you're no longer ordering by a single number for all rows, but by a variable which is different per row.
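The same distinction shows up in any language with a stable sort: a constant key leaves the order untouched, while a per-row key shuffles. A Python analogy (purely illustrative):

```python
import random

rows = list(range(10))

# Like ORDER BY on a single constant: every row gets the same key.
# Python's sort is stable, so the original order is preserved.
constant_key = random.random()
same_order = sorted(rows, key=lambda r: constant_key)

# Like projecting rand() per row in a WITH clause first:
# each row gets its own key, so the result is a random permutation.
keyed = sorted(rows, key=lambda r: random.random())

print(same_order)  # [0, 1, 2, ..., 9] -- unchanged
```

The first sort cannot reorder anything; only the second has a per-row source of randomness to sort by.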
I need to seed a Neo4j database and have random Persons join random Organizations. I have the following Cypher query:
MATCH (p:Person), (o:Organization)
WITH p, o
WHERE rand() < 0.1
MERGE (p)-[:MEMBER_OF]->(o)
The problem is that this query is giving each person a 10% chance to join all organizations. How can I get this query to generate a random number for every combination of Persons and Organizations?
It's odd that the planner executed it that way. To fix this, let's project out a random number with each combination and do the filtering afterwards:
MATCH (p:Person), (o:Organization)
WITH p, o, rand() as random
WHERE random < 0.1
MERGE (p)-[:MEMBER_OF]->(o)
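The pattern, one random draw per combination followed by a filter, can be sketched in Python (the data and the 10% threshold are illustrative):

```python
import itertools
import random

people = [f"p{i}" for i in range(50)]
orgs = [f"o{i}" for i in range(20)]

random.seed(1)
# One random number per (person, organization) pair, filtered afterwards --
# the analogue of `WITH p, o, rand() AS random WHERE random < 0.1`.
memberships = [(p, o) for p, o in itertools.product(people, orgs)
               if random.random() < 0.1]

print(len(memberships))  # roughly 10% of the 1000 pairs
```

Each pair flips its own coin, so one person can join some organizations and skip others, instead of joining all or none.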
I have a database with roughly 3.4 million nodes and want to select a random node.
I tried using something like
MATCH (n)
WHERE rand() <= 0.01
RETURN n
LIMIT 1
but it seems like the algorithm always starts with the same nodes and selects the first one whose random number is below 0.01, which means in most cases the "random" node is one of the first 100 checked nodes.
Is there a better query, to select a completely random one of all my nodes?
You could generate a random ID by multiplying the output of the rand() function by the number of nodes. This should generally return a more random node.
MATCH (n)
WHERE id(n) = toInteger(rand() * 3400000)
Once there are gaps among your node IDs (i.e. they are no longer perfectly contiguous due to deletes), you might miss a few here and there. In that case you could always range the random number +/- a few on either side and return the first row of the result.
WITH toInteger(rand() * 3400000) AS rand_node, 5 AS offset
WITH range(rand_node - offset, rand_node + offset) AS rand_range
MATCH (n)
WHERE id(n) IN rand_range
RETURN n
LIMIT 1
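The id-window trick can be sanity-checked outside Neo4j. A Python sketch, assuming an id space of 10,000 with roughly 10% of the ids deleted (all numbers illustrative):

```python
import random

def random_existing_id(existing_ids, max_id, offset=5, rng=random):
    """Pick a random target id, then scan a +/- offset window for one that
    exists, mirroring `WHERE id(n) IN range(rand_node - offset, rand_node + offset)`.
    Retries with a fresh target if the whole window fell into a gap."""
    while True:
        target = int(rng.random() * max_id)
        window = range(target - offset, target + offset + 1)
        hits = [i for i in window if i in existing_ids]
        if hits:
            return hits[0]

random.seed(7)
# Simulate deletions: keep ~90% of the ids 0..9999.
existing = {i for i in range(10_000) if random.random() < 0.9}
node = random_existing_id(existing, 10_000)
```

One caveat carried over from the Cypher version: ids just after a long gap are slightly more likely to be chosen, so the result is "more random", not perfectly uniform.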
I'm really confused about this: how do I compute processing time from an algorithm's complexity?
the question is:
Let algorithms A of complexity O(n^1.5) and B of complexity O(n log n) process a list of 100 records in TA(100) = 1 and TB(100) = 20 microseconds, respectively. Find their processing times, TA(n) and TB(n), for n records and decide which of them will process a list of n = 100,000,000 records faster.
Anyone keen to help??
First we can deal with A. Its running time has the form TA(n) = C * n^1.5, and we must find the constant C from the initial conditions we are given: 1 microsecond (1e-6 seconds) and 100 records.
Therefore
1µs = C * 100^1.5 = C * 1000
C = 1e-3 µs
We can then substitute in the value of 100,000,000 records and the value we found for C.
This results in a time of 1,000,000,000µs or 1000 seconds.
Try and do this process with the second algorithm in order to tell which one computes the records faster.
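The arithmetic for A can be checked mechanically (times in microseconds; the same template, with a constant solved from TB(100) = 20µs and a chosen log base, applies to B):

```python
# T_A(n) = C * n**1.5, calibrated so that T_A(100) = 1 microsecond.
C = 1.0 / 100 ** 1.5  # 100**1.5 = 1000, so C = 1e-3 microseconds

def T_A(n):
    return C * n ** 1.5

print(T_A(100))                    # 1.0 (the calibration point)
print(T_A(100_000_000) / 1e6)      # ~1000 seconds (1e9 microseconds)
```

This confirms the 1,000,000,000µs = 1000s figure above.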
I have a table with 4*10^8(roughly) records, and I want to get a 4*10^6(exactly) sample of it.
But my way of getting the sample is somewhat special:
Select 1 record from the 4*10^8 records at random (every record has the same probability of being selected).
Repeat step 1 4*10^6 times (it does not matter if a record is selected multiple times).
I thought up a method to solve this:
Generate a table A(num int), where each record holds a single random integer from 1 to n (n is the size of my original table, roughly 4*10^8 as mentioned above).
Load table A as a resource file into every mapper; if the ordinal number of the record currently being processed appears in table A, output the record, otherwise discard it.
I think my method is not so good, because if I want to sample more records from the original table, table A will become very large and can no longer be loaded as a resource file.
So, could anyone please suggest a more elegant algorithm?
I'm not sure what "elegant" means, but perhaps you're interested in something analogous to reservoir sampling. Let k be the size of the sample and initialize a k-element array with nulls. The elements from which we are sampling arrive one by one. When the jth (counting from 1) element arrives, we iterate through the array and, for each cell, replace its contents by the current element independently with probability 1/j.
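Spelled out naively, the cell-by-cell replacement described above looks like this (a direct transcription of the description, before any speedup):

```python
import random

def naive_sample_with_replacement(population, k, rng=random):
    """Sample k elements with replacement in one pass over the stream."""
    sample = [None] * k
    for j, x in enumerate(population, start=1):
        # Each cell is replaced by the j-th element with probability 1/j.
        # For j = 1 the probability is 1, so every cell gets filled.
        for cell in range(k):
            if rng.random() < 1.0 / j:
                sample[cell] = x
    return sample

s = naive_sample_with_replacement(range(1000), 10)
```

Each cell independently ends up holding element j with probability (1/j) * (j/(j+1)) * ... * ((n-1)/n) = 1/n, i.e. uniformly, which is why the scheme is correct.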
Naively, the running time is pretty bad -- to sample k elements from n with replacement costs O(k n). The number of writes into the array, however, is O(k log n) in expectation, because later elements in the stream rarely result in writes. Here's an efficient method based on the exponential distribution (warning: lightly tested Python ahead). The running time is O(n + k log n).
import math
import random

def sample_from(population, k):
    # One-pass sampling of k elements with replacement from a stream.
    sample = []
    for i, x in enumerate(population):
        if i == 0:
            # The first element initially fills every cell.
            sample = [x] * k
        else:
            # Simulate how many of the k cells this element overwrites,
            # using exponential gaps instead of k coin flips per element.
            t = float(k) * math.log(1.0 - 1.0 / float(i + 1))
            while True:
                t -= math.log(1.0 - random.random())  # i.e. t += Exp(1) draw
                if t >= 0.0:
                    break
                sample[random.randrange(k)] = x
    return sample