Neo4j/Cypher - Randomly match nodes

I need to seed a Neo4j database and have random Persons join random Organizations. I have the following Cypher query:
MATCH (p:Person), (o:Organization)
WITH p, o
WHERE rand() < 0.1
MERGE (p)-[:MEMBER_OF]->(o)
The problem is that this query is giving each person a 10% chance to join all organizations. How can I get this query to generate a random number for every combination of Persons and Organizations?

It's odd that the planner executed it that way. To fix it, project out a random number with each combination and do the filtering afterwards:
MATCH (p:Person), (o:Organization)
WITH p, o, rand() as random
WHERE random < 0.1
MERGE (p)-[:MEMBER_OF]->(o)
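As a quick sanity check (a sketch, not part of the original answer), you can compare the number of memberships created against the total number of Person/Organization pairs; the ratio should land near 0.1:
MATCH (p:Person) WITH count(p) AS people
MATCH (o:Organization) WITH people, count(o) AS orgs
MATCH ()-[r:MEMBER_OF]->()
RETURN people * orgs AS pairs, count(r) AS memberships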

Related

Neo4j: Difference between rand() and rand() in with clause when matching random nodes

I found here that I can select random nodes from Neo4j using the following queries:
MATCH (a:Person) RETURN a ORDER BY rand() limit 10
MATCH (a:Person) with a, rand() as rnd RETURN a ORDER BY rnd limit 10
Both queries seem to do the same thing, but when I try to match random nodes that are in a relationship with a given node, I get different results:
The following query always returns the same nodes (the nodes are not randomly selected):
MATCH (p:Person{user_id: '1'})-[r:REVIEW]->(m:Movie)
return m order by rand() limit 10
...but when I use rand() in a WITH clause, I do get random nodes:
MATCH (p:Person{user_id: '1'})-[r:REVIEW]->(m:Movie)
with m, rand() as rnd
return m order by rnd limit 10
Any idea why rand() behaves differently in a WITH clause in the second query but not in the first?
It's important to understand that using rand() in the ORDER BY like this isn't doing what you think it's doing. It's not picking a random number per row, it's ordering by a single number.
It's similar to a query like:
MATCH (p:Person)
RETURN p
ORDER BY 5
Feel free to switch up the number; it doesn't change the ordering, because sorting every row by the same constant value leaves the order unchanged.
But when you project out a random number per row in a WITH clause, you're no longer ordering by a single number for all rows, but by a variable that is different for each row.

Efficient sampling of a fixed number of rows in BigQuery

I have a large dataset of size N, and want to get a (uniformly) random sample of size n. This question offers two possible solutions:
SELECT foo FROM mytable WHERE RAND() < n/N
→ This is fast, but doesn't give me exactly n rows (only approximately).
SELECT foo, RAND() as r FROM mytable ORDER BY r LIMIT n
→ This requires sorting all N rows, which seems unnecessary and wasteful (especially if n << N).
Is there a solution that combines the advantages of both? I imagine I could use the first solution to select 2n rows, then sort this smaller dataset, but it's sort of ugly and not guaranteed to work, so I'm wondering whether there's a better option.
I compared the two queries' execution times using BigQuery standard SQL with the natality sample dataset (137,826,763 rows), taking a sample of size n from the source_year column. The queries were executed without using cached results.
Query1:
SELECT source_year
FROM `bigquery-public-data.samples.natality`
WHERE RAND() < n/137826763
Query2:
SELECT source_year, rand() AS r
FROM `bigquery-public-data.samples.natality`
ORDER BY r
LIMIT n
Result:
n         Query1   Query2
1000      ~2.5s    ~2.5s
10000     ~3s      ~3s
100000    ~3s      ~4s
1000000   ~4.5s    ~15s
For n <= 10^5 the difference is ~1s, and for n >= 10^6 the execution times differ significantly. The cause seems to be that when LIMIT is added to the query, the ORDER BY runs on multiple workers. See the original answer provided by Mikhail Berlyant.
I thought your proposal to combine both queries could be a possible solution. Therefore I compared the execution time for the combined query:
New Query:
SELECT source_year, rand() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < 2*n/137826763)
ORDER BY r
LIMIT n
Result:
n         Query1   New Query
1000      ~2.5s    ~3s
10000     ~3s      ~3s
100000    ~3s      ~3s
1000000   ~4.5s    ~6s
In this case the execution times differ by <= 1.5s for n <= 10^6. It is a good idea to select n + some_rows rows in the subquery instead of 2n rows, where some_rows is a constant large enough to get more than n rows.
Regarding what you said about “not guaranteed to work”, I understand that you are worried the new query might not retrieve exactly n rows. If some_rows is large enough, the subquery will (with overwhelming probability) return more than n rows, and the outer query will then return exactly n rows.
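For example, a sketch of that variant for n = 1000000, with an arbitrarily chosen some_rows = 50000 (not a tuned value), could be:
SELECT source_year, rand() AS r
FROM (
  SELECT source_year
  FROM `bigquery-public-data.samples.natality`
  WHERE RAND() < (1000000 + 50000)/137826763)
ORDER BY r
LIMIT 1000000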
To summarize, the combined query is not as fast as Query1, but it gets exactly n rows and is faster than Query2, so it could be a solution for uniformly random samples. I want to point out that if ORDER BY is not specified, the BigQuery output is non-deterministic, which means you might receive a different result each time you execute the query. If you execute the following query several times without using cached results, you will get different results.
SELECT *
FROM `bigquery-samples.wikipedia_benchmark.Wiki1B`
LIMIT 5
Therefore, depending on how random you need the samples to be, this may be a better solution.

Best algorithm for netting orders

I am building a marketplace, and I want to build a matching mechanism for market participants' orders.
For instance I receive these orders:
A buys 50
B buys 100
C sells 50
D sells 20
These can be represented as a List<Order>, where Order is a class with Participant, BuySell, and Amount.
I want to create a Match function, that outputs 2 things:
A set of unmatched orders (List<Order>)
A set of matched orders (List<MatchedOrder>, where MatchedOrder has Buyer, Seller, and Amount)
The constraint is to minimize the number of orders (unmatched and matched) while leaving no possible match undone (i.e., in the end only buy orders or only sell orders can remain unmatched, not both).
So in the example above the result would be:
A buys 50 from C
B buys 20 from D
B buys 80 (outstanding)
This seems like a fairly complex algorithm to write, but one that must be very common in practice. Any pointers on where to look?
You can model this as a flow problem in a bipartite graph: every selling node is on the left and every buying node is on the right, with a source node connected to the sellers and a sink node connected to the buyers.
Then you must find the maximum amount of flow you can pass from source to sink.
You can use any maximum flow algorithms you want, e.g. Ford Fulkerson. To minimize the number of orders, you can use a Maximum Flow/Min Cost algorithm. There are a number of techniques to do that, including applying Cycle Canceling after finding a normal MaxFlow solution.
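As an illustration of the modelling only (the Edge struct and node numbering are assumptions made for this sketch; the amounts come from the question's example), the network could be set up like this before handing it to whichever max-flow implementation you pick:
#include <vector>

// One directed edge of the flow network; a max-flow routine (e.g. Ford-Fulkerson)
// would additionally track how much flow is currently pushed along it.
struct Edge {
    int from, to, capacity;
};

int main() {
    // Node numbering: 0 = source, 1 = C (sells 50), 2 = D (sells 20),
    //                 3 = A (buys 50), 4 = B (buys 100), 5 = sink.
    std::vector<Edge> edges = {
        {0, 1, 50},   // source -> seller, capacity = amount offered
        {0, 2, 20},
        {1, 3, 50},   // seller -> buyer; the capacity only needs to be
        {1, 4, 50},   // "large enough", the seller's own amount is used here
        {2, 3, 20},
        {2, 4, 20},
        {3, 5, 50},   // buyer -> sink, capacity = amount wanted
        {4, 5, 100},
    };
    // Feed `edges` to a max-flow (or min-cost max-flow) routine; the
    // seller -> buyer edges that end up carrying flow are the matched orders.
    return 0;
}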
After running the algorithm, the seller-to-buyer edges that carry flow in the resulting network give you the matched orders and their amounts.
Create a WithRemainingQuantity structure with 2 members: a pointer o to an order and an integer storing the unmatched quantity.
Consider two List<WithRemainingQuantity>, one for buys (Bq) and one for sells (Sq), both sorted by descending quantity of the contained order.
The algorithm matches the heads of the two queues until one of them is empty.
Algorithm (a mix of pseudocode and C++):
struct WithRemainingQuantity
{
    Order* o;            // the original order
    int remainingQty;    // initialised with o->getQty()
};
struct MatchedOrder
{
    Order* orderBuy;
    Order* orderSell;
    int matchedQty = 0;
};
std::list<WithRemainingQuantity> Bq;
std::list<WithRemainingQuantity> Sq;
/*
  Populate Bq and Sq and sort both by descending remainingQty;
  this is what guarantees the minimum number of matches.
*/
std::list<MatchedOrder> l;
while (!Bq.empty() && !Sq.empty())
{
    int matchedQty = std::min(Bq.front().remainingQty, Sq.front().remainingQty);
    l.push_back(MatchedOrder{Bq.front().o, Sq.front().o, matchedQty});
    Bq.front().remainingQty -= matchedQty;
    Sq.front().remainingQty -= matchedQty;
    if (Bq.front().remainingQty == 0)
        Bq.pop_front();
    if (Sq.front().remainingQty == 0)
        Sq.pop_front();
}
The unmatched orders are the remaining orders in Bq or Sq (one of them is necessarily empty when the loop ends, per the while condition).
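A possible shape for the populate-and-sort step that the comment in the block glosses over (allOrders and the accessors getQty()/isBuy() are assumptions about the Order class, not part of the original answer):
for (Order* o : allOrders) {
    // route each order to the buy or sell queue with its full quantity remaining
    (o->isBuy() ? Bq : Sq).push_back(WithRemainingQuantity{o, o->getQty()});
}
auto byQtyDesc = [](const WithRemainingQuantity& a, const WithRemainingQuantity& b) {
    return a.remainingQty > b.remainingQty;
};
Bq.sort(byQtyDesc);   // std::list provides its own sort()
Sq.sort(byQtyDesc);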

How can I do joins given a threshold in Hadoop using PIG

Let's say I have a dataset with following schema:
ItemName (String) , Length (long)
I need to find items that are duplicates based on their length. That's pretty easy to do in PIG:
raw_data = LOAD...dataset
grouped = GROUP raw_data by length
items = FOREACH grouped GENERATE COUNT(raw_data) as count, raw_data.name;
dups = FILTER items BY count > 1;
STORE dups....
The above finds exact duplicates. Given the set below:
a, 100
b, 105
c, 100
It will output 2, (a,c)
Now I need to find duplicates using a threshold. For example, a threshold of 5 would mean matching items whose lengths are within +/- 5 of each other. So the output should look like:
3, (a,b,c)
Any ideas how I can go about doing this?
It is almost like I want PIG to use a UDF as its comparator when it is comparing records during its join...
I think the only way to do what you want is to load the data into two tables and do a cartesian join of the data set onto itself, so that each value can be compared to each other value.
Pseudo-code:
r1 = load dataset
r2 = load dataset
rcross = cross r1, r2
rcross is a cartesian product that will allow you to check the difference in length between each pair.
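A more concrete Pig sketch of that idea (the file path, delimiter, and field names are assumptions):
r1 = LOAD 'dataset' USING PigStorage(',') AS (name:chararray, len:long);
r2 = LOAD 'dataset' USING PigStorage(',') AS (name:chararray, len:long);
rcross = CROSS r1, r2;
-- drop self-pairs and keep only pairs whose lengths differ by at most 5;
-- note each pair will appear twice, once in each order
near = FILTER rcross BY (r1::name != r2::name) AND (ABS(r1::len - r2::len) <= 5);
DUMP near;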
I was solving a similar problem once and came up with a crazy, dirty solution.
It is based on the following lemma:
If |a - b| < r, then there exists an integer x with 0 <= x < r such that
floor((a + x) / r) = floor((b + x) / r)
(from here on I mean integer division and omit the floor() function, i.e. 5/2 = 2)
This lemma is obvious, so I'm not going to prove it here.
Based on this lemma you can do the following join:
RESULT = JOIN A by A.len / r, B By B.len / r
and get some of the pairs satisfying |A.len - B.len| < r, but not all of them.
By doing this r times:
RESULT0 = JOIN A BY A.len / r, B BY B.len / r
RESULT1 = JOIN A BY (A.len + 1) / r, B BY (B.len + 1) / r
...
RESULT{r-1} = JOIN A BY (A.len + r - 1) / r, B BY (B.len + r - 1) / r
you will get all the needed pairs. Of course you will also get more rows than you need, but as I said it's a dirty solution (i.e. it's not optimal, but it works).
The other big disadvantage is that the JOINs have to be generated dynamically, and their number gets large for large r.
Still, it works if you know r and it is rather small (like r = 6 in your case).
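A sketch of what the generated script could look like for r = 6 (paths and field names are assumptions; in practice you would generate the r JOINs programmatically):
A = LOAD 'dataset' USING PigStorage(',') AS (name:chararray, len:long);
B = LOAD 'dataset' USING PigStorage(',') AS (name:chararray, len:long);
J0 = JOIN A BY len / 6,       B BY len / 6;
J1 = JOIN A BY (len + 1) / 6, B BY (len + 1) / 6;
J2 = JOIN A BY (len + 2) / 6, B BY (len + 2) / 6;
J3 = JOIN A BY (len + 3) / 6, B BY (len + 3) / 6;
J4 = JOIN A BY (len + 4) / 6, B BY (len + 4) / 6;
J5 = JOIN A BY (len + 5) / 6, B BY (len + 5) / 6;
unioned = UNION J0, J1, J2, J3, J4, J5;   -- contains duplicates and extra pairs
deduped = DISTINCT unioned;
-- final filter; $1 and $3 are the two len fields after the join
near_pairs = FILTER deduped BY ABS($1 - $3) < 6;
DUMP near_pairs;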
Hope it helps

Choosing row m from two matrices at random

I have two m*n matrices, A and P. I want to randomly choose the same 3 rows from both matrices, e.g. rows m, m+1, m+2 are picked from both matrices. I want to be able to make the calculation U=A-P on the selected subset (i.e. Usub-Psub), rather than before the selection. So far I have only been able to select rows from one matrix, without being able to match it to the other. The code I use for this is:
A=[0,1,1,3,2,4,4,5;0,2,1,1,3,3,5,5;0,3,1,1,4,4,2,5;0,1,1,1,2,2,5,5]
P=[0,0,0,0,0,0,0,0;0,0,0,0,0,0,0,0;0,0,0,0,0,0,0,0;0,0,0,0,0,0,0,0]
U=A-P
k = randperm(size(U,1));
Usub = U(k(1:3),:);
I would first create a function that returns a submatrix containing only three rows, taking an integer argument for the first of the three rows. Then I'd do something like this:
m = number of rows;
randomRow = rand() % (m - 2);   // leave room for the two rows that follow
U = A.sub(randomRow) - P.sub(randomRow);
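In MATLAB itself, a minimal sketch of that idea (picking one random starting row and taking the same three consecutive rows from both matrices) could be:
m = randi(size(A,1) - 2);       % random starting row, leaving room for m+1 and m+2
rows = m:m+2;                   % the same three consecutive rows for both matrices
Usub = A(rows,:) - P(rows,:);   % subtract only the selected rows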
