I thought this would be very easy, because I have a typical graph use case: expanding a node.
This is easy if there are no additional requirements:
MATCH (s:Entity)-[]-(dest) WHERE s._id = 'xxx'
RETURN dest
Problem 1: sometimes there are many children, so I want to limit the number of results returned.
MATCH (s:Entity)-[]-(dest) WHERE s._id = 'xxx'
RETURN dest
LIMIT 100
Additional requirement: also return the ids of each child's own children.
MATCH (s:Entity)-[]-(dest) WHERE s._id = 'xxx'
WITH collect(dest) as childrenSource
LIMIT 100
MATCH (childrenSource)-[]-(childDestination)
RETURN childrenSource as expandNode, collect(childDestination) as childrenIds
LIMIT 100
Problem 2: The limits are in the wrong place, because collect() has already gathered every row before the limit is applied.
Possible solution:
MATCH (s:Entity)-[]-(dest) WHERE s._id = 'xxx'
WITH collect(dest)[..100] as childrenSource
LIMIT 100
MATCH (childrenSource)-[]-(childDestination)
RETURN childrenSource as expandNode, collect(childDestination)[..100] as childrenIds
But I don't think this is a performant solution, because it takes quite a lot of time.
Exact problem description: if I have 1 node with 1000 children, and each child has another 1000 children, I want to execute a query which returns 100 children, each with 100 child ids:
-------------------------------------------------
| node 1 | child id 1_1,.... child id 1_100 |
| node 2 | child id 2_1,.... child id 2_100 |
| ... | ... |
| node 100 | child id 100_1,.. child id 100_100 |
-------------------------------------------------
Other solution: I do a simple expand for the node, and then I call an expand on each child node. But doing 101 queries instead of 1 query does not sound very performant either.
EDIT
As usual, APOC Procedures to the rescue. Using apoc.cypher.run(), you can use LIMIT within a subquery, which lazily loads the expansion only up to your limit.
MATCH (s:Entity)-[]-(dest) WHERE s._id = 'xxx'
WITH dest
LIMIT 100
CALL apoc.cypher.run('
MATCH (dest)-[]-(childDestination)
RETURN childDestination LIMIT 100
', {dest:dest}) YIELD value
RETURN dest as expandNode, COLLECT(value.childDestination) as childrenIds
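To illustrate why limiting inside the per-child subquery helps, here is a minimal Python sketch of the same idea outside Neo4j. The `neighbours` function is a hypothetical stand-in for a relationship traversal (it is not part of Neo4j or APOC), and every node is assumed to have 1000 neighbours, as in the problem description. Because the limit is applied while iterating, the remaining 900 neighbours of each node are never produced.

```python
from itertools import islice


def neighbours(node):
    """Hypothetical traversal: lazily yields 1000 neighbour ids for any node."""
    return (f"{node}_{i}" for i in range(1, 1001))


def expand(root, limit=100):
    """Return up to `limit` children of root, each with up to `limit` child ids."""
    return {
        # islice stops pulling from the generator once `limit` items were taken
        child: list(islice(neighbours(child), limit))
        for child in islice(neighbours(root), limit)
    }


rows = expand("node")
print(len(rows), len(rows["node_1"]))
```

By contrast, collecting everything first and slicing afterwards would materialise all 1000 x 1000 ids before discarding most of them.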
The same apoc.cypher.run() approach, here matching the start node on a title property:
MATCH (entity1)-[rel]-(entity2) WHERE entity1.title = "something"
WITH entity2
LIMIT 100
CALL apoc.cypher.run('
MATCH (entity2)-[]-(childDestination)
RETURN childDestination
LIMIT 100
', {entity2:entity2}) YIELD value
RETURN entity2 as expandNode, COLLECT(value.childDestination) as childrenId
I have to find all paths between two nodes. The length of each path has to be between 1 and 5 (2 and 3 for this example).
So I'm using this query:
profile match p = (a:Station {name : 'X'} ) - [r*2..3] -> (b:Station {name : 'Y'} ) return distinct p
I have an index on :Station(name),
but when I profile this query I get this result:
So the problem is that Neo4j expands every possible relationship for node b and only then filters by name. Is there a way to traverse only the relationships that connect these two specific nodes?
You might want to use allShortestPaths for that, e.g.:
PROFILE MATCH p=allShortestPaths((n:Person {name:'Ian Robinson'})-[r*1..5]-(b:Person {name:'Michal Bachman'}))
RETURN p
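Note that allShortestPaths returns only the paths of minimal length, whereas the question asks for every path within the hop bounds. As a rough illustration of what the bounded pattern enumerates, here is a small Python sketch over a hypothetical edge list (the node names and edges are made up). It assumes directed edges and that no relationship may be used twice in one path, which is how Cypher treats a single MATCH pattern.

```python
def paths_between(edges, start, end, min_hops, max_hops):
    """Enumerate directed paths from start to end using min_hops..max_hops edges."""
    outgoing = {}
    for idx, (src, dst) in enumerate(edges):
        outgoing.setdefault(src, []).append((idx, dst))

    found = []

    def walk(node, used, path):
        if node == end and len(used) >= min_hops:
            found.append(path)
        if len(used) == max_hops:
            return  # hop budget exhausted, stop expanding
        for idx, nxt in outgoing.get(node, []):
            if idx not in used:  # never reuse the same relationship
                walk(nxt, used | {idx}, path + [nxt])

    walk(start, frozenset(), [start])
    return found


# Hypothetical graph: X->A, A->Y, X->Y, A->B, B->Y
edges = [("X", "A"), ("A", "Y"), ("X", "Y"), ("A", "B"), ("B", "Y")]
paths = paths_between(edges, "X", "Y", 2, 3)
print(paths)
```

The direct edge X->Y is excluded here because it has only one hop, while a shortest-path search would return only that edge.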
Many times we are interested in taking the top or bottom of a set (after order by) which has been grouped on certain keys before ordering.
A = FOREACH data
GENERATE x,y,z;
B = DISTINCT A;
C = GROUP B BY (x,y) PARALLEL 11;
D = FOREACH C {
ORDERD = ORDER B BY z DESC;
FIRST_REC = LIMIT ORDERD 1;
GENERATE FLATTEN(FIRST_REC) AS (x,y,z);
};
STORE D INTO 'xyz' USING PigStorage();
The FOREACH ... GENERATE above takes 'forever' to finish and eventually gets killed after 12 hours or so.
The MapReduce job responsible for this spawned, say, 3 maps and 4 reducers; then 1 reducer keeps processing for the entire day and is eventually killed with ERROR 6017, file error.
Is there a way to solve this, or a better way of doing what I want to do?
What is the volume of data involved? Are you sure that your datanode(s) are big enough to handle that amount of data?
If so, instead of an ORDER, I would go for a MAX. That way, only one tuple has to be kept in memory, and it is sufficient because group already contains all the other needed information:
D = FOREACH C GENERATE group, MAX (B.z);
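The reasoning behind preferring MAX over ORDER plus LIMIT 1 can be sketched in Python: a running maximum needs one stored value per (x, y) group, while sorting needs the whole group in memory. The sample rows below are made up for illustration.

```python
def max_per_group(rows):
    """Keep only the largest z seen so far for each (x, y) key."""
    best = {}
    for x, y, z in rows:
        key = (x, y)
        if key not in best or z > best[key]:
            best[key] = z
    return best


# Hypothetical (x, y, z) rows, including a duplicate
rows = [("a", 1, 5), ("a", 1, 9), ("b", 2, 3), ("a", 1, 9), ("b", 2, 1)]
top = max_per_group(rows)
print(top)
```

Since the maximum ignores duplicates, this single pass also makes the earlier DISTINCT step unnecessary for this particular result.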
I am creating a system where users are able to build their own organization structure, meaning that all organizations will most likely be different.
My setup is that an organization consists of different divisions. In my division table I have a value called parent_id that points to the division that is the current division's parent.
A setup might look something like this (Paint drawing):
As you can see from the drawing, divisions 2 and 3 are children of division 1, therefore they both have the value parent_id = 1.
Division 4 is a child of division 2 and has two children (5 and 6).
Now to the tricky part: because of the structure of my system, I need access to all children, and the children's children, of a given root node.
So, for example, if I want to know all of the children of division 1, the result should be [2,3,4,5,6].
Now my question is: how will I find all connected children?
At first I thought of something like this:
root = 1;
while(getChildren(root) != null)
{
}
function getChildren(root)
{
var result = 'select * from division where parent_id = '+root;
if(result != null)
{
root = result;
}
return result;
}
Please note this is only an example of using a while loop to get through the list.
However, this would not work when the statement returns two children.
So my question is: how do I find all children of any root id with the above setup?
You could use a recursive function. Be careful, and keep track of the children you have found, so that if you run into one of them again you stop and raise an error; otherwise you will end up in an infinite loop.
I don't know what language you are using, so here's some pseudocode:
create dictionaryOfDivisions
dictionaryOfDivisions.Add(currentDivision)
GetChildren(currentDivision)

Function GetChildren(thisDivision) {
    theseChildren = GetChildrenFromDB(thisDivision)
    For each child in theseChildren
        If dictionaryOfDivisions.Exists(child)
            'Oops, here's a loop! Error
            Exit
        Else
            dictionaryOfDivisions.Add(child)
            GetChildren(child)
        End If
    Next
}
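The pseudocode above could look like this in Python. This is only a sketch: the `parent_of` mapping is a made-up in-memory stand-in for the division table (division id to parent_id), using the example from the question, and a real version would query the database instead.

```python
def all_descendants(children_of, root):
    """Collect every descendant of root, raising if a division is seen twice."""
    seen = {root}
    found = []

    def visit(division):
        for child in children_of.get(division, []):
            if child in seen:
                # Same check as the pseudocode: a repeat means a loop in the data
                raise ValueError(f"loop detected at division {child}")
            seen.add(child)
            found.append(child)
            visit(child)

    visit(root)
    return found


# Hypothetical table contents from the question: division id -> parent_id
parent_of = {2: 1, 3: 1, 4: 2, 5: 4, 6: 4}
children_of = {}
for child, parent in parent_of.items():
    children_of.setdefault(parent, []).append(child)

descendants = sorted(all_descendants(children_of, 1))
print(descendants)
```

Building the parent-to-children index once avoids issuing one query per division; for very deep hierarchies an explicit stack would avoid Python's recursion limit.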
In Pig Latin, I want to group twice, so as to select rows according to 2 different rules.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the persons whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age;
--Then group by as to select the richest, group by fails :
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter on the nearest and the richest at the same time, but it doesn't work, because the richest people may not be the ones closest to my age.
Another, more realistic example is:
You have demands file like : iddem, idopedem, datedem
You have operations file like : idope,labelope,dateope,idoftheday,infope
I want to return the operations that match demands, such that:
idopedem matches idope.
The dateope must be the nearest to datedem.
If datedem - dateope >= 0, then I must select the operation with the max(idoftheday); otherwise I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH E GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP C BY iddem;
F = FOREACH D GENERATE group, MIN(C.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data for the 2nd example (note that dates are already in Unix format):
You have demands file like :
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
You have operations file like :
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
Result must be like :
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
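As a cross-check of the rules against this sample data, here is a small Python sketch of the intended selection. It assumes the `>= 0` condition used in the script, which is what reproduces the expected rows. One further assumption that the question does not cover: if equally near operations fall on both sides of the demand date, the ones on or before it are preferred.

```python
demands = [(1, "ctr1", 1359460800000), (2, "ctr2", 1354363200000)]
operations = [
    ("ctr0", "toto", 1359460800000, 1, "blabla0"),
    ("ctr0", "tata", 1359460800000, 2, "blabla1"),
    ("ctr1", "toto", 1359460800000, 1, "blabla2"),
    ("ctr1", "tata", 1359460800000, 2, "blabla3"),
    ("ctr2", "toto", 1359460800000, 1, "blabla4"),
    ("ctr2", "tata", 1359460800000, 2, "blabla5"),
    ("ctr3", "toto", 1359460800000, 1, "blabla6"),
    ("ctr3", "tata", 1359460800000, 2, "blabla7"),
]


def select_operations(demands, operations):
    """For each demand, pick the nearest-dated matching operation."""
    selected = []
    for iddem, idopedem, datedem in demands:
        candidates = [op for op in operations if op[0] == idopedem]
        if not candidates:
            continue
        # First pass: keep only operations at the minimal date distance
        deltamin = min(abs(datedem - op[2]) for op in candidates)
        nearest = [op for op in candidates if abs(datedem - op[2]) == deltamin]
        # Second pass: max(idoftheday) if datedem - dateope >= 0, else min
        on_or_before = [op for op in nearest if datedem - op[2] >= 0]
        if on_or_before:
            chosen = max(on_or_before, key=lambda op: op[3])
        else:
            chosen = min(nearest, key=lambda op: op[3])
        selected.append((iddem,) + chosen)
    return selected


selected = select_operations(demands, operations)
for row in selected:
    print(row)
```

The two passes correspond to the two GROUP steps in the Pig script: first narrow each demand to its nearest operations, then break ties on idoftheday.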
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgments. For example, in the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those don't fit your intuition, then modify the formula. Likely a simple quadratic factor won't be sufficient. But with a little experimentation you can hit upon something that works for you.
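The comparisons above can be checked with a short Python version of the same formula. The value of `my_age` (30) is an arbitrary assumption for the check. Note that Python's `/` is floating-point division; in Pig, integer-typed ages would make `ABS($my_age-age)/10` an integer division, so the scores could differ for age gaps that are not multiples of 10.

```python
def score(money, age, my_age):
    """money / (1 + |my_age - age| / 10)^2, as in the Pig expression above."""
    return money / (1 + abs(my_age - age) / 10) ** 2


my_age = 30  # arbitrary value chosen for the check

same_age_100k = score(100_000, 30, my_age)   # no age penalty
ten_years_350k = score(350_000, 40, my_age)  # divided by (1 + 1)^2 = 4
twenty_years_500k = score(500_000, 50, my_age)  # divided by (1 + 2)^2 = 9
same_age_50k = score(50_000, 30, my_age)

print(same_age_100k > ten_years_350k)
print(twenty_years_500k > same_age_50k)
```

Both comparisons print True, matching the preferences described above.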