What I need to do
For each query I need to find the user by device_id, or create a new node if doesn't exist. And for each, I need to update/create a few edges if rows contains certain properties. The load is massive(about 20k per second) and neo4j slows down. Each batch size is exactly 20k. Here is my query:
UNWIND {batch} as row
MERGE (m:User {device_id: row.device_id})
FOREACH (ignore IN CASE WHEN row.type IS NOT NULL THEN [1] ELSE [] END |
MERGE (e:Event {type: row.type})
MERGE (m) -[r:REL]-> (e)
SET r.count = ( CASE r.count WHEN NULL THEN 1 ELSE r.count + 1 END)
)
FOREACH (ignore IN CASE WHEN row.country IS NOT NULL THEN [1] ELSE [] END |
MERGE (c:Country {id: row.country})
MERGE (m) -[:Belongs]-> (c)
)
WITH m, ( CASE row.user_id WHEN NULL THEN m.user_id ELSE row.user_id END) AS user_id
SET m.user_id = user_id
I solved this issue by decreasing the batch size to 5k and running a few of those inside a transaction in parallel before committing.
Related
This should be a very simple requirement. But it seems impossible to implement in DAX.
Data model, User lookup table joined to many "Cards" linked to each user.
I have a measure setup to count rows in CardUser. That is working fine.
<measureA> = count rows in CardUser
I want to create a new measure,
<measureB> = IF(User.boolean = 1,<measureA>, 16)
If User.boolean = 1, I want to return a fixed value of 16. Effectively, bypassing measureA.
I can't simply put User.boolean = 1 in the IF condition, throws an error.
I can modify measureA itself to return 0 if User.boolean = 1
measureA> =
CALCULATE (
COUNTROWS(CardUser),
FILTER ( User.boolean != 1 )
)
This works, but I still can't find a way to return 16 ONLY if User.boolean = 1.
That's easy in DAX, you just need to learn "X" functions (aka "Iterators"):
Measure B =
SUMX( VALUES(User.boolean),
IF(User.Boolean, [Measure A], 16))
VALUES function generates a list of distinct user.boolean values (1, 0 in this case). Then, SUMX iterates this list, and applies IF logic to each record.
Lets say I have a table like so:
{
value = 4
},
{
value = 3
},
{
value = 1
},
{
value = 2
}
and I want to iterate over this and print the value in order so the output is like so:
1
2
3
4
How do I do this, I understand how to use ipairs and pairs, and table.sort, but that only works if using table.insert and the key is valid, I need to be loop over this in order of the value.
I tried a custom function but it simply printed them in the incorrect order.
I have tried:
Creating an index and looping that
Sorting the table (throws error: attempt to perform __lt on table and table)
And a combination of sorts, indexes and other tables that not only didn't work, but also made it very complicated.
I am well and truly stumped.
Sorting the table
This was the right solution.
(throws error: attempt to perform __lt on table and table)
Sounds like you tried to use a < b.
For Lua to be able to sort values, it has to know how to compare them. It knows how to compare numbers and strings, but by default it has idea how to compare two tables. Consider this:
local people = {
{ name = 'fred', age = 43 },
{ name = 'ted', age = 31 },
{ name = 'ned', age = 12 },
}
If I call sort on people, how can Lua know what I intend? I doesn't know what 'age' or 'name' means or which I'd want to use for comparison. I have to tell it.
It's possible to add a metatable to a table which tells Lua what the < operator means for a table, but you can also supply sort with a callback function that tells it how to compare two objects.
You supply sort with a function that receives two values and you return whether the first is "less than" the second, using your knowledge of the tables. In the case of your tables:
table.sort(t, function(a,b) return a.value < b.value end)
for i,entry in ipairs(t) do
print(i,entry.value)
end
If you want to leave the original table unchanged, you could create a custom 'sort by value' iterator like this:
local function valueSort(a,b)
return a.value < b.value;
end
function sortByValue( tbl ) -- use as iterator
-- build new table to sort
local sorted = {};
for i,v in ipairs( tbl ) do sorted[i] = v end;
-- sort new table
table.sort( sorted, valueSort );
-- return iterator
return ipairs( sorted );
end
When sortByValue() is called, it clones tbl to a new sorted table, and then sorts the sorted table. It then hands the sorted table over to ipairs(), and ipairs outputs the iterator to be used by the for loop.
To use:
for i,v in sortByValue( myTable ) do
print(v)
end
While this ensures your original table remains unaltered, it has the downside that each time you do an iteration the iterator has to clone myTable to make a new sorted table, and then table.sort that sorted table.
If performance is vital, you can greatly speed things up by 'caching' the work done by the sortByValue() iterator. Updated code:
local resort, sorted = true;
local function valueSort(a,b)
return a.value < b.value;
end
function sortByValue( tbl ) -- use as iterator
if not sorted then -- rebuild sorted table
sorted = {};
for i,v in ipairs( tbl ) do sorted[i] = v end;
resort = true;
end
if resort then -- sort the 'sorted' table
table.sort( sorted, valueSort );
resort = false;
end
-- return iterator
return ipairs( sorted );
end
Each time you add or remove an element to/from myTable set sorted = nil. This lets the iterator know it needs to rebuild the sorted table (and also re-sort it).
Each time you update a value property within one of the nested tables, set resort = true. This lets the iterator know it has to do a table.sort.
Now, when you use the iterator, it will try and re-use the previous sorted results from the cached sorted table.
If it can't find the sorted table (eg. on first use of the iterator, or because you set sorted = nil to force a rebuild) it will rebuild it. If it sees it needs to resort (eg. on first use, or if the sorted table was rebuilt, or if you set resort = true) then it will resort the sorted table.
New to cascading, trying to find out a way to get top N tuples based on a sort/order. for example, I'd like to know the top 100 first names people are using.
here's what I can do similar in teradata sql:
select top 100 first_name, num_records
from
(select first_name, count(1) as num_records
from table_1
group by first_name) a
order by num_records DESC
Here's similar in hadoop pig
a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;
It seems very easy to do in SQL or Pig, but having a hard time try to find a way to do it in cascading. Please advise!
Assuming you just need the Pipe set up on how to do this:
In Cascading 2.1.6,
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
new Fields("first_name"),
);
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
new Count("num_records"), Fields.All);
firstNamePipe = new GroupBy(firstNamePipe,
new Fields("first_name"),
new Fields("num_records"),
true); //where true is descending order
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records")
new First(Fields.Args, 100), Fields.All)
Where InPipe is formed with your incoming tap that holds the tuple data that you are referencing above. Namely, "first_name". "num_records" is created when new Count() is called.
If you have the "num_records" and "first_name" data in separate taps (tables or files) then you can set up two pipes that point to those two Tap sources and join them using CoGroup.
The definitions I used were are from Cascading 2.1.6:
GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)
Count(Fields fieldDeclaration)
First(Fields fieldDeclaration, int firstN)
Method 1
Use a GroupBy and group them base on the columns required and u can make use of secondary sorting that is provided by the cascading ,by default it provies them in ascending order ,if we want them in descing order we can do them by reverseorder()
To get the TOP n tuples or rows
Its quite simple just use a static variable count in FILTER and increment it by 1 for each tuple count value increases by 1 and check weather it is greater than N
return true when count value is greater than N or else return false
this will provide the ouput with first N tuples
method 2
cascading provides an inbuit function unique which returns firstNbuffer
see the below link
http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html
I have a query that's running slow (in a loop of about 100 it takes 5-10 seconds) and have no clue why. It's simply querying against a List of objects... your help is much appreciated!
I'm basically querying for Schedules that have been assigned to specific managers. It must be from the specified Shifts week OR the first 2 days of next week OR the last 2 days of the previous week.
I tried calculating .AddDays before but that didn't help. When I ran a performance test it highlighted the "from" statement below.
List<Schedule> _schedule = Schedule.GetAll();
List<Shift> _shifts = Shift.GetAll();
// Then later...
List<Schedule> filteredSchedule = (from sch in _schedule
from s in _shifts
where
**sch.ShiftID == s.ShiftID
& (sch.ManagerID == 1 | sch.ManagerID == 2 | sch.ManagerID == 3)
& ((s.ScheduleWeek == shift.ScheduleWeek)
| (s.ScheduleWeek == shift.ScheduleWeek.AddDays(7)
& (s.DayOfWeek == 1 | s.Code == 2))
| (sch.ScheduleWeek == shift.ScheduleWeek.AddDays(-7)
& (s.DayOfWeek == 5 | s.Code == 6)))**
select sch)
.OrderBy(sch => sch.ScheduleWeek)
.ThenBy(sch => sch.DayOfWeek)
.ToList();
First port of call: use && instead of & and || instead of |. Otherwise all the subexpressions in the where clause will be evaluated, even if the answer is already known.
Second port of call: use a join instead of two "from" clauses with a where:
var filteredSchedule = (from sch in _schedule
join s in _shifts on s.ShiftID equals sch.ShiftID
where ... rest of the condition ...
Basically that's going to create a hash of all the shift IDs, so it can quickly look up possible matches for each schedule.
I have a table in SQL database:
ID Data Value
1 1 0.1
1 2 0.4
2 10 0.3
2 11 0.2
3 10 0.5
3 11 0.6
For each unique value in Data, I want to filter out the row with the largest ID. For example: In the table above, I want to filter out the third and fourth row because the fifth and sixth rows have the same Data values but their IDs (3) are larger (2 in the third and fourth row).
I tried this in Linq to Entities:
IQueryable<DerivedRate> test = ObjectContext.DerivedRates.OrderBy(d => d.Data).ThenBy(d => d.ID).SkipWhile((d, index) => (index == size - 1) || (d.ID != ObjectContext.DerivedRates.ElementAt(index + 1).ID));
Basically, I am sorting the list and removing the duplicates by checking if the next element has an identical ID.
However, this doesn't work because SkipWhile(index) and ElementAt(index) aren't supported in Linq to Entities. I don't want to pull the entire gigantic table into an array before sorting it. Is there a way?
You can use the GroupBy and Max function for that.
IQueryable<DerivedRate> test = (from d in ObjectContext.DerivedRates
let grouped = ObjectContext.DerivedRates.GroupBy(dr => dr.Data).First()
where d.Data == grouped.Key && d.ID == grouped.Max(dg => dg.ID)
orderby d.Data
select d);
Femaref's solution is interesting, unfortunately, it doesn't work because an exception is thrown whenever "ObjectContext.DerivedRates.GroupBy(dr => dr.Data).First()" is executed.
His idea has inspired me for another solution, something like this:
var query = from d in ObjectContext.ProviderRates
where d.ValueDate == valueDate && d.RevisionID <= valueDateRevision.RevisionID
group d by d.RateDefID into g
select g.OrderByDescending(dd => dd.RevisionID).FirstOrDefault();
Now this works.