Why does Cypher query performance improve after recreating the Neo4j DB?

I use Neo4j 3.5.13 Community Edition. My application periodically runs a set of Cypher queries like this:
MERGE (unit:unitOfMeasureRegion{ name:"unit_1" })
MERGE (object:object { name: "obj_1" })
MERGE (object)-[:HAS_MEASUREMENT_UNITS]->(unit)
MERGE (object)-[:HAS_ROLE]->(role:someRole)
MERGE (object)<-[:PROVIDED_BY]-(value:SecAttributeValue)-[:IS_PART_OF]->(unit)
SET value.begin_of_time_region=datetime({somedate})
SET value.end_of_time_region=datetime({somedate})
RETURN value.begin_of_time_region, value.end_of_time_region, unit.name, object.name, id(value)
... and:
MATCH (unit)<-[:HAS_MEASUREMENT_UNITS]-(object:object { name: "object_1" })
WITH object, unit MATCH (object)<-[:PROVIDED_BY]-(value:SecondsAttributeValue)-[:IS_PART_OF]->(unit)
WITH object, value, unit
MATCH (object)-[:HAS_ROLE]->(role:someRole)
RETURN value.begin_of_time_region, value.end_of_time_region, unit.name, id(value)
I also created these indexes:
Indexes
ON :object(Name) ONLINE
ON :unitOfMeasureRegion(Name) ONLINE
I clean out the database with MATCH (n) DETACH DELETE n (there were about 70,000 nodes), and from that point some of the queries start to run very slowly: for some labels, e.g. value:SecondsAttributeValue, they are OK, but if I change to another label, e.g. value:MinuteAttributeValue, the queries slow down.
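One way to narrow this down is to PROFILE the same lookup with a fast label and with a slow one and compare the plans (operators and db hits). A minimal sketch, reusing the labels and property names from the queries above; the exact statement to profile is only an illustration:
PROFILE
MATCH (object:object { name: "obj_1" })<-[:PROVIDED_BY]-(value:MinuteAttributeValue)-[:IS_PART_OF]->(unit:unitOfMeasureRegion)
RETURN value.begin_of_time_region, value.end_of_time_region, unit.name, id(value)
// Run the same statement with :SecondsAttributeValue in place of :MinuteAttributeValue
// and compare the reported operators and db hits between the two plans.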
I also set the memory settings in neo4j.conf in accordance with neo4j-admin memrec.
As a workaround I recreated the Neo4j database, and the problem disappeared (all queries run pretty fast).
What could be the reason for this?

Related

Phantom DSL Conditional Update

I have the following conditional update returning false, but when I check the database, the columns I was trying to update have in fact been updated.
def deliver(d: Delivery, placedDate: java.time.LocalDate, locationKey: String, vendorId: String, orderId: String, code: String, courierId: String, courierName: String) = {
update.
where(_.placedDate eqs placedDate).
and(_.locationKey eqs locationKey).
and(_.vendorId eqs vendorId).
and(_.orderId eqs orderId).
modify(_.status setTo "DELIVERED").
and(_.deliveredTime setTo LocalDateTime.now()).
onlyIf(_.status is "COLLECTED").and(_.deliveryCode is code).future().map(_.wasApplied)
}
Thank you
This is a pass-through value for the phantom driver, which means that the DataStax Java Driver underneath is the one generating it. If you want to follow this up, could you please post a full bug on GitHub?
Meanwhile, I would suggest not relying on wasApplied if you are simply trying to test, and instead doing a direct read.
Generate some test data and the expected updated values, perform the update, and compare the final results in Cassandra by reading them back. There are known problems with wasApplied with conditional batch updates, but aside from that I'm expecting this to work.

Ruby neo4j-core mass processing data

Has anyone used Ruby neo4j-core to mass process data? Specifically, I am looking at taking in about 500k lines from a relational database and inserting them via something like:
Neo4j::Session.current.transaction.query
.merge(m: { Person: { token: person_token} })
.merge(i: { IpAddress: { address: ip, country: country,
city: city, state: state } })
.merge(a: { UserToken: { token: token } })
.merge(r: { Referrer: { url: referrer } })
.merge(c: { Country: { name: country } })
.break # This will make sure the query is not reordered
.create_unique("m-[:ACCESSED_FROM]->i")
.create_unique("m-[:ACCESSED_FROM]->a")
.create_unique("m-[:ACCESSED_FROM]->r")
.create_unique("a-[:ACCESSED_FROM]->i")
.create_unique("a-[:ACCESSED_FROM]->r")
.create_unique("i-[:IN]->c")
.exec
However, doing this locally takes hours on hundreds of thousands of events. So far, I have attempted the following:
Wrapping Neo4j::Connection in a ConnectionPool and multi-threading it - I did not see much speed improvements here.
Doing tx = Neo4j::Transaction.new and tx.close every 1000 events processed - looking at a TCP dump, I am not sure this actually does what I expected. It does the exact same requests, with the same frequency, but just has a different response.
With Neo4j::Transaction I see a POST every time the .query(...).exec is called:
Request: {"statements":[{"statement":"MERGE (m:Person{token: {m_Person_token}}) ...{"m_Person_token":"AAA"...,"resultDataContents":["row","REST"]}]}
Response: {"commit":"http://localhost:7474/db/data/transaction/868/commit","results":[{"columns":[],"data":[]}],"transaction":{"expires":"Tue, 10 May 2016 23:19:25 +0000"},"errors":[]}
With Non-Neo4j::Transactions I see the same POST frequency, but this data:
Request: {"query":"MERGE (m:Person{token: {m_Person_token}}) ... {"m_Person_token":"AAA"..."c_Country_name":"United States"}}
Response: {"columns" : [ ], "data" : [ ]}
(Not sure if that is intended behavior, but it looks like less data is transmitted via the non-Neo4j::Transaction technique - quite possibly I am doing something incorrectly.)
Some other ideas I had:
* Post-process into a CSV, SCP it up, and then use the neo4j-import command-line utility (although that seems kinda hacky).
* Combine both of the techniques I tried above.
Has anyone else run into this / have other suggestions?
Ok!
So you're absolutely right. With neo4j-core you can only send one query at a time. With transactions all you're really getting is the ability to rollback. Neo4j does have a nice HTTP JSON API for transactions which allows you to send multiple Cypher requests in the same HTTP request, but neo4j-core doesn't currently support that (I'm working on a refactor for the next major version which will allow this). So there are a number of options:
You can submit your requests via raw HTTP JSON to the APIs. If you still want to use the Query API, you can use the to_cypher and merge_params methods to get the Cypher and params for that (merge_params is a private method currently, so you'd need to use send(:merge_params)).
You can load via CSV as you said. You can either
use the neo4j-import command which allows you to import very fast but requires you to put your CSV in a specific format, requires that you be creating a DB from scratch, and requires that you create indexes/constraints after the fact
use the LOAD CSV command, which isn't as fast, but is still pretty fast (see the sketch after this list).
You can use the neo4apis gem to build a DSL to import your data. The gem will create Cypher queries under the covers and will batch them for performance. See examples of the gem in use via neo4apis-twitter and neo4apis-github
If you are a bit more adventurous, you can use the new Cypher API in neo4j-core via the new_cypher_api branch on the GitHub repo. The README in that branch has some documentation on the API, but also feel free to drop by our Gitter chat room if you have questions on this or anything else.
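As a rough illustration of the LOAD CSV option referenced above: the labels and relationship types are taken from the query in the question, while the CSV file name and column names are made up for the example.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///events.csv' AS row
// Hypothetical columns: person_token, ip, country
MERGE (m:Person { token: row.person_token })
MERGE (i:IpAddress { address: row.ip })
MERGE (c:Country { name: row.country })
MERGE (m)-[:ACCESSED_FROM]->(i)
MERGE (i)-[:IN]->(c)
USING PERIODIC COMMIT keeps the transaction size bounded, so a large import doesn't build up a huge transaction state in memory.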
If you're implementing a solution which is going to make queries like the above, with multiple MERGE clauses, you'll probably want to profile your queries to make sure that you are avoiding the Eager operator (the post describing this is a bit old and newer versions of Neo4j have alleviated some of the need for care, but you can still look for Eager in your PROFILE output).
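For example, assuming a multi-MERGE import statement along the lines of the sketch above (file and column names again hypothetical), the plan can be checked like this:
PROFILE
LOAD CSV WITH HEADERS FROM 'file:///events.csv' AS row
MERGE (m:Person { token: row.person_token })
MERGE (i:IpAddress { address: row.ip })
MERGE (m)-[:ACCESSED_FROM]->(i)
// If an Eager operator appears in the plan, consider splitting the import into
// several passes over the CSV (e.g. one pass per label, then the relationships).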
Also worth a look: Max De Marzi's post on Scaling Cypher Writes

F# project is taking so much time to build

I have created an F# solution and added one class library. There is only one project in the solution, with 5 files and 20 lines of code in each file. Still, it takes more than 2 minutes to build each time.
I have tried cleaning the solution.
I also created a new solution and project and included the same files; it still takes the same time to build.
Note: I first created it as a Console Application and then converted it into a Class Library.
Edit: code sample:
open System
open Configuration
open DBUtil
open Definitions

module DBAccess =
    let GetSeq (sql: string) =
        let db = dbSchema.GetDataContext(connectionString)
        db.DataContext.CommandTimeout <- 0
        (db.DataContext.ExecuteQuery(sql, ""))
    let GetEmployeeByID (id: EMP_PersonalEmpID) =
        GetSeq (String.Format("EXEC [EMP_GetEntityById] {0}", id.EmployeeID)) |> Seq.toList<EMP_PersonalOutput>
    let GetEmployeeListByIDs (id: Emp_PersonalInput) =
        GetSeq (String.Format("EXEC [EMP_GetEntityById] {0}", id.EmployeeID)) |> Seq.toList<EMP_PersonalOutput>
Configuration code snippet:
open Microsoft.FSharp.Data.TypeProviders

module Configuration =
    let connectionString = System.Configuration.ConfigurationManager.ConnectionStrings.["EmpPersonal"].ConnectionString
    // for the database, then the stored procedure, then getting the context, then taking the employee table
    type dbSchema = SqlDataConnection<"", "EmpPersonal">
    //let db = dbSchema.GetDataContext(connectionString)
    type tbEmpPersonal = dbSchema.ServiceTypes.EMP_Personal
Okay, seeing your actual code, I think the main problem is that the type provider connects to the database every time to retrieve the schema. The way to fix this is to cache the schema in a dbml file.
type dbSchema = SqlDataConnection<"connection string...",
LocalSchemaFile = "myDb.dbml",
ForceUpdate = false>
The first time, the TP will connect to the database as usual, but it will also write the schema to myDb.dbml. On subsequent compiles, it will load the schema from myDb.dbml instead of connecting to the database.
Of course, this caching means that changes to the database are not reflected in the types. So every time you need to reload the schema from the database, you can set ForceUpdate to true, do a compile (which will connect to the db), and set it back to false to use the updated myDb.dbml.
Edit: you can even commit the dbml file to your source repository if you want. This has the additional benefit of allowing collaborators who don't have access to a development version of the database to compile the solution anyway.
This answer about NGEN helped me once, but the build time of F# is still terrible compared to C#, just not minutes.

JCascalog/Pail shredding stage works locally, but not in Hadoop

Following the "Big Data" Lambda Architecture book, I've got an incoming directory full of typed Thift Data objects, with a DataPailStructure defined pail.meta file
I take a snapshot of this data:
Pail snapshotPail = newDataPail.snapshot(PailFactory.snapshot);
The incoming files and metadata files are duplicated, and the pail.meta file also has
structure: DataPailStructure
Now I want to shred this data to split it into vertical partitions. As in the book, I create two PailTap objects: one for the Snapshot and the SplitDataStructure, and one for the new Shredded folder.
PailTap source = dataTap(PailFactory.snapshot);
PailTap sink = splitDataTap(PailFactory.shredded);
The /Shredded folder has a pail.meta file with structure: SplitDataPailStructure
Following the instructions, I execute the JCascalog query to force the reducer:
Api.execute(sink, new Subquery(data).predicate(reduced, empty, data));
Now, in local mode, this works fine. There's a "temporary" subfolder created under /Shredded, and this is vertically partitioned with the expected "1/1" structure. In local mode, this then is moved up to the /Shredded folder, and I can consolidate and merge to master without problems.
But running inside Hadoop, it fails at this point, with an error:
cascading.tuple.TupleException: unable to sink into output identifier: /tmp/swa/shredded
...
Caused by: java.lang.IllegalArgumentException: 1/1/part-000000 is not valid with the pail structure {structure=com.hibu.pail.SplitDataPailStructure, args={}, format=SequenceFile} --> [1, _temporary, attempt_1393854491571_12900_r_000000_1, 1, 1] at com.backtype.hadoop.pail.Pail.checkValidStructure(Pail.java:563)
Needless to say, if I change the Shredded Sink structure type to DataPailStructure, then it works fine, but it's a fairly pointless operation, as everything is as it was in the Incoming folder. It's okay for now, as I'm only working with one data type, but this is going to change soon and I'll need that partition.
Any ideas? I didn't want to post all my source code here initially, but I'm almost certainly missing something.

Has ElasticClient.TryConnect been removed from NEST?

Here's a code snippet we've used in the past to ping an Elasticsearch node, just to check if it's there:
Nest.ElasticClient client; // has been initialized
ConnectionStatus connStatus;
client.TryConnect(out connStatus);
var isHealthy = connStatus.Success;
It looks like ElasticClient.TryConnect has been removed in NEST 0.11.5. Is it completely gone or has it just been moved to somewhere else just like MapRaw/CreateIndexRaw?
In case it's been removed, here's what I'm planning to do instead:
Nest.ElasticClient client; // has been initialized
var connectionStatus = client.Connection.GetSync("/");
var isHealthy = connectionStatus.Success;
Looks like this works - or is there a better way to replace TryConnect?
Yes, it has been removed. See the release notes:
https://github.com/Mpdreamz/NEST/releases/tag/0.11.5.0
Excerpt from the release notes:
Removed IsValid and TryConnect()
The first 2 features of ElasticClient I wrote nearly three years ago which seemed like a good idea at the time. TryConnect() and .IsValid() are two confusing ways to check if your node is up, RootNodeInfo() now returns a mapped response of the info elasticsearch returns when you hit a node at the root (version, lucene_version etc), or you can call client.Raw.MainGet() or perhaps even better client.Raw.MainHead() or even client.Connection.HeadSync("/").
You catch my drift: with so many ways of querying the root .IsValid and TryConnect() is just fluff that only introduces confusion.
