Loading large cypher file in Neo4J - performance

I'm having some difficulty loading a Cypher file into Neo4J in Windows 10. The file in question is a 175 Mb .cql file filled with more than a million lines of nodes and edges (separated by semicolons) in the Cypher language -- CREATE [node], that sort of thing. For smaller items, I have been using an APOC command in the web browser:
call apoc.cypher.runFile('file:///<file path>')
but this is too slow for a million+ query file. I've created indexes for the nodes, and am currently running it through a command:
neo4j-shell -file <file path> -path localhost
but this is still slow. I was wondering, is there any way to speed up the intake?
Also, note that I am using an recent ONGDB build, rather than straight Neo4J; I do not believe this will make any substantial difference.

If the purpose of your very large CQL file is simply to ingest data, then doing it purely in Cypher is going to be very slow (and may even cause an out-of-memory error).
If you are ingesting into a new neo4j DB, you should consider refactoring the data out of it and using the import command of neo4j-admin tool to efficiently ingest the data.
If you are ingesting into an existing DB, you should consider refactoring the data and logic out of the CQL file and using LOAD CSV.

I ended up ingesting it using cypher-shell. It's still slow, but at least it does finish. Using it requires one to first open a Neo4J console then, in a second command line, use:
type <filepath>\data.cql | bin\cypher-shell.bat -a localhost -u <user> -p <password> --fail-at-end
This works for Windows 10, although it does take a while.

When running a query outside of a transaction, neo4j will automatically start and commit a separate transaction for every query. You can speed things up by starting a transaction at the beginning, and committing and starting a new transactions every few thousand queries (memory use will go up with transaction size, so that's the limiting factor on how large the transactions can be).
Example queries.cypher (with transactions of size 3):
:begin
CREATE(n:PERSON { name: "Homer Simpson" })
CREATE(n:PERSON { name: "Marge Simpson" })
CREATE(n:PERSON { name: "Abe Simpson" })
:commit
:begin
CREATE(n:PERSON { name: "Bart Simpson" })
CREATE(n:PERSON { name: "Lisa Simpson" })
CREATE(n:PERSON { name: "Maggie Simpson" })
:commit
And then run cypher-shell < queries.cypher as usual.

Related

NATS KV history larger than specified when creating bucket

We are using nats with KeyValue store feature (nats KV). We develop go microservices and use the nats go client. We try to leverage the history feature of nats KV with no success yet.
Certain times using nats, we retrieve a larger history than the history specified when creating the KV.
We create the KV using :
kv, _ := js.CreateKeyValue(&nats.KeyValueConfig{
Bucket: "some-bucket",
Description: "store for some-service",
MaxValueSize: 0,
History: 10, // should we ever get more than 10 elements when reading history ?
TTL: TTL,
MaxBytes: 5000000,
Storage: nats.MemoryStorage,
Replicas: 0,
Placement: nil,
})
and we retrieve values using
kv.History("someId")
When we get results larger than the specified History, we get several KeyValueEntrys with the same delta value.
We are quite write intensive, and also reuse quite a lot the same key id :
we write values until a certain point,
call kv.Purge("someId")
and then we may reuse "someId" later on in the process.
Writes and read are asynchronous and concurrent.
Here is our client go.mod regarding nats:
github.com/nats-io/nats-server/v2 v2.8.4
github.com/nats-io/nats.go v1.16.0
and we run a nats server version 2.8.4.
note : I did not go far enough in the KV implementation details but I am worried that this is linked with jetstream. It seems like a watcher is created each time and re-reads all previous values regardless of history size. It leads me to another question : is the kv history feature appropriate for read intensive use cases ?
Thanks for your help or pointers on this matter.

Advice on ElasticSearch Cluster Configuration

Brand new to Elasticsearch. I've been doing tons of reading, but I am hoping that the experts on SO might be able to weigh in on my cluster configuration to see if there is something that I am missing.
Currently I am using ES (1.7.3) to index some very large text files (~700 million lines) per file and looking for one index per file. I am using logstash (V2.1) as my method of choice for indexing the files. Config file is here for my first index:
input {
file {
path => "L:/news/data/*.csv"
start_position => "beginning"
sincedb_path => "C:/logstash-2.1.0/since_db_news.txt"
}
}
filter {
csv {
separator => "|"
columns => ["NewsText", "Place", "Subject", "Time"]
}
mutate {
strip => ["NewsText"]
lowercase => ["NewsText"]
}
}
output {
elasticsearch {
action => "index"
hosts => ["xxx.xxx.x.xxx", "xxx.xxx.x.xxx"]
index => "news"
workers => 2
flush_size => 5000
}
stdout {}
}
My cluster contains 3 boxes running on Windows 10 with each running a single node. ES is not installed as a service and I am only standing up one master node:
Master node: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
Data Node #1: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
This node is currently running the logstash instance with LS_HEAP_SIZE= 3000m
Data Node #2: 16GB RAM, ES_HEAP_SIZE = 8000m, Single Core i7
I have ES currently configured at the default 5 shards + 1 duplicate per index.
At present, each node is configured to write data to an external HD and logs to another.
In my test run, I am averaging 10K events per second with Logstash. My main goal is to optimize the speed at which these files are loaded into ES. I am thinking that I should be closer to 80K based on what I have read.
I have played around with changing the number of workers and flush size, but can't seem to get beyond this threshold. I think I may be missing something fundamental.
My questions are two fold:
1) Is there anything that jumps out as fishy about my cluster configuration or some advice that may improve the process?
2) Would it help if I ran an instance of logstash on each data node indexing separate files?
Thanks so much for any and all help in advance and for taking the time to read.
-Zinga
i'd first have a look at whether logstash or es is the bottleneck in your setup. try ingesting the file without the es output. what throughtput are you getting from plain/naked logstash.
if this is considerably higher then you can continue on the es side of things. A good starter might be:
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
If plain logstash doesn't yield a significant increase in throughput you can can try increasing / parallelising logstash across your machines.
hope that helps

Neo4j 2.0.0 - Poor performance for dev/test in a virtual machine

I have Neo4j server running inside a virtual machine using Ubuntu 13.10 and I am accessing via REST using Cypher queries. The virtual machine has 4 GB of memory allocated to it.
I've changed the open file count to 40000, set the initial JVM heap to 1G and my neo4j.properties file is as follows:
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=100M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=100M
keep_logical_logs=3 days
node_auto_indexing=true
node_keys_indexable=id
I've also updated sysctl based on the Neo4j Linux tuning guide:
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
Since I am testing queries, the basic routine is to run my suite of tests and then delete all of the nodes and run them all again. At the start of each test run, the database has 0 nodes in it. My suite of tests of about 100 queries is taking 22 seconds to run. Basic parameterized creates such as:
CREATE (x:user { email: {param0},
name: {param1},
displayname: {param2},
id: {param3},
href: {param4},
object: {param5} })
CREATE x-[:LOGIN]->(:login { password: {param6},
salt: {param7} } )
are currently taking over 170ms to execute (and that's the average, first query time is 700ms). During a test run, the CPU in the VM never exceeds 50% and memory usage is at a steady 1.4Gb.
Why would creating a single node in an empty database take 170ms? At this point unit testing is becoming almost impossible since it is so slow. This is my first time trying to tune Neo4j so I'm not really sure how to figure out where the problem is or what changes should be made.
Additional Details
I'm using Go 1.2 to make REST calls to the cypher endpoint (http://localhost:7474/db/data/cypher) of a locally installed Neo4j instance. I'm setting the request headers for content-type to "application/json", accept to "application/json" and "X-Stream" to true. I always return either an array of maps or nothing depending on the query.
It seems like the creates are the problem and are taking forever. For example:
2014/01/15 11:35:51 NewUser took 123.314938ms
2014/01/15 11:35:51 NewUser took 156.101784ms
2014/01/15 11:35:52 NewUser took 167.439442ms
2014/01/15 11:35:52 ValidatePassword took 4.287416ms
NewUser creates two new nodes and one relationship and is taking 167ms, while ValidatePassword is a read-only operation and it completes in 4ms. Also note that the three calls to NewUser are identical parameterized queries. While the creates are the big problem, I'm also a little concerned that Neo4j is taking 4ms to just find a labeled node when there are only 100 nodes in the database.
I do not restart the server in between test runs or delete the database. I issue a single delete all nodes query MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r at the end of the test run. Running the same test suite multiple times back to back does not improve the query times.
Are your 100 queries all the same only with different parameters, or actually 100 different queries?
What you see is actually setup work. The parser has to load the parsing rules initially that takes a few ms. Also new queries that have not been seen are compiled, planned and put in the query cache.
So the first query always takes a bit longer. But as you parametrize all subsequent ones should be fast.
Can you confirm that?
I think you see the transactional overhead of flushing the transaction to disk.
Did you try to batch more requests into one? I.e. with the transactional endpoint? Or the /db/data/batch (but I'd rather use the new tx-endpoint /db/data/transaction).
Did you create an index for your lookup property for your validate query?
Can you do me a favor and test your create query without a label? I found some perf issues when testing that myself earlier this week.
Just ran a test with curl
for i in `seq 1 10`; do time curl -i -H content-type:application/json -H accept:application/json -H X-Stream:true -d #perf_test.json http://localhost:7474/db/data/cypher; done
I'm getting between 16 and 30ms per request externally including starting curl
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8; stream=true
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Server: Jetty(9.0.5.v20130815)
{"columns":[],"data":[]}
real 0m0.016s
user 0m0.005s
sys 0m0.005s
Perhaps it is rather the VM (disk or network) or the cross-vm communication?
Did another test with ab and 1000 requests for both endpoints, got a mean of about 5 ms both times.
https://gist.github.com/jexp/8452037

How to find queries not using indexes or slow in mongodb

is there a way to find queries in mongodb that are not using Indexes or are SLOW? In MySQL that is possible with the following settings inside configuration file:
log-queries-not-using-indexes = 1
log_slow_queries = /tmp/slowmysql.log
The equivalent approach in MongoDB would be to use the query profiler to track and diagnose slow queries.
With profiling enabled for a database, slow operations are written to the system.profile capped collection (which by default is 1Mb in size). You can adjust the threshold for slow operations (by default 100ms) using the slowms parameter.
First, you must set up your profiling, specifying what the log level that you want. The 3 options are:
0 - logger off
1 - log slow queries
2 - log all queries
You do this by running your mongod deamon with the --profile options:
mongod --profile 2 --slowms 20
With this, the logs will be written to the system.profile collection, on which you can perform queries as follows:
find all logs in some collection, ordering by ascending timestamp:
db.system.profile.find( { ns:/<db>.<collection>/ } ).sort( { ts: 1 } );
looking for logs of queries with more than 5 milliseconds:
db.system.profile.find( {millis : { $gt : 5 } } ).sort( { ts: 1} );
You can use the following two mongod options. The first option fails queries not using index (V 2.4 only), the second records queries slower than some ms threshold (default is 100ms)
--notablescan
Forbids operations that require a table scan.
--slowms <value>
Defines the value of “slow,” for the --profile option. The database logs all slow queries to the log, even when the profiler is not turned on. When the database profiler is on, mongod the profiler writes to the system.profile collection. See the profile command for more information on the database profiler.
You can use the command line tool mongotail to read the log from the profiler within a console and with a more readable format.
First activate the profiler and set the threshold in milliseconds for the profile to consider an operation to be slow. In the following example the threshold is set to 10 milliseconds for a database named "sales":
$ mongotail sales -l 1
Profiling level set to level 1
$ mongotail sales -s 10
Threshold profiling set to 10 milliseconds
Then, to see in "real time" the slow queries, with some extra information like the time each query took, or how many registries it need to "walk" to find a particular result:
$ mongotail sales -f -m millis nscanned docsExamined
2016-08-11 15:09:10.930 QUERY [ops] : {"deleted": {"$exists": false}, "prod_id": "367133"}. 8 returned. nscanned: 344502. millis: 12
2016-08-11 15:09:10.981 QUERY [ops] : {"deleted": {"$exists": false}, "prod_id": "367440"}. 6 returned. nscanned: 345444. millis: 12
....
In case somebody ends up here from Google on this old question, I found that explain really helped me fix specific queries that I could see were causing COLLSCANs from the logs.
Example:
db.collection.find().explain()
This will let you know if the query is using a COLLSCAN (Basic Cursor) or an index (BTree), among other things.
https://docs.mongodb.com/manual/reference/method/cursor.explain/
While you can obviously use Profiler a very neat feature of Mongo DB due to which I actually fall in love with it is Mongo DB MMS.
Takes less than 60 seconds and can manage from anywhere. I am sure you will Love it.
https://mms.mongodb.com/

can't reproduce/verify the performance claims in graph databases and neo4j in action books

UPDATE I have put up a follow up question that contains updated scripts and and a clearer setup on neo4j performance compared to mysql (how can it be improved?). Please continue there./UPDATE
I have some problems verifying the performance claims made in the "graph databases" book (page 20) and in the neo4j (chapter 1).
To verify these claims I created a sample dataset of 100000 'person' entries with 50 'friends' each, and tried to query for e.g. friends 4 hops away. I used the very same dataset in mysql. With friends of friends over 4 hops mysql returns in 0.93 secs, while neo4j needs 65 -75 secs (on repeated calls).
How can I improve this miserable outcome, and verify the claims made in the books?
A bit more detail:
I run the whole setup on a i5-3570K with 16GB Ram, using ubuntu12.04 64bit, java version "1.7.0_25" and mysql 5.5.31, neo4j-community-2.0.0-M03 (I get a similar outcome with 1.9)
All code/sample data can be found on https://github.com/jhb/neo4j-experiements/ (to be used with 2.0.0). The resulting sample data in different formats can be found on https://github.com/jhb/neo4j-testdata.
To use the scripts you need a python with mysql-python, requests and simplejson installed.
the dataset is created with friendsdata.py and stored to friends.pickle
friends.pickle gets imported to neo4j using import_friends_neo4j.py
friends.pickle gets imported to mysql using import_friends_mysql.py
I add indexes on t_user_friend.* in mysql
I added "create index on :node(noscenda_name) in neo4j
To make life a bit easier the friends.*.bz2 contain sql and cypher statements to create those datasets in mysql and neo4j 2.0 M3.
Mysql performance
I first warm mysql up by querying:
select count(distinct name) from t_user;
select count(distinct name) from t_user;
Then, for the real meassurment I do
python query_friends_mysql.py 4 10
This creates the following sql statement (with changing t_user.names):
select
count(*)
from
t_user,
t_user_friend as uf1,
t_user_friend as uf2,
t_user_friend as uf3,
t_user_friend as uf4
where
t_user.name='person8601' and
t_user.id = uf1.user_1 and
uf1.user_2 = uf2.user_1 and
uf2.user_2 = uf3.user_1 and
uf3.user_2 = uf4.user_1;
and repeats this 4 hop query 10 times. The queries need around 0.95 secs each. Mysql is configured to use a key_buffer of 4G.
neo4j performance testing
I have modified neo4j.properties:
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
and the neo4j-wrapper.conf:
wrapper.java.initmemory=2048
wrapper.java.maxmemory=8192
To warm up neo4j I do
start n=node(*) return count(n.noscenda_name);
start r=relationship(*) return count(r);
Then I start using the transactional http endpoint (but I get the same results using the neo4j-shell).
Still warming up, I run
./bin/python query_friends_neo4j.py 3 10
This creates a query of the form (with varying person ids):
{"statement": "match n:node-[r*3..3]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}
after the 7th call or so each call needs around 0.7-0.8 secs.
Now for the real thing (4 hops) I do
./bin/python query_friends_neo4j.py 4 10
creating
{"statement": "match n:node-[r*4..4]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}
and each call takes between 65 and 75 secs.
Open questions/thoughts
I'd really like see the claims in the books to be reproducable and correct, and neo4j faster then mysql instead of magnitudes slower.
But I don't know what I am doing wrong... :-(
So, my big hopes are:
I didn't do the memory settings for neo4j correctly
The query I use for neo4j is completely wrong
Any suggestions to get neo4j up to speed are highly welcome.
Thanks a lot,
Joerg
2.0 has not been performance optimized at all, so you should use 1.9.2 for comparison.
(if you use 2.0 - did you create an index for n.noscenda_name)
You can check the query plan with profile start ....
With 1.9 please use a manual index or node_auto_index for noscenda_name.
Can you try these queries:
start n=node:node_auto_index(noscenda_name={target})
match n-->()-->()-->m
return count(*);
Fulltext indexes are also more expensive than exact indexes, so keep the exact auto-index for noscenda_name.
can't get your importer to run, it fails at some point, perhaps you can share the finished neo4j database
python importer.py
reading rels
reading nodes
delete old
Traceback (most recent call last):
File "importer.py", line 9, in <module>
g.query('match n-[r]->m delete r;')
File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 99, in query
return self.call(payload)
File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 71, in call
self.transactionurl = result.headers['location']
File "/Library/Python/2.7/site-packages/requests-1.2.3-py2.7.egg/requests/structures.py", line 77, in __getitem__
return self._store[key.lower()][1]
KeyError: 'location'
Just to add to what Michael said, in the book I believe the authors are referring to a comparison that was done in the Neo4j in Action book - it's described in the free first chapter of that book.
At the top of page 7 they explain that they were using the Traversal API rather than Cypher.
I think you'll struggle to get Cypher near that level of performance at the moment so if you want to do those types of queries you'll want to use the Traversal API directly and then perhaps wrap it in an unmanaged extension.

Resources