how are neo4j caches speeding up queries? - performance

I am currently working on a project that uses Neo4j as its database, with queries that involve some hard relationship discovery, and after running performance tests we are having some issues.
We have found out that the cache influences the request time enormously (from 3000 ms to 100 ms or so). Running the same request twice results in one really slow response and a much faster second one. After some searching we found the warm-up method, which preloads all the nodes and relationships in the database with a query like this:
match (n)-[r]->() return count(1);
With the cache activated plus this warm-up query, our query times dropped a lot, but they were still not as fast as when the same query is run two, three or four times.
So we went on testing and searching for information until we saw that Neo4j is also somehow buffering the queries so that they are not compiled every time (using the Scala compiler, if I am right). I say somehow, because after intense testing I could conclude that Neo4j is compiling the query "on the fly".
Let me show a simplified example of what I mean:
(numbers are id attributes)
If I make a request like the following:
match (n:green {id: 1})-[r]->(:red)-[s]->(:green)<-[t]-(m:yellow {id: 7})
return count(m);
What I want to do is find out whether there is a connection between node 1 and node 7. As you can see, I have to discover a bunch of nodes and, more importantly, relationships, and the compile process looks more or less complicated, since the request took 1227 ms to complete. If I make exactly the same request again, I get a response time of about 5 ms, good enough to pass the performance testing. Definitely Neo4j or the Scala compiler was buffering the Cypher queries too.
After understanding that there is a compile process in the cypher request, I went deeper and started modifying only parts of an already buffered request. Changing the label or id parameter of the last node matched was also producing a delay, but only ~19 ms, still acceptable:
match (n:green {id: 1})-[r]->(:red)-[s]->(:green)<-[t]-(m:purple {id: 7})
return count(m);
However, when I restart the server, do warm-up and adjust the query so that the first node (labelled before as n) doesn't match, the query will respond very fast with 0 results so I can deduce that not all the query was parsed, since the first node didn't match and there is no need to go deeper in the tree.
I also tried with optional match, given that it returns null if no match is found, but that isn't working either.
First of all, I wanted to ask whether everything I have said so far based on my tests is correct and, if it is not, how it actually works. Secondly, what should I do (if there is a way) to cache everything at the beginning, when the server starts, or is there no way to avoid this delay? Unfortunately, the requirements of the project say that queries should perform well, even the first one (not to mention that the real scenario has thousands more relationships and nodes, making everything slower).

First of all you need to consider JVM warm-up: beware that classes are loaded lazily when needed (on your first query) and the JIT may only kick in after several thousand calls.
This
match (n)-[r]->() return count(1);
should properly warm up node and relationship cache, however I am not sure if it also loads all their properties and indexes. Also make sure that your data set fits in memory.
Providing values directly in the Cypher query like this: {id: 1}, instead of using parameters like {id: {paramId}}, means that whenever you change the value of id the query needs to be compiled again.
You can pass parameters this way in the shell:
neo4j-sh (?)$ export paramId=5
neo4j-sh (?)$ return {paramId};
==> +-----------+
==> | {paramId} |
==> +-----------+
==> | 5 |
==> +-----------+
==> 1 row
==> 4 ms
So if you need queries that perform well from the beginning:
- change your queries to use parameters
- execute your other queries at startup together with your warm-up query
EDIT: added information how to pass parameters in shell
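To illustrate the point about parameters, here is a toy sketch in Python. It is not Neo4j's actual implementation: the PlanCache class, the run helper and the idea that the cache is keyed by the exact query text are all assumptions made purely for illustration. It only shows why a query with literal values looks like a new query every time, while a parameterized one is compiled once.

```python
# Toy model of a query-plan cache keyed by the exact query text.
# Hypothetical names throughout; this is not how Neo4j is implemented.

class PlanCache:
    def __init__(self):
        self._plans = {}
        self.compilations = 0  # how many times we had to "compile"

    def plan_for(self, query_text):
        """Return a cached plan, compiling only on a cache miss."""
        if query_text not in self._plans:
            self.compilations += 1
            self._plans[query_text] = ("plan", query_text)
        return self._plans[query_text]


def run(cache, query_text, params=None):
    """Look up (or compile) the plan; parameter values are not part of the cache key."""
    cache.plan_for(query_text)
    return params or {}


# Literal values: every id produces a different query text.
literal_cache = PlanCache()
for node_id in (1, 2, 3):
    run(literal_cache, "match (n:green {id: %d}) return count(n);" % node_id)

# Parameters: one query text, different values.
param_cache = PlanCache()
for node_id in (1, 2, 3):
    run(param_cache, "match (n:green {id: {paramId}}) return count(n);",
        {"paramId": node_id})

print(literal_cache.compilations)  # 3
print(param_cache.compilations)    # 1
```

Under this model, running each parameterized query once at startup would be enough to pay the compilation cost before the first real request arrives.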

Related

ElasticSearch document refresh=true does not appear to work

In order to speed up searches on our website, I have created a small elastic search instance which keeps a copy of all of the "searchable" fields from our database. It holds only a couple million documents with an average size of about 1KB per document. Currently (in development) we have just 2 nodes, but will probably want more in production.
Our application is a "primarily read" application - maybe 1000 documents/day get updated, but they get read and searched 10's of thousands of times/day.
Each document represents a case in a ticketing system, and the case may change status during the day as users research and close cases. If a researcher closes a case and then immediately refreshes his queue of open work, we expect the case to disappear from their queue, which is driven by a query to our Elastic Search instance, filtering by status. The status is a field in the case index.
The complaint we're getting is that when a researcher closes a case, upon immediate refresh of his queue, the case still comes back when filtering on "in progress" cases. If he refreshes the view a second or two later, it's gone.
In an effort to work around this, I added refresh=true when updating the document, e.g.
curl -XPUT 'https://my-dev-es-instance.com/cases/_doc/11?refresh=true' -d '{"status":"closed", ... }'
But still the problem persists.
Here's the response I got from the above request:
{"_index":"cases","_type":"_doc","_id":"11","_version":2,"result":"updated","forced_refresh":true,"_shards":{"total":2,"successful":1,"failed":0},"_seq_no":70757,"_primary_term":1}
The response seems to verify that the forced_refresh request was received, although it does say out of total 2 shards, 1 was successful and 0 failed. Not sure about the other one, but since I have only 2 nodes, does this mean it updated the secondary?
According to the doc:
To refresh the shard (not the whole index) immediately after the operation occurs, so that the document appears in search results immediately, the refresh parameter can be set to true. Setting this option to true should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint. Note, getting a document using the get API is completely realtime and doesn’t require a refresh.
Are my expectations reasonable? Is there a better way to do this?
After more testing, I have concluded that my issue was due to application logic error, and not a problem with ElasticSearch. The refresh flag is behaving as expected. Apologies for the misinformation.

Elasticsearch high level REST client - Indexing has latency

We have finally started using the high level REST client, to ease the development of queries from a backend engineering perspective. For indexing, we are using client.update(request, RequestOptions.DEFAULT) so that new documents will be created and existing ones modified.
The issue we are seeing is that indexing is delayed by almost 5 minutes. I see that the client uses async HTTP calls internally, but that should not take so long. I looked for timing options inside the library and didn't find anything. Am I missing something, or is the official documentation missing this?
Since you have refresh_interval: -1 in your index settings, the index is never refreshed unless you do it manually, which is why you don't see the data just after it's been updated.
You have three options here:
A. You can call the _update endpoint with the refresh=true (or refresh=wait_for) parameter to make sure that the index is refreshed just after your update.
B. You can simply set refresh_interval: 1s (or any other duration that makes sense for you) in your index settings, to make sure the index is automatically refreshed on a regular basis.
C. You can explicitly call index/_refresh on your index to refresh it whenever you think is appropriate.
Option B is the one that usually makes sense in most use cases.
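For option A, the only change on the wire is the refresh query parameter on the write request. A minimal sketch in Python, using only the standard library: the with_refresh helper is a hypothetical name, and the host, index name and ES 7-style _update path in the example are assumptions, so substitute whatever update URL your version uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Values the refresh parameter accepts on write requests.
VALID_REFRESH = ("true", "false", "wait_for")


def with_refresh(url, policy="wait_for"):
    """Return url with its refresh query parameter set to policy."""
    if policy not in VALID_REFRESH:
        raise ValueError("unsupported refresh policy: %r" % (policy,))
    parts = urlsplit(url)
    # Drop any existing refresh parameter and keep the others in order.
    query = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "refresh"]
    query.append(("refresh", policy))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical host, index and document id.
update_url = with_refresh("http://localhost:9200/my-index/_update/1")
print(update_url)  # http://localhost:9200/my-index/_update/1?refresh=wait_for
```

wait_for makes the request block until a refresh has made the change searchable, instead of forcing a refresh as refresh=true does.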
There are several references to using refresh wait_for, but I had a hard time finding what exactly needed to be done in the high level REST client.
For all of you searching for this answer:
IndexRequest request = new IndexRequest(index, DOC_TYPE, id);
request.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);

Consistent N1QL Query Couchbase GOCB sdk

I'm currently implementing EventSourcing for my Go Actor lib.
The problem that I have right now is that when an actor restarts and needs to replay all its state from the event journal, the query might return inconsistent data.
I know that I can solve this using MutationToken
But, if I do that, I would be forced to write all events in sequential order, that is, write the last event last.
That way the mutation token for the last event would be enough to get all the data consistently for the specific actor.
This is however very slow, writing about 10 000 events in order, takes about 5 sec on my setup.
If I instead write those 10 000 events asynchronously, using goroutines, I can write all of the data in less than one sec.
But then the writes happen in a nondeterministic order and I can't know which mutation token I can trust.
e.g. Event 999 might be written before Event 843 due to goroutine scheduling, AFAIK.
What are my options here?
Technically speaking, MutationToken and asynchronous operations are not mutually exclusive. It may be possible to do this without a change to the client (I'm not sure), but the key here is to collect all the MutationToken responses and then issue the query with the highest sequence number per vbucket across all of them.
The key here is that given a single MutationToken, you can add the others to it. I don't directly see a way to do this, but since internally it's just a map it should be relatively straightforward and I'm sure we (Couchbase) would take a contribution that does this. At the lowest level, it's just a map of vbucket sequences that is provided to query at the time the query is issued.
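The merge described above can be sketched like this. It is written in Python only to show the idea, not the gocb API: the MutationToken tuple with just a vbucket id and a sequence number is a hypothetical, simplified shape, and merge_tokens is an illustrative name.

```python
from collections import namedtuple

# Hypothetical, simplified token: real tokens carry more fields.
MutationToken = namedtuple("MutationToken", ["vbucket_id", "seqno"])


def merge_tokens(tokens):
    """Return a map of vbucket id -> highest sequence number seen.

    The order in which the tokens arrive does not matter, so the
    writes that produced them can complete in any order.
    """
    highest = {}
    for token in tokens:
        current = highest.get(token.vbucket_id)
        if current is None or token.seqno > current:
            highest[token.vbucket_id] = token.seqno
    return highest


# Tokens as they might come back from out-of-order asynchronous writes.
tokens = [
    MutationToken(12, 999),
    MutationToken(12, 843),
    MutationToken(7, 5),
    MutationToken(12, 900),
]
merged = merge_tokens(tokens)
print(merged)  # {12: 999, 7: 5}
```

Each goroutine would hand its token to a collector, and the query would be issued once with the merged map instead of with a single "last" token.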

Duplicates when link-walking Riak using Ripple

I'm working on a project where I use Riak with Ripple, and I've stumbled on a problem.
For some reason I get duplicates when link-walking a structure of links. When I link walk using curl I don't get the duplicates as far as I can see.
The difference between my curl based link-walk
curl -v http://127.0.0.1:8098/riak/users/2306403e5177b4716da9df93b67300824aa2fd0e/_,projects,0/_,tasks,1
and my ruby ripple/riak-client based link walk
result = Riak::MapReduce.new(self.robject.bucket.client).
add(self.robject.bucket,self.key).
link(Riak::WalkSpec.new({:key => 'projects'})).
link(Riak::WalkSpec.new({:key => 'tasks', :bucket=>'tasks'})).
map("function(v){ if(!JSON.parse(v.values[0].data).completed) {return [v];} else { return [];} }", {:keep => true}).run
is as far as I can tell the map at the end.
However, the result of the map/reduce contains several duplicates. I can't wrap my head around why. For now I've settled for removing the duplicates based on the key, but I wish the Riak result didn't contain duplicates, since it seems like a waste to remove them at the end.
I've tried the following:
- Making sure there are no duplicates in the link sets of my ripple objects
- Loading the data without the map reduce, but the link walk contains duplicate keys.
Any help is appreciated.
What you're running into here is an interesting side-effect/challenge of Map/Reduce queries.
M/R queries don't have any notion of read quorum values, and they necessarily have to hit every object (within the limitations of input filtering, of course) on every node.
Which means, when N > 1, the queries have to hit every copy of every object.
For example, let's say N=3, as per default. That means, for each written object, there are 3 copies, one each on 3 different nodes.
When you issue a read for an object (let's say with the default quorum value of R=2), the coordinating node (which received the read request from your client) contacts all 3 nodes (and potentially receives 3 different values, 3 different copies of the object).
It then checks to make sure that at least 2 of those copies have the same values (to satisfy the R=2 requirement), returns that agreed-upon value to the requesting client, and discards the other copies.
So, in regular operations (reads/writes, but also link walking), the coordinating node filters out the duplicates for you.
Map/Reduce queries don't have that luxury. They don't really have quorum values associated with them -- they are made to iterate over every (relevant) key and object on all the nodes. And because the M/R code runs on each individual node (close to the data) instead of just on the coordinating node, they can't really filter out any duplicates intrinsically. One of the things they're designed for, for example, is to update (or delete) all of the copies of the objects on all the nodes. So, each Map phase (in your case above) runs on every node, returns the matched 'completed' values for each copy, and ships the results back to the coordinating node to return to the client. And since it's very likely that your N>1, there's going to be duplicates in the result set.
Now, you can probably filter out duplicates explicitly, by writing code in the Reduce phase, to check if there's already a key present and reject duplicates if it is, etc.
But honestly, if I was in your situation, I would just filter out the duplicates in ruby on the client side, rather than mess with the reduce code.
Anyways, I hope that sheds some light on this mystery.
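Client-side filtering by key is only a few lines. A minimal sketch, in Python rather than Ruby, and assuming each result is a dict with a "key" field (a hypothetical shape; adapt the lookup to whatever your client returns):

```python
def dedupe_by_key(results, key_field="key"):
    """Keep the first result seen for each key, preserving order."""
    seen = set()
    unique = []
    for item in results:
        key = item[key_field]
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical map/reduce output with one copy per replica.
raw = [
    {"key": "task-1", "completed": False},
    {"key": "task-2", "completed": False},
    {"key": "task-1", "completed": False},
    {"key": "task-1", "completed": False},
]
tasks = dedupe_by_key(raw)
print([t["key"] for t in tasks])  # ['task-1', 'task-2']
```

This does one pass over the results, so its cost is small next to the query itself.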

SubSonic AddMany() vs foreach loop Add()

I'm trying to figure out whether or not SubSonic's AddMany() method is faster than a simple foreach loop. I poked around a bit on the SubSonic site but didn't see much on performance stats.
What I currently have (.ForEach() just has some validation in it; other than that it works just like foreach(.....){ do stuff }):
records.ForEach(record =>
{
    newRepository.Add(record);
    recordsProcessed++;
    if (cleanUp) oldRepository.Delete<T>(record);
});
Which would change to
newRepository.AddMany(records);
if (cleanUp) oldRepository.DeleteMany<T>(records);
If you notice with this method I lose the count of how many records I've processed which isn't critical... But it would be nice to be able to display to the user how many records were moved with this tool.
So my questions boil down to: Would AddMany() be noticeably faster to use? And is there any way to get a count of the number of records actually copied over? If it succeeds can I assume all the records were processed? If one record fails, does the whole process fail?
Thanks in advance.
Just to clarify, AddMany() generates individual queries per row and submits them via a batch; DeleteMany() generates a single query. Please consult the source code and the generated SQL when you want to know what happens to your queries.
Your first approach is slow: 2*N queries. However, if you submit the queries using a batch it would be faster.
Your second approach is faster: N+1 queries. You can find how many will be added simply by enumerating 'records'.
If there is a risk of exceeding capacity limits on the size of a batch, then submit 50 or 100 at a time with little penalty.
Your final question depends on transactions. If the whole operation is one transaction, it will commit or abort as one. Otherwise, each query will stand alone. Your choice.
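The advice about submitting 50 or 100 at a time while keeping a count can be sketched as follows. This is Python rather than C#, and add_many is a hypothetical stand-in for a call such as newRepository.AddMany; the sketch says nothing about how SubSonic behaves internally.

```python
def chunks(records, size):
    """Yield consecutive slices of records with at most size items each."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def move_in_batches(records, add_many, size=100):
    """Call add_many once per batch and return how many records were sent."""
    processed = 0
    for batch in chunks(records, size):
        add_many(batch)            # one batched submit per chunk
        processed += len(batch)    # count only after the call returned
    return processed


# Stand-in for the repository call: just record each batch size.
batch_sizes = []
records = list(range(250))
moved = move_in_batches(records, lambda batch: batch_sizes.append(len(batch)))
print(moved)        # 250
print(batch_sizes)  # [100, 100, 50]
```

Because the counter is updated after each batch returns, it can also drive a progress display, and if a batch throws, the count shows how many records were submitted before the failure.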
