I am performing mix of queries(reads/write/updates/deletes) to a single memgraph instance.
To do the same I am using Java client by Neo4j, all the APIs I am currently using are sync APIs from the driver.
Nature of queries in my case is such that I can execute them concurrently with no side effects. For better performance I am firing the queries in parallel. The error I am getting is for a CREATE operation where I am creating an edge between two nodes. This is consistent as I tried running this same setup multiple times and every time, all queries go through except it crashes when it comes to this create edge stage.
Query for reference:
OPTIONAL MATCH (node1) WHERE id(node1) = $nodeId1
OPTIONAL MATCH (node2) WHERE id(node2) = $nodeId2
CREATE (node1)-[:KNOWS]-> (node2)
I am not able to find any documentation around any such error. Please point me to some document like this or any workaround using which I can ask memgraph to put the query on hold if same objects are being operated by some other query.
One approach I am thinking is just implement retry for any such failed queries, but I am looking for a cleaner approach.
P.S. I was running the same setup on Neo4j earlier and did not encounter any problems with it.
Yep, in the case of this error, the code should retry the query. I think an equivalent issue can happen in Neo4j, but since Memgraph is more optimistic about locking, sometimes the error might happen more often. In general, the correct approach is to have error handling for this case implemented.
Related
I'm trying to implement a batch query interface with GraphQL. I can get a request to work synchronously without issue, but I'm not sure how to approach making the result asynchronous. Basically, I want to be able to kick off the query and return a pointer of sorts to where the results will eventually be when the query is done. I'd like to do this because the queries can sometimes take quite a while.
In REST, this is trivial. You return a 202 and return a Location header pointing to where the client can go to fetch the result. GraphQL as a specification does not seem to have this notion; it appears to always want requests to be handled synchronously.
Is there any convention for doing things like this in GraphQL? I very much like the query specification but I'd prefer to not leave the client HTTP connection open for up to a few minutes while a large query is executed on the backend. If anything happens to kill that connection the entire query would need to be retried, even if the results themselves are durable.
What you're trying to do is not solved easily in a spec-compliant way. Apollo introduced the idea of a #defer directive that does pretty much what you're looking for but it's still an experimental feature. I believe Relay Modern is trying to do something similar.
The idea is effectively the same -- the client uses a directive to mark a field or fragment as deferrable. The server resolves the request but leaves the deferred field null. It then sends one or more patches to the client with the deferred data. The client is able to apply the initial request and the patches separately to its cache, triggering the appropriate UI changes each time as usual.
I was working on a similar issue recently. My use case was to submit a job to create a report and provide the result back to the user. Creating a report takes couple of minutes which makes it an asynchronous operation. I created a mutation which submitted the job to the backend processing system and returned a job ID. Then I periodically poll the jobs field using a query to find out about the state of the job and eventually the results. As the result is a file, I return a link to a different endpoint where it can be downloaded (similar approach Github uses).
Polling for actual results is working as expected but I guess this might be better solved by subscriptions.
I have a project built using CQRS, but I can't figure out how to implement one use case.
The user needs to be able to make a Query which will return a set of data for them to view. However, I also need to save the data they got at the same time.
Is there a way to do this within a Query without violating CQRS' principles? Or would the Query and Command need to be two separate API calls one after another?
In CQRS it is your client who can do both command and queries. This client is not necessary a piece of UI.
It can be an API endpoint handler, which would
receive a query
forward it to the query endpoint
wait for the answer
send an answer to the caller
send a command to store the answer
Is there a way to do this within a Query without violating CQRS' principles?
It depends.
If "save the data" means "make some change to the domain model"... well, that would be pretty weird.
Asking a question should not change the answer. -- Bertrand Meyer
On the other hand, logging/telemetry are pretty normal ways to track the activity of an application, so that should be fine.
There are some realities of a distributed system on an unreliable network that you need to be aware of (what should the behavior be if the telemetry system is not available? What are the consequences of recording queries that don't actually reach the client (because the network is unreliable).
As #VoiceOfUnreason stated, it may be somewhat strange to effect domain changes when querying data.
However, it may be that you could swop that around.
For instance, perhaps one could query a forecast of sorts. We would want to store that forecast. It then seems as though the query results in us having to save the result. This appears to break CQS at some level since each query would result in a change of state.
If we swop that around and first request a forecast via the domain handling and then that produces a result, or even a pointer to the result, then the query would be something you could perform on the data multiple times without "breaking" CQS.
I am performing an import of data wrapped in a CMSTransactionScope.
What would be the most efficient and practical way to import data in parallel and rollback if any errors? The problem I see is that, with it being parallel, I don't know if I can have the inserted objects be part of the transaction if they are apart of a new thread.
Is there any way to do this or should it be handled differently?
If you're running the code in parallel in order to achieve better performance and you are basically inserting rows one by one then it's unlikely that it'll perform any better than it would while running in a single thread.
In this case I'd recommend using one thread in combination with CMSTransactionScope, and potentially ConnectionHelper.BulkInsert.
Anyway, if you still want to run your queries in parallel then you need to implement some kind of synchronization (locking, for instance) to ensure that all statements are executed before the code hits CMSTransactionScope.Commit() (this basically means a performance loss). Otherwise, queries would get executed in separate transactions. Moreover, you have to make sure that the CMSTransactionScope object always gets instantiated with the same IDataConnection (this should happen by default when you don't pass a connection to the constructor).
The second approach seems error prone to me and I'd rather take a look at different ways of optimizing the code (using async, etc.)
I wrote a quick ruby routine to load some very large csv data. I got frustrated with various out of memory issues trying to use load_csv so reverted to ruby. I'm relatively new to neo4j so trying Neography to just call a cypher query I create as a string.
The cypher code is using merge to add a relationship between 2 existing nodes:
cmdstr=match (a:Provider {npi: xxx}),(b:Provider {npi:yyy}) merge (a)-[:REFERS_TO {qty: 1}]->(b);
#neo.execute_query(cmdstr)
I'm just looping through the rows in a file running these. It fails after about 30000 rows with socket error "cannot assign requested address". I believe GC is somehow causing issues. However the logs don't tell me anything. I've tried tuning GC differently, and trying different amounts of heap. Fails in the same place everytime. Any help appreciated.
[edit]
More info - Running netstat --inet shows thousands of connections to localhost:7474. Does execute_query not reuse connections by design or is this an issue?
I've now tried parameters and the behavior is the same. How would you code this kind of query using batches and make sure you use the index on npi?
I was finally able to get this to work by changing the MERGE to a CREATE (deleting all relationships first). Still took a long time but it stayed linear relative to the number of relationships.
I also changed garbage collection from Concurrent/Sweep to parallelGC. The concurrent sweep would just fail and revert to a full GC anyway.
#wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+UseParallelGC
wrapper.java.additonal=-XX:+UseNUMA
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-Xmn630m
With Neo4j 2.1.3 the LOAD CSV issue is resolved:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "http://npi_data.csv" as line
MATCH (a:Provider {npi: line.xxx})
MATCH (b:Provider {npi: line.yyy})
MERGE (a)-[:REFERS_TO {qty: line.qty}]->(b);
In your ruby code you should use Cypher parameters and probably the transactional API.
Do you limit the concurrency of your requests somehow (e.g. single client)?
Also make sure to have an index or constraint created for your providers:
create index on :Provider(npi);
or
create constraint on (p:Provider) assert p.npi is unique;
I am using mongodb to store user's events, there's a document for every user, containing an array of events. The system processes thousands of events a minute and inserts each one of them to mongo.
The problem is that I get poor performance for the update operation, using a profiler, I notice that the WriteResult.getError is the one that incur the performance impact.
That makes sense, the update is async, but if one wants to retrieve the operation result he needs to wait until the operation is completed.
My question, is there a way to keep the update async, but only get an exception if error occurs (99.999 of the times there is no error, so the system waits for nothing). I understand it means the exception will be raised somewhere further down the process flow, but I can live with that.
Any other suggestions?
The application is written in Java so we're using the Java driver, but I am not sure it's related.
have you done indexing on your records?
it may be a problem to your performance.
if not done before you should do Indexing on ur collection like
db.collectionName.ensureIndex({"event.type":1})
for more help visit http://www.mongodb.org/display/DOCS/Indexes