Neo4j slows down after lots of inserts - Spring

I'm the owner of the Blockchain2graph project, which reads data from the Bitcoin Core REST API and inserts Blocks, Addresses and Transactions as graph objects in Neo4j.
After some imports, the process slows down until memory is full. I don't want to use CSV imports. My problem is not raw performance; my goal is to insert data without the application stopping because it runs out of memory, even if it takes quite a long time.
I'm using spring-boot-starter-data-neo4j.
In my code, I call session.clear() from time to time, but it doesn't seem to have any impact. After restarting Tomcat 8, things are fast again.

As your project is about mass inserts, I wouldn't use an OGM like Spring Data Neo4j for writing the data.
You don't want a session to keep your data around on the client.
Instead, use Cypher directly, sending the updates you get from the blockchain API as one batch per request; see my blog post for some examples (some of which we also use under the hood in SDN/Neo4j-OGM).
You can still use SDN for individual entity handling (CRUD); that's what OGMs are good for, in my book, to reduce boilerplate.
But for more complex read operations involving aggregation, filtering, projection and path matches, I'd still use Cypher in an annotated repository method, returning rows that can be mapped to a list of DTOs.
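To make the batching concrete, here is a minimal sketch of sending one parameterized UNWIND statement per request; it assumes a Neo4j 3.x server, the 1.x Bolt Java driver and illustrative node/property names (Block, hash, height) that are not taken from the question:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

public class BlockBatchWriter {

    // One Cypher statement applies the whole batch in a single request/transaction.
    private static final String BATCH_CYPHER =
            "UNWIND {batch} AS row "
          + "MERGE (b:Block {hash: row.hash}) "
          + "SET b.height = row.height";

    public static void main(String[] args) {
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));

        // Collect the updates from the blockchain API into one parameter list per request.
        List<Map<String, Object>> batch = new ArrayList<>();
        Map<String, Object> row = new HashMap<>();
        row.put("hash", "block-hash-placeholder"); // illustrative value
        row.put("height", 1);
        batch.add(row);

        Map<String, Object> params = new HashMap<>();
        params.put("batch", batch);

        try (Session session = driver.session()) {
            session.run(BATCH_CYPHER, params);
        }
        driver.close();
    }
}

This way no OGM session accumulates entities on the client; the server expands the list and applies all updates in one transaction.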

Related

Spring Data Query Execution Optimization: Parallel Execution of Hibernate @Query Method in JpaRepository

I have a Dashboard view which requires small sets of data from tables all over the database. I optimized the database queries (e.g. removed sub-queries). There are now ~20 queries, executed one after the other, each fetching a different data set from the database. Most of the HQL queries contain GROUP BY and JOIN clauses. The result is returned to the front-end via a Spring REST interface.
How do I optimize the execution of these custom queries? My initial thought was to run the database queries in parallel, but how do I achieve that? After doing some research I found the annotation @Async, which makes it possible to run methods in parallel. But does this work with Hibernate methods? Is there always a new database session created for every method annotated with @Query in a JpaRepository? And does running the queries in parallel actually reduce the overall execution time?
Another way to run the database calls in parallel is to split the Dashboard call into several single Ajax calls (every concern gets its own Ajax call). I didn't want to do that, because every time the dashboard is opened (or, e.g., the date range is changed), another 20 Ajax calls are made to fetch the new data. And the same question remains: does running SQL queries in parallel reduce the execution time on the database?
I have not yet added additional indices to the database; that will definitely be my next step. However, I'm interested in the performance impact of running the queries in parallel and in how to achieve this programmatically with Spring.
My project was initially generated by JHipster (Spring Boot, MariaDB, AngularJS, etc.).
First, running these SQL queries in parallel will not impact the database itself; it will only make the page load faster, so the design should focus on that.
I am posting this answer assuming that you have already made sure that you cannot combine these 20 queries, because the data they return is unrelated (no joins, views, etc.).
I would advise against using @Async, for two reasons.
Reason 1 - An asynchronous task is great when you want to fire off a bunch of tasks and forget about them, or when you know when all the tasks will be complete. Here, you will need to "wait" for all your asynchronous tasks to complete. How long should you wait? Until the slowest query is done?
Check this sample code for @Async (from the guides at spring.io: https://spring.io/guides/gs/async-method/):
// Wait until they are all done
while (!(page1.isDone() && page2.isDone() && page3.isDone())) {
    Thread.sleep(10); // 10-millisecond pause between each check
}
Will/should your service component wait on 20 Async DAO queries?
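If you did go down the @Async route anyway, Spring 4.2+ lets async methods return CompletableFuture, which makes that wait explicit instead of a polling loop. A rough sketch, where the service, controller and repository names are purely illustrative and not from the question:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Placeholder repositories; in a real app these would be Spring Data repositories with @Query methods.
interface CallMetricRepository { List<Object[]> findCallVolume(); }
interface AgentMetricRepository { List<Object[]> findAgentStats(); }

// Requires @EnableAsync on a configuration class; the two beans would normally live in separate files.
@Service
class DashboardQueryService {

    private final CallMetricRepository callMetrics;
    private final AgentMetricRepository agentMetrics;

    DashboardQueryService(CallMetricRepository callMetrics, AgentMetricRepository agentMetrics) {
        this.callMetrics = callMetrics;
        this.agentMetrics = agentMetrics;
    }

    @Async // each call runs on a separate thread from the task executor
    public CompletableFuture<List<Object[]>> loadCallVolume() {
        return CompletableFuture.completedFuture(callMetrics.findCallVolume());
    }

    @Async
    public CompletableFuture<List<Object[]>> loadAgentStats() {
        return CompletableFuture.completedFuture(agentMetrics.findAgentStats());
    }
}

@RestController
class DashboardController {

    private final DashboardQueryService queries;

    DashboardController(DashboardQueryService queries) {
        this.queries = queries;
    }

    @GetMapping("/api/dashboard")
    public Map<String, Object> dashboard() {
        // Both queries run concurrently, but the response still waits for the slowest one.
        CompletableFuture<List<Object[]>> calls = queries.loadCallVolume();
        CompletableFuture<List<Object[]>> agents = queries.loadAgentStats();
        CompletableFuture.allOf(calls, agents).join();

        Map<String, Object> result = new HashMap<>();
        result.put("callVolume", calls.join());
        result.put("agentStats", agents.join());
        return result;
    }
}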
Reason 2 - Remember that @Async just spawns the task off on a separate thread. Since you are going to work with JPA, keep in mind that entity managers are not thread-safe, and DAO classes will propagate transactions. Here is an example of the problems that may crop up: http://alexgaddie.blogspot.com/2011/04/spring-3-async-with-hibernate-and.html
IMHO, it is better to go with multiple Ajax calls, because that will keep your components cohesive. Yes, you will have 20 endpoints, but each would have a simpler DAO and simpler SQL, would be easy to unit test, and would return a data structure that is easier for the AngularJS widgets to handle/parse. When the UI triggers all 20 Ajax calls, the dashboard loads individual widgets as they become ready instead of loading everything at once. This will also help you extend your design in the future by optimizing the slower-loading sections of your dashboard (caching, indexing, etc.).
Bunching your DAO calls will only make the data structure complex and unit testing harder.
Normally it will be much faster to execute the queries in parallel. If you are using Spring Data and do not configure anything specific, your JPA provider (Hibernate) will use a connection pool that holds connections to your database. I think the pool holds 10 connections by default, so it is prepared to run 10 queries in parallel. How much faster the queries are when run in parallel depends on the database and on the structure of the tables and queries.
I think that using @Async is not best practice here. Defining 20 REST endpoints, each providing the result of a specific query, is a much better approach. That way you can simply create an Entity, Repository and REST endpoint class for each query; each query stays isolated and the code is less complex.
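As a rough illustration of that per-query approach (the entity, query and URL below are made up for the sketch, not taken from the question):

import java.util.List;

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Illustrative entity backing one dashboard widget.
@Entity
class CallRecord {
    @Id
    @GeneratedValue
    Long id;
    String agentId;
}

// One aggregate query gets its own repository method...
interface CallRecordRepository extends JpaRepository<CallRecord, Long> {

    @Query("select c.agentId, count(c) from CallRecord c group by c.agentId")
    List<Object[]> countCallsPerAgent();
}

// ...and its own REST endpoint, called by the dashboard via a dedicated Ajax request.
@RestController
class CallVolumeEndpoint {

    private final CallRecordRepository repository;

    CallVolumeEndpoint(CallRecordRepository repository) {
        this.repository = repository;
    }

    @GetMapping("/api/dashboard/call-volume")
    public List<Object[]> callVolume() {
        return repository.countCallsPerAgent();
    }
}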

Optimistic locking over multiple documents

I need to update several documents at once, like in an RDBMS transaction. The best way to do this for a single document in a key-value store like Couchbase seems to be optimistic locking. That would work for me; however, I need to update multiple documents at once.
I need all documents to be updated, or none. Is this possible in Couchbase or in some similarly highly scalable database?
(by the way, I'm using Go)
There are three approaches to resolve this:
Take another look at your key/document design and see whether it's possible to combine your multiple documents into one. Then you will be able to do a single transactional update in Couchbase.
Simulate a transaction: the effect can be simulated by writing a suitable document and view definition that produces the desired result while still requiring only a single document update to be applied.
Simulate a multi-phase transaction: use a transaction record to record each stage of the update process.

Torquebox Infinispan Cache - Too many open files

I looked around and apparently Infinispan has a limit on the number of keys you can store when persisting data to the FileStore. I get a "too many open files" exception.
I love the idea of Torquebox and was eager to slim down the stack and just use Infinispan instead of Redis. I have an app that needs to cache a lot of data. The queries are computationally expensive and need to be recomputed daily (phone and other productivity metrics per agent in a call center).
I don't run a cluster, though I understand the cache would survive as long as at least one app instance stayed running. I would still rather persist the cache. Has anybody run into this issue and found a workaround?
Yes, Infinispan's FileCacheStore used to have an issue with opening too many files. The new SingleFileStore in 5.3.x solves that problem, but it looks like Torquebox still uses Infinispan 5.1.x (https://github.com/torquebox/torquebox/blob/master/pom.xml#L277).
I am also using an Infinispan cache in a live application.
Basically, we store database queries and their results in the cache for tables that are not updatable and are small in data size.
There are two approaches to designing it:
Use the query as the key and its result data as the value.
This leads to too many cache entries when many different queries are placed in it.
Use a fixed key, say xyz, and a Map as the value (the Map holds the queries as keys and their data as values).
This leads to a single cache entry; whenever data is needed from this cache (I call it a query cache), retrieve the Map first using the key xyz, then look up the query in the Map itself.
We are using the second approach.
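A rough sketch of that second approach, assuming an embedded Infinispan cache manager in the 5.x line (as used by Torquebox at the time); the cache name, the xyz key and the query strings are just illustrative:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.infinispan.Cache;
import org.infinispan.manager.DefaultCacheManager;

public class QueryCacheExample {

    public static void main(String[] args) {
        DefaultCacheManager manager = new DefaultCacheManager();

        // Single entry: key "xyz" -> Map of (query string -> result rows).
        Cache<String, Map<String, List<Object[]>>> cache = manager.getCache("query-cache");

        Map<String, List<Object[]>> queryResults = cache.get("xyz");
        if (queryResults == null) {
            queryResults = new HashMap<>();
            // ... populate from the database for the small, read-only tables ...
            cache.put("xyz", queryResults);
        }

        // To read one query's data, fetch the Map first, then look the query up inside it.
        List<Object[]> rows = queryResults.get("select code, name from country");
        System.out.println(rows);

        manager.stop();
    }
}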

Neo4j - Using Java plugins with the REST API to improve performance?

I am building an application that constantly extracts a lot of data from a local MongoDB and puts it into Neo4j. Since many users also access the Neo4j database, from both a Django web server and other places, I decided to use Neo4j's REST interface.
The problem I am having is that, even with batch insertion, the Neo4j server is active over 50% of the time just trying to insert all the data from MongoDB. As far as I can see there may be some waiting time because of the HTTP requests; I have been trying to tweak this but have only gotten so far.
The question is: if I write a Java plugin (http://docs.neo4j.org/chunked/stable/server-plugins.html) that handles inserting the MongoDB extractions directly, will I then bypass the REST API, or will the Java plugin calls just be converted into regular REST API requests? Furthermore, will there be a performance boost from using the plugin?
The last question is how to optimize the speed of the REST API (so far I am performing around 1500 read/write operations, which include many "get_or_create_in_index" operations). Is there a sweet spot where the number of queries appended to one HTTP request keeps Neo4j busy until the next HTTP request arrives?
Update:
I am using Neo4j version 2.0
The data I am extracting consists of Bluetooth observations: the phone running the app I created scans all nearby phones. A single observation is then saved as a document in MongoDB and consists of the user's id, the time of the scan and a list of the phones/users seen in that scan.
In Neo4j I model all the users as nodes and I also model an observation between two users as a node so that it will look like this:
(user1)-[observed]->(observation_node)-[observed]->(user2)
Furthermore I index all user nodes.
When moving the observation from mongoDB to Neo4j, I do the following for each document:
Check in the index if the user doing the scan already has a node assigned, else create one
Then, for each observed user in the scan: A) check the index to see whether the observed user has a node, else create one; B) create an observation node and relationships between the users and the observation node, if these don't already exist; C) create a relationship between the observation node and a timeline node (the timeline is just a tree of nodes so that I can quickly find observations at a certain time).
As can be seen, I am doing quite a few lookups in the user index (3), some normal reads (2-3) and potentially many writes for each observation.
Each Bluetooth scan averages around 5-30 observations, and I batch 100 scans in a single HTTP request. This means that each request usually contains 5000-10000 updates.
What version are you using?
An unmanaged extension would use the underlying Java API, so it is much faster; you can also decide on the format and protocol of the data you push to it.
It is sensible to batch writes so that you don't incur transaction overhead for each tiny write. E.g. aggregating 10-50k updates in one operation helps a lot.
What is the concrete shape of the updates you do? Can you edit your question to reflect that?
Some resources for this:
http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2013/12/31/the-power-of-open-source-software/
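To give an idea of what the unmanaged-extension route looks like against Neo4j 2.0, here is a minimal sketch; the resource path, relationship type, labels and the omitted JSON parsing are assumptions, not taken from the question:

import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// Registered in conf/neo4j-server.properties, e.g.
// org.neo4j.server.thirdparty_jaxrs_classes=com.example.extension=/extensions
@Path("/observations")
public class ObservationExtension {

    private final GraphDatabaseService db;

    public ObservationExtension(@Context GraphDatabaseService db) {
        this.db = db;
    }

    @POST
    public Response insertBatch(String body) {
        // Parsing 'body' (e.g. a JSON array of scans) is omitted here; in the real import
        // the user nodes would be looked up (get-or-create) instead of always created.
        // All writes for one request share a single transaction, avoiding per-write overhead.
        try (Transaction tx = db.beginTx()) {
            Node scanner = db.createNode(DynamicLabel.label("User"));
            Node observed = db.createNode(DynamicLabel.label("User"));
            Node observation = db.createNode(DynamicLabel.label("Observation"));
            scanner.createRelationshipTo(observation, DynamicRelationshipType.withName("OBSERVED"));
            observation.createRelationshipTo(observed, DynamicRelationshipType.withName("OBSERVED"));
            tx.success();
        }
        return Response.ok().build();
    }
}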

Need: In-memory object database, transactional safety, indices, LINQ, no persistence

Anyone have an idea?
The issue is: I am writing a high-performance application. It has a SQL database which I use for persistence. In-memory objects get updated, then the changes are queued for a disk write (which is pretty much always an insert into a versioned table). The small window of risk is accepted - in case of a crash, program code will resync local state with external systems.
Now, quite often I need to run lookups on certain values, and it would be nice to have a standard interface. Basically a bag of objects, but with the ability to run queries efficiently against an in-memory index. For example, I have a table of "instruments" which all have a unique code, and I need to look up this code about 30,000 times per second as I get updates for every instrument.
Anyone have an idea for a decent high-performance library for this?
You should be able to use an in-memory SQLite database (:memory:) with System.Data.SQLite.
