Speed decrease on volatile neo4j dataset - performance

I'm using neo4j in one of my projects and have noticed that my local database (used to run test suites) becomes slower and slower over time. This is a low-priority issue, as it currently does not seem to occur during real-world use (outside of running huge test suites), but for the goal of improving neo4j I figured it would be best to post it nonetheless :)
As it currently stands, these are my findings:
the speed decrease is linked to the amount of tests executed (and therefore, the amount of created/deleted nodes)
the db size increases, even though each test suite clears the database* after use (indicating dead nodes remain)
deleting the graph.db file solves the issue (further proof for the dead nodes theory)
Although the problem can easily be solved in an acceptable way for a test database, I'm still worried about the production implications of this symptom for long-running databases with volatile data. Granted, a database with data as volatile as the test data is a borderline case; even so, it shouldn't be a problem at all. At minimum, a production-ready solution (I'm thinking dead-node pruning) should be available, yet I can find nothing of the sort in the documentation.
Is this a known issue? I couldn't find any reference to similar issues. Any help in locating the exact cause would be greatly appreciated, as I'd like to contribute a patch if I can find (and solve) the actual problem.
*) the database is cleared using two separate Cypher commands (to prevent occasional occurrences of issue 27); the following Cypher statements are run in order:
MATCH ()-[r]-() DELETE r
MATCH (n) DELETE n

I've experienced the same behavior as well. We were running a heavy calculation script every 15 minutes on the entire database, which produced huge (logical) log files that seemed to degrade performance. To keep the log files in check, you need to set the keep_logical_logs property. For tests, the following might be a good setting:
keep_logical_logs=24 hours
For tests, you'd also want to consider ImpermanentGraphDatabase if an embedded database is an option. You can get it with
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-kernel</artifactId>
    <version>2.0.1</version>
    <classifier>tests</classifier>
</dependency>

Related

Random/Inconsistent Code Run Times - Parallel HPC

I've been running some tests on an HPC. I have a code and if it's executed in serial, the run times are completely consistent. This wasn't always the case, but I included commands in my batch files so that it reserves an entire node and all its memory. Doing this allowed for almost perfectly consistent code execution times.
However, now that I am doing small scale parallel tests, the code execution times seem random. I would expect there to be some variation now that parallelization has been introduced, but the scale of randomness seems quite bizarre.
No other jobs are performed on the node so it should be fine - when in serial it is very consistent, so it must be something to do with the parallelization.
Does anyone know what could cause this? I've included a graph showing the execution times - there is a pretty clear average, but also major outliers. All results produced are identical and correct.
I'm under an NDA so cannot include much info about my code. Please feel free to ask questions and I'll see if I can help. Apologies if I'm not allowed to answer!
I'm using Fortran 90 as the main code language, and the HPC uses Slurm. NTASKS = 8 for these tests, but the randomness is present whenever NTASKS > 1. The number of tasks and the degree of randomness don't seem particularly linked; it's simply that the randomness occurs whenever the code runs in parallel. I'm using Intel's auto-parallelization feature rather than OpenMP/MPI.
Thanks in advance
SOLVED!!!! Thanks for your help everyone!
I did small scale tests 100 times to get to the root of the problem.
As the execution times were rather small, I noticed that the larger outliers (longer run times) often occurred when a lot of new jobs from other users were submitted to the HPC. This made a lot of sense and wasn't particularly surprising.
The main reason these results really confused me was because of the smaller outliers (much quicker run times). It made sense that sometimes it would take longer to run if it was busy, but I just couldn't figure out how sometimes it ran much quicker, but still giving the same results!
Probably a bit of a rookie error, but it turns out not all nodes are equal on our HPC! About 80% of the nodes are identical, giving roughly the same run times (or longer if busy). BUT, the newest 20% (i.e. the highest node numbers, Node0XX) must be higher-performance.
I checked the 'sacct' Slurm data and job run times, and can confirm all faster execution time outliers occurred on these newer nodes.
Very odd situation and something I hadn't been made aware of. Hopefully this might be able to help someone out if they're in a similar situation. I spent so long checking source codes/batchfiles/code timings that I hadn't even considered the HPC hardware itself. Something to keep in mind.
I did much longer (about an hour) tests and the longer execution time outliers didn't really exist (because the small queuing penalty was now relatively small in comparison to total execution time). But, the much quicker execution time outliers still occurred. Again, I checked the account data and these outliers always occurred on the high node numbers.
Hopefully this can help at least one person with a similar headache.
Cheers!

Hazelcast: What would be the implications of adding indexes to huge existing IMaps?

Given 4-5 nodes with many IMaps holding lots of data, some of the predicate queries have started to become significantly slow. One solution for this performance issue (as I see it) could be adding indexes. However, this data is part of a sensitive system which is currently being used in production.
Before adding indexes, I was wondering what would be the consequences of doing it on huge IMaps? (would it lock the entire map ?; would it bring down the entire system?; etc.) Hazelcast documentation includes information about how to do it, but doesn't give any other explanation.
If you want to add the index in runtime this is what will happen:
the AddIndexOperation will be executed on every partition
during the execution of the AddIndexOperation the partition will be blocked until all partition data are iterated and added to the index.
Queries won't be blocked in this timeframe - but get/put operations will.
I would recommend doing it in the "maintenance window" where you have the smallest load.
"Lots of data" is relative - just execute a test in your dev environment with exactly the same amount of data to see how long adding an index will take in your environment.
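This isn't Hazelcast's actual implementation, just a plain-Java sketch of the trade-off described above: building an index costs one full iteration over the data (the phase during which the partition is blocked), after which predicate queries on that attribute become direct lookups instead of full scans. The employee/city data is made up for illustration.

```java
import java.util.*;

public class IndexSketch {
    public static void main(String[] args) {
        // Hypothetical IMap contents: entry key -> "city" attribute value.
        Map<Integer, String> employees = new LinkedHashMap<>();
        employees.put(1, "Berlin");
        employees.put(2, "Madrid");
        employees.put(3, "Berlin");

        // Without an index, a predicate query must scan every entry.
        List<Integer> scanResult = new ArrayList<>();
        for (Map.Entry<Integer, String> e : employees.entrySet()) {
            if (e.getValue().equals("Berlin")) {
                scanResult.add(e.getKey());
            }
        }

        // Building the index iterates the whole data set once -- the
        // blocking phase the answer above describes per partition.
        Map<String, List<Integer>> cityIndex = new HashMap<>();
        for (Map.Entry<Integer, String> e : employees.entrySet()) {
            cityIndex.computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                     .add(e.getKey());
        }

        // Afterwards, the same predicate is a direct lookup.
        List<Integer> indexedResult = cityIndex.get("Berlin");

        System.out.println(scanResult + " " + indexedResult);
    }
}
```

The one-off iteration cost is why running it in a maintenance window matters: it scales with the partition's data size, while every query after that pays only the lookup cost.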

How to test tuned Oracle SQL and how to clear the system/hardware buffer?

I want to know the right way to test SQL statements before and after tuning.
The problem is that once I've executed the original SQL, the results come back too fast for the tuned SQL, because the data is already cached.
I found below...
How to clear all cached items in Oracle
I flushed the data buffer cache and the shared pool, but it still didn't work.
I guess this answer from that question is related to what I want to know more:
Keep in mind that the operating system and hardware also do caching which can skew your results.
Oracle's version is 11g and Server is HP-UX 11.31.
If the server were Linux, I could have tried clearing the buffer using '/proc/sys/vm/drop_caches' (though I'm not sure it would work).
I've been searching for a solution to this problem for quite a long time. Has anyone else run into this kind of problem?
thanks
If your query is such that the results are being cached in the file system, which your description would suggest, then the query is probably not a "heavy-hitter" overall. But if you were testing in isolation, with not much activity on the database, when the SQL is run in a production environment performance could suffer.
There are several things you can do to determine which version of two queries is better. In fact, entire books have been written on just this topic. But to summarize:
Before you begin, ensure statistics on the tables and indexes are up to date.
See how often the SQL will be executed in the grand scheme of things. If it runs once or twice a day, and takes 2 seconds to run, don't bother trying to tune.
Do an explain plan on both and look at the estimated costs and number of steps.
Turn on tracing for both optimizer steps and execution statistics, and compare.

Hibernate Search Automatic Indexing

I am working on developing an application which caters to about 100,000 searches everyday. We can safely assume that there are about the same number of updates / insertions / deletions in the database daily. The current application uses native SQL and we intend to migrate it to Hibernate and use Hibernate Search.
As there are continuous changes in the database records, we need to enable automatic indexing. The management has concerns about the performance impact automatic indexing can cause.
It is not possible to have a scheduled batch indexing as the changes in the records have to be available for search as soon as they are changed.
I have searched to look for some kind of performance statistics but have found none.
Can anybody who has already worked on Hibernate Search and faced a similar situation share their thoughts?
Thanks for the help.
Regards,
Shardul.
It might work fine, but it's hard to guess without a baseline. I have experience with even more searches / day and after some fine tuning it works well, but it's impossible to know if that will apply for your scenario without trying it out.
If normal tuning fails and NRT doesn't prove fast enough, you can always shard the indexes, use a multi-master configuration, and plug in a distributed second-level cache such as Infinispan: all combined, the architecture can achieve linear scalability, provided you have the time to set it up and reasonable hardware.
It's hard to say what kind of hardware you will need, but it's a safe bet that it will be more efficient than native SQL solutions. I would suggest making a POC and seeing how far you can get on a single node; if the kind of queries you have are a good fit for Lucene, you might not need more than a single server. Beware that Lucene is much faster at queries than at updates, so since you estimate you'll have the same number of writes and searches, the problem is unlikely to be the number of searches/second, but rather the writes (updates)/second and the total data (index) size. The latest Hibernate Search introduced an NRT index manager, which suits such use cases well.

Is there a major performance gain by using stored procedures?

Is it better to use a stored procedure or doing it the old way with a connection string and all that good stuff? Our system has been running slow lately and our manager wants us to try to see if we can speed things up a little and we were thinking about changing some of the old database calls over to stored procedures. Is it worth it?
The first thing to do is check the database has all the necessary indexes set up. Analyse where your code is slow, and examine the relevant SQL statements and indexes relating to them. See if you can rewrite the SQL statement to be more efficient. Check that you aren't recompiling an SQL (prepared) statement for every iteration in a loop instead of outside it once.
Moving an SQL statement into a stored procedure isn't going to help if it is grossly inefficient in implementation. However the database will know how to best optimise the SQL and it won't need to do it repeatedly. It can also make the client side code cleaner by turning a complex SQL statement into a simple procedure call.
I would take a quick look at Stored Procedures are EVIL.
So long as your calls are consistent the database will store the execution plan (MS SQL anyway). The strongest remaining reason for using stored procedures are for easy and sure security management.
If I were you, I'd first look at adding indexes where required. Also run a profiling tool to examine what is taking long and whether that SQL needs to be changed, e.g. adding more WHERE clauses or restricting the result set.
You should consider caching where you can.
Stored procedures will not make things faster.
However, rearranging your logic will have a huge impact. The tidy, focused transactions that you design when thinking of stored procedures are hugely beneficial.
Also, stored procedures tend to use bind variables, where other programming languages sometimes rely on building SQL statements on-the-fly. A small, fixed set of SQL statements and bind variables is fast. Dynamic SQL statements are slow.
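The bind-variable point can be made concrete with a sketch. The `customers` table and the queries are hypothetical; the point is only that dynamic SQL produces a distinct statement text per value (each one hard-parsed separately), while a parameterized statement keeps one fixed text whose cached plan can be reused (with JDBC you would pass the fixed text to a PreparedStatement and bind each id with setInt).

```java
public class BindVariablesSketch {
    public static void main(String[] args) {
        int[] ids = {101, 102, 103};

        // Dynamic SQL: the statement text changes with every value, so the
        // database sees three distinct statements and parses each anew.
        for (int id : ids) {
            String dynamic = "SELECT name FROM customers WHERE id = " + id;
            System.out.println(dynamic);
        }

        // Bind variable: one fixed statement text, parsed once and reused;
        // only the bound value differs between executions.
        String parameterized = "SELECT name FROM customers WHERE id = ?";
        System.out.println(parameterized);
    }
}
```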
An application which is "running slow lately" does not need coding changes.
Measure. Measure. Measure. "slow" doesn't mean much when it comes to performance tuning. What is slow? Which exact transaction is slow? Which table is slow? Focus.
Control all change. All. What changed? OS patch? RDBMS change? Application change? Something changed to slow things down.
Check for constraints in scale. Is a table slowing down because 80% of the data is history that you use for reporting once a year?
Stored procedures are never the solution to performance problems until you can absolutely point to a specific block of code which is provably faster as a stored procedure.
Stored procedures can really help if they avoid sending huge amounts of data and/or doing round trips to the server, so they can be valuable if your application has one of these problems.
After you finish your research you will realize there are two extreme views at opposite ends of the spectrum. Historically the Java community has been against stored procs due to the availability of frameworks such as Hibernate; conversely, the .NET community has used stored procs more, a legacy that goes as far back as the VB5/6 days. Put all this information in context and stay away from the extreme opinions on either side of the coin.
Speed should not be the primary factor when deciding for or against stored procs. You can achieve stored-proc performance using inline SQL with Hibernate and other frameworks. Consider maintenance, and which other programs such as reports and scripts could use the same stored procs as your application. If your scenario requires multiple consumers of the same SQL code, stored procedures are a good candidate and maintenance will be easier. If this is not the case, and you decide to use inline SQL, consider externalizing it in config files to facilitate maintenance.
At the end of the day, what counts is what will make your particular scenario a success for your stakeholders.
If your server is getting noticeably slower in your busy season, it may be because of saturation rather than anything inefficient in the database. Basic queueing theory tells us that a server gets hyperbolically slower as it approaches saturation.
The basic relationship is 1/(1-X) where X is the proportion of load. This describes the average queue length or time to wait before being served. Therefore a server that is getting saturated will slow down very rapidly when the load spikes.
A server that is 25% loaded will have an average service time of 1.333K for some constant K (loosely, K is the time for the machine to perform one transaction). A server that is 50% loaded will have an average service time of 2K and a server that is 90% loaded will have an average service time of 10K. Given that the slowdowns are hyperbolic in nature, it often doesn't take a large change in overall load to produce a significant degradation in response time.
Obviously this is somewhat simplistic as the server will be processing multiple requests concurrently (there are more elaborate queuing models for this situation), but the broad principle still applies.
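The hyperbolic shape of the 1/(1-X) relationship above is easy to see numerically; a few sample utilisation levels (the 25%/50%/90% figures from the text, plus 95% to show how steep it gets) are enough:

```java
public class QueueSketch {
    // Average service-time multiplier 1/(1 - X) for utilisation X,
    // per the simple queueing relationship described above.
    static double multiplier(double load) {
        return 1.0 / (1.0 - load);
    }

    public static void main(String[] args) {
        double[] loads = {0.25, 0.50, 0.90, 0.95};
        for (double x : loads) {
            long pct = Math.round(x * 100);
            // Round to two decimals for display; K is the constant
            // time for one transaction, as in the text.
            double m = Math.round(multiplier(x) * 100) / 100.0;
            System.out.println("load " + pct + "% -> " + m + "K");
        }
    }
}
```

Note how going from 90% to 95% load doubles the average service time (10K to 20K) even though the load itself only rose by five points: that is the "hyperbolic" degradation near saturation.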
So, if your server is experiencing transient loads that are saturating it, you will experience patches of noticeable slow-down. Note that these slow-downs need only be in one bottlenecked area of the system to slow the whole process down. If you are only experiencing this now in a busy season there is a possibility that your server has simply hit a constraint on a resource, rather than being particularly slow or inefficient.
Note that this possibility is not antithetical to the possibility of inefficiencies in the code. You may find that the way to ease the bottleneck is to tune some of your queries.
In order to tell if the system is bottlenecked, start gathering profiling information. If you can find resources with a large number of waits, this should give you a good starting point.
The final possibility is that you need to upgrade your server. If there are no major inefficiencies in the code (this might well be the case if profiling doesn't indicate any disproportionately large bottlenecks) you may simply need bigger hardware. I have no idea what your volumes are, but don't discount the possibility that you may have outgrown your server.
Yes, stored procs are a step toward achieving good performance. The main reason is that stored procedures can be pre-compiled and their execution plans cached.
You however need to first analyse where your performance bottlenecks are really - so that you approach this exercise in a structured way.
As suggested in one of the other responses, try to analyse where the problem is using a profiler tool - e.g. do you need to create indexes?
Cheers
Like all of the above posts suggest, you first want to clean up your SQL statements and have appropriate indexes. Caching can be tricky; I can't comment unless I have more detail on what you are trying to accomplish.
But one thing about sprocs: make sure you don't let them generate dynamic SQL statements, because for one, it will be pointless, and it can leave you subject to SQL injection attacks... this has happened in one of the projects I looked into.
I would recommend sprocs for updates mainly, and then select statements.
good luck :)
You can never say in advance. You must do it and measure the difference because in 9 out of 10 cases, the bottleneck is not where you think.
If you use a stored procedure, you don't have to transmit the data. DBs are usually slow at executing [EDIT]complex[/EDIT] stored procedures [EDIT]with loops, higher math, etc[/EDIT]. So it really depends on how much work you would need to do, how slow your network is, how fast the DB executes this particular code, etc.
