I am comparing graph databases, so I tried Oracle Graph Studio. However, when I ran a query that takes less than 20 ms on most other databases (Neptune, Neo4j, TigerGraph), it takes over a second in Graph Studio. I tried other queries with similar results (execution times greater than 1 second). So I am asking: is there a way to optimize queries in Graph Studio, or is there some fixed overhead of roughly a second to overcome, so that more complex queries on larger databases become comparable to the competitors, as Oracle claims?
Most of my other databases used 2 vCPUs while I only use 1 for Graph Studio, so could that also be a factor in the almost 100x slower performance?
I am pretty new to Elasticsearch.
Right now I have a pretty expensive search query: it uses nested queries, highlighting, fuzziness and so on. For now I can't change that.
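To give an idea of the shape of the query (the real one is bigger, and the index and field names below are made up), it is roughly something like this, shown here with the Python client:

```python
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")

# Hypothetical index and fields; the point is the combination of a nested
# query, fuzzy matching, and highlighting in one request.
body = {
    "query": {
        "bool": {
            "must": [
                # deliberately misspelled term to exercise fuzzy matching
                {"match": {"title": {"query": "elasticsearh", "fuzziness": "AUTO"}}},
                {
                    "nested": {
                        "path": "comments",
                        "query": {"match": {"comments.text": "performance"}},
                    }
                },
            ]
        }
    },
    "highlight": {"fields": {"title": {}}},
}

response = es.search(index="documents", body=body)
print(response["hits"]["total"])
```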
So I decided to try upgrading my cluster. Previously it had 3 nodes with 16 GB RAM and 0.5 CPU each.
Now I have upgraded it to 6 nodes with 16 GB RAM and 3 CPUs each.
I was hoping for some speedup on queries, but I did not get it. For some reason it actually got slower: queries which took 12 seconds now take about 17 seconds.
Is there anything more I have to do after increasing the number of nodes? After reading some docs I thought it was all automatic.
I will share the result of the /_nodes query in case it gives more insight. I have to use an external service to share the JSON because of the character limit:
https://wtools.io/paste-code/bFUC
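For what it's worth, a quick way to see whether the shards actually spread onto the new nodes is something like this (again the Python client; the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Overall cluster state: it should go green once replicas are reassigned.
print(es.cluster.health())

# Which node each shard of the index ended up on; if everything still sits on
# the original three nodes, the new nodes are not being used for this index.
for shard in es.cat.shards(index="documents", format="json"):
    print(shard["index"], shard["shard"], shard["prirep"], shard["node"])
```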
I'm using a batch inserter to create a database with about 1 billion nodes and 10 billion relationships. I've read in multiple places that it is preferable to sort the relationships in min(from, to) order (which I didn't do), but I haven't grasped why this practice is optimal. I originally thought it only aided insertion speed, but when I brought the database up, traversal was very slow. I realize there can be many reasons for that, especially with a database this size, but I want to be able to rule out the way I'm storing relationships.
Main question: does it kill traversal speed to insert relationships in a very "random" order because of where they will be stored on disk? I'm thinking that maybe when it tries to traverse nodes, the relationships are too fragmented. I hope someone can enlighten me about whether this would be the case.
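For reference, the recommended ordering I keep seeing described would look roughly like this (a sketch with made-up node IDs, not my actual insert code):

```python
# Relationships as (from_node_id, to_node_id) pairs.
relationships = [(42, 7), (3, 99), (7, 42), (1, 2)]

# Sort by min(from, to) first, then max(from, to), so all relationships that
# touch a low-numbered node end up next to each other in insertion order.
relationships.sort(key=lambda r: (min(r), max(r)))
print(relationships)  # [(1, 2), (3, 99), (42, 7), (7, 42)]
```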
UPDATES:
The use case is pretty much the basic Neo4j friends-of-friends example, using Cypher via the REST API for querying.
Each node (person) is unique and has a bunch of "knows" relationships for the people they know. Although I have a billion nodes, all 10 billion relationships come from about 30 million of the nodes. So any starting node I use in my query has an average of about 330 relationships coming from it.
In my initial tests, even getting 4 non-ordered friends of friends results was incredibly slow (100+ seconds on average). Of course, after the cache was warmed up for each query, it was fairly quick, but the graph is pretty random and I can't have the whole relationship store in memory.
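To be concrete, the query is essentially the stock friends-of-friends pattern, issued over the REST endpoint; a rough sketch (1.9-era Cypher, with a made-up node index and property name):

```python
import requests

# Neo4j 1.9's REST Cypher endpoint; "people" is a hypothetical node index.
cypher = """
START person = node:people(name = {name})
MATCH person -[:KNOWS]- friend -[:KNOWS]- foaf
WHERE foaf <> person
RETURN DISTINCT foaf
LIMIT 4
"""

response = requests.post(
    "http://localhost:7474/db/data/cypher",
    json={"query": cypher, "params": {"name": "some person"}},
)
print(response.json()["data"])
```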
Some of my system details, if that's needed:
- Neo4j 1.9.RC1
- Running on a Linux server with 128 GB of RAM, 8 cores, and a non-SSD hard drive
I have not worked with Neo4j at such a large scale, but as far as I know the insertion order won't make much difference to speed. Could you provide any links stating that the order of insertion matters?
What matters in this case is whether the relationships are cached or not. Until the cache is fairly well populated, performance will be on the slower side. You should also set an appropriate cache size as soon as the index is created.
You should read this link regarding Neo4j performance.
Read the Neo4j documentation on batch insert and these SO questions for help with bulk inserts, if you haven't already.
We have a business requirement to show a cost summary for all our projects in a single table.
In order to tabulate these costs we have to query through all the client tasks, regions, job roles, pay rates, cost tables, deliverables, efforts, and hour records (client tasks are in the same table, tasks and regions are in the same table, and deliverables, effort, and hours are stored as monthly totals).
Since I have to query all of this before looping through everything, it hits a large number of script statements very quickly. Computationally it's like O(m * n * o * p), and on some of our projects all four variables grow very quickly. My estimates for the number of script statements this would execute have ranged from 90 million to 600 billion.
Using Batch Apex we could break this up by task regions into 200 batches, but that would only reduce the computational profile to 600 billion / 200 = 3 billion script statements per batch (still well above the Salesforce limit).
We have been playing around with using Informatica to do these massive calculations, but we have several problems: (1) our end users cannot wait more than five or so minutes, yet just transferring the data (90% of all records, if all the projects got updated at once) would take 15 minutes over Informatica or the web API; (2) we have noticed these massive calculations need to happen in several places (changing a deliverable forecast value, creating an initial forecast, etc.).
Is there a governor limit workaround that will meet our requirements here (a massive volume of data with a response in 5 or so minutes)? Is Force.com a good platform for us to use here?
This is the way I've been doing it for a similar calculation:
An ERD would help, but have you considered doing this in smaller pieces and with reports in salesforce instead of custom code?
By smaller pieces I mean: use roll-up summary fields to get some totals higher up in your tree of objects.
Or use Apex triggers so that, as hours are entered, cost * hours is calculated and placed onto the time record, and then rolled up to the deliverable.
Basically get your values calculated at the time the data is entered instead of having to run your calculations every time.
Then you can simply run a report that says show me all my projects and their total cost or total time because those total costs/times are stored/calculated already.
Roll-up summaries only work with master-detail relationships.
Triggers work anytime, but you'll want to account for insert and update as well as delete and undelete! Aggregate functions will be your friend, assuming the trigger context has fewer than 50,000 records to aggregate, which I'd hope it does, because if there are more than 50,000 time entries for a single deliverable, that's a BIG deliverable :)
Hope that helps a bit?
I have a big query over four tables and I want to optimize it.
The weird part is that when I get the execution plan without statistics, the cost is something like 1.2M. However, if I gather statistics for one of the tables involved in the query, the cost drops to 4K. But if I gather statistics on the other tables as well, the cost grows to 50K, so I am not sure what's happening.
Can anyone explain why giving the optimizer more statistics can actually increase the query cost?
The Cost Based Optimiser uses as much information as you can give it in order to calculate the cost of a plan. If you update (i.e. change) the statistics it uses, then obviously that will change the calculated cost of the plan.
It's not actually the gathering of stats that causes the cost to grow - it's how those stats have changed (whether up or down) that causes the calculated cost to change.
In the absence of statistics, Oracle may use heuristics, guesswork or a quick sample of the data (depending on the settings in your instance).
Generally, the better (more accurate or representative) the statistics, the more accurate the cost calculation.
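If you do end up refreshing statistics, it is usually best to gather them for all of the tables in the query in one go, so the optimizer is not mixing fresh statistics with guesses. A minimal sketch (assuming the python-oracledb driver; the schema and table names are placeholders for your four tables):

```python
import oracledb  # python-oracledb driver; connection details are placeholders

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
with conn.cursor() as cur:
    # Gather fresh statistics on every table involved in the query so the
    # optimizer costs the whole plan from a consistent picture.
    for table in ("ORDERS", "ORDER_ITEMS", "CUSTOMERS", "PRODUCTS"):
        cur.callproc("dbms_stats.gather_table_stats", ["APP", table])
```

Then re-run the same EXPLAIN PLAN you used before and compare the calculated cost.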
The cost based optimizer has its challenges. There are rounding errors that can have quite an impact on the decisions it makes. This is one of the reasons that SQL Plan Stability, introduced in 11g, is so nice. Forget about 10g if you can, or prepare for long debugging sessions.
At first use, a plan is generated based on the current statistics and executed. If the SQL is repeated, the statement and the plan are stored in a baseline. In the maintenance window, the most expensive plans are re-evaluated and, in many cases, a better plan can be provided. This is possible because at runtime the optimizer is limited in the time it is given to search for a plan, whereas in the maintenance window a lot more time can be spent finding the best plan.
In 11g, bind variable peeking is also fixed, and a single SQL statement can now have multiple plans, based on the values of the bind variables.
The query cost is based on many factors, where IO is a very important factor.
How are your tables filled, and where are the high water marks located? A table that is constantly filled and emptied can have its high water mark far beyond the data it currently holds, which inflates the I/O of full scans.
There are lots of bugs in the optimizer and lots of options controlled by hidden parameters; you could try to use them to tweak the behaviour. Upgrading to 11g might be a lot smarter, as it solves lots of performance problems for many applications.
I'm running a fairly substantial SSIS package against SQL 2008 - and I'm getting the same results both in my dev environment (Win7-x64 + SQL-x64-Developer) and the production environment (Server 2008 x64 + SQL Std x64).
The symptom is that initial data loading screams along at between 50K and 500K records per second, but after a few minutes the speed drops off dramatically and eventually crawls embarrassingly slowly. The database is in the Simple recovery model, the target tables are empty, and all of the prerequisites for minimally logged bulk inserts are being met. The data flow is a simple load from a RAW input file to a schema-matched table (i.e. no complex transforms of data, no sorting, no lookups, no SCDs, etc.).
The problem has the following characteristics:
- The problem persists no matter what the target table is.
- RAM usage is lowish (45%); there's plenty of spare RAM available for SSIS buffers or SQL Server to use.
- Perfmon shows buffers are not spooling, disk response times are normal, and disk availability is high.
- CPU usage is low (hovers around 25%, shared between sqlserver.exe and DtsDebugHost.exe).
- Disk activity is primarily on TempDB.mdf, but I/O is very low (< 600 Kb/s).
- OLE DB Destination and SQL Server Destination both exhibit this problem.
To sum it up, I would expect either disk, CPU or RAM to be exhausted before the package slows down, but instead it's as if the SSIS package is taking an afternoon nap. SQL Server remains responsive to other queries, and I can't find any performance counters or logged events that betray the cause of the problem.
I'll gratefully reward any reasonable answers / suggestions.
We finally found a solution... the problem lay in the fact that my client was using VMware ESX, and despite the VM reporting plenty of free CPU and RAM, the VMware gurus had to pre-allocate (i.e. guarantee) the CPU for the SSIS guest VM before it really started to fly. Without this, SSIS would be running but VMware would scale back the resources - an odd quirk, because other processes and software kept the VM happily awake. Not sure why SSIS was different, but as I said, the VMware gurus fixed this problem by reserving RAM and CPU.
I have some other feedback by way of a checklist of things to do for great performance in SSIS:
1. Ensure the SQL login has bulk-load permissions (e.g. the bulkadmin role), or the data load will be very slow. Also check that the target database uses the Simple or Bulk-Logged recovery model.
2. Avoid Sort and Merge components on large data; once they start spilling to disk, performance falls off a cliff.
3. Source sorted input data (sorted by the target table's primary key), disable non-clustered indexes on the target table, and set MaximumInsertCommitSize to 0 on the destination component. This bypasses TempDB and the log altogether.
4. If you cannot meet the requirements of point 3, then simply set MaximumInsertCommitSize to the same size as the data flow's DefaultMaxBufferRows property.
The best way to diagnose performance issues with SSIS Data Flows is with decomposition.
Step 1 - measure your current package performance. You need a baseline.
Step 2 - Back up your package, then edit it. Remove the Destination and replace it with a Row Count (or other end-of-flow-friendly transform). Run the package again to measure performance. Now you know the performance penalty incurred by your Destination.
Step 3 - Edit the package again, removing the next transform "up" from the bottom in the data flow. Run and measure. Now you know the performance penalty of that transform.
Step 4...n - Rinse and repeat.
You probably won't have to climb all the way up your flow to get an idea as to what your limiting factor is. When you do find it, then you can ask a more targeted performance question, like "the X transform/destination in my data flow is slow, here's how it's configured, this is my data volume and hardware, what options do I have?" At the very least, you'll know exactly where your problem is, which stops a lot of wild goose chases.
Are you issuing any COMMITs? I've seen this kind of thing slow down when the working set gets too large (a relative measure, to be sure). A periodic COMMIT should keep that from happening.
First thoughts:
Are the database files growing (without instant file initialization for MDFs)?
Is the upload batched/transactioned? (That is, is it one big transaction?)