Vacuuming and ANALYZE: drastic change in query cost and performance

I have a PostgreSQL 9.6 installation and I am running into a weird case: if I run the same query (which has multiple joins) again after 10 to 15 minutes, its estimated query cost has increased by a few hundred, and it keeps on increasing.
I understand what vacuuming and ANALYZE do, but I am worried that the query cost starts increasing within a few minutes of running VACUUM ANALYZE. I am afraid this might lead to performance bottlenecks later.
PS: I have two tables. One is heavily written to (about 5 million records); the other is heavily updated (70 K records, with PostGIS; this table mostly gets updates on its lat/lon and geom columns).
Does this mean I should have autovacuum run every few hours?
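To see what is driving the changing estimates, a minimal check (the table names here are hypothetical) is to look at the dead-tuple counts and vacuum/analyze timestamps that feed the planner's statistics:
-- Dead tuples and the last (auto)vacuum/analyze runs for the two tables;
-- growing n_dead_tup between ANALYZE runs is what shifts the estimated cost.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('heavily_written_table', 'heavily_updated_table');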

Make autovacuum more aggressive; but if you think autovacuum is using up too many resources (by looking at CPU and I/O usage), you can tweak the autovacuum_vacuum_cost_delay and autovacuum_vacuum_threshold parameters at the table level.
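For example, a sketch of table-level settings (the table name and values are placeholders to tune for your workload):
-- More aggressive autovacuum for the heavily updated table only.
ALTER TABLE heavily_updated_table SET (
  autovacuum_vacuum_cost_delay = 10,      -- ms; lower = vacuum works harder
  autovacuum_vacuum_threshold = 1000,     -- minimum dead tuples before a vacuum
  autovacuum_vacuum_scale_factor = 0.05   -- or after ~5% of the rows are dead
);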

Related

Why do these 2 similar queries in Snowflake have very different performance?

See the profile images of these 2 Snowflake queries. They do similar work (update the same 370M-row table, joined with a small table: 21k rows in one case, 9k in the other), but the performance differs by 5x.
The first query finished in around 15 minutes, using one XSMALL VDW:
[Query profile screenshot: fast query finished in around 15 minutes]
This query updates the same 370M-row table, joined with an even smaller DIM table of 9k rows, but it is still running after 1 hour 30 minutes:
[Query profile screenshot: still running after 90 minutes]
From the query profiles, I cannot explain why the 2nd query runs so much slower than the first one. The 2nd one was run right after the first one.
Any idea? Thanks
In the second query you can see that bytes spilled to local storage is 272 GB. This means the working set was too large to fit in the cluster's memory and so had to spill to locally attached SSD. From a performance perspective this is a costly operation, and it is probably why the 2nd query took so long to run (query 1 only had 2 GB of spilling). The easiest solution is to increase the size of the VDW - or you could rewrite the query:
https://docs.snowflake.net/manuals/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
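As a sketch of the first option (the warehouse name and target size are placeholders):
-- Resize the warehouse so the working set fits in memory instead of spilling.
ALTER WAREHOUSE my_xsmall_wh SET WAREHOUSE_SIZE = 'SMALL';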
Note also that query 1 managed to read 100% of its data set from VDW memory - which is very efficient - whereas query 2 could only find about half of its data set there, and so had to perform remote I/O (reads from cloud storage) to get the rest. Queries run before queries 1 and 2 had pulled that data into the local VDW cache, which retains it on an LRU basis.
The join for the slow query is producing more rows than are flowing into it. This can be what you want, but often it's caused by duplicate values in the tables. I'd do a sanity check on whether that's expected here.
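A minimal sanity check (the table and column names are hypothetical) for duplicated join keys in the small dimension table, which would multiply the rows coming out of the join:
-- Any key appearing more than once fans out the join output.
SELECT join_key, COUNT(*) AS cnt
FROM small_dim_table
GROUP BY join_key
HAVING COUNT(*) > 1
ORDER BY cnt DESC;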

Why did computed columns suddenly start to slow down performance?

Users were able to run reports before 10 am. After that, the same reports became very slow; sometimes users just didn't have the patience to wait. After some troubleshooting I found the column that was causing the delay. It was a computed column that uses a function to produce its result.
At approximately the same time I got another complaint about a slow-running report that had always worked fine. After some troubleshooting I found the column that was causing the delay:
where (Amount - PTD) <> 0
And again, the Amount column is a computed column.
So my questions are:
Why did computed columns that were always part of these reports suddenly start to slow down performance significantly, even when nobody is using the database?
What could have happened around 10 am?
And what is the disadvantage if I make those columns persisted?
Thank you
You don't provide a lot of detail here - so I can only answer in generalities.
So, in general - database performance tends to be determined by bottlenecks. A query might run fine on a table with 1 record, 10 records, 1,000 records, 100,000 records - and then at 100,001 records it suddenly gets slow. This is because you've exceeded some boundary in the system - for instance, the data doesn't fit in memory anymore.
It's really hard to identify those bottlenecks, and even harder to predict them - but keep an eye on perfmon and see what your CPU, disk I/O and memory stats are doing.
Computed columns are unlikely to be a problem in their own right - but using them in a WHERE clause (especially with another calculation on top) is likely to be slow if you don't have an index on that column. In your example, you might create another computed column for (Amount - PTD) and create an index on that column too.
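A sketch of that suggestion in T-SQL (the table name is hypothetical; whether to persist the column is exactly the trade-off asked about above):
-- Persist the difference as its own column and index it, so the WHERE clause
-- can seek on the index instead of evaluating the expression per row.
-- (Indexing/persisting requires the underlying expression to be deterministic.)
ALTER TABLE dbo.Ledger ADD AmountMinusPTD AS (Amount - PTD) PERSISTED;
CREATE INDEX IX_Ledger_AmountMinusPTD ON dbo.Ledger (AmountMinusPTD);
-- The report filter can then be written as: WHERE AmountMinusPTD <> 0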

Migrating data between databases is not improving performance

I have a PostgreSQL 9.2 database with 3 tables:
A - contains 12 million records
B - contains 24 million records
C - contains 20 million records
Tables are connected like:
A (one to many) B
B (one to zero/one) C
I have decided to archive/migrate older data to a 2nd database to speed up my main database (less data = better performance).
After I had migrated about 20% of the data from every table, I ran VACUUM ANALYZE on my main database's tables to clean up a little bit.
I thought the next 20% would be much faster to migrate... I was wrong. Each successive batch of data archives slower and slower...
I thought maybe VACUUM FULL is needed here, but I have read it is not recommended. What is more, it is a very slow operation and requires almost double the disk space (it rewrites the table and then drops the old one).
What can be the reason for slower processing despite there being less data left? Am I missing some step that could increase my database's speed after migration - some kind of clean-up other than VACUUM ANALYZE?
I should specify that I have measured the time of 3 processing steps: selecting the data to copy from the main database, inserting it into the 2nd database, and deleting it from the main database.
Selecting the data is the problem.
About archiving process:
I select the A table rows older than x days, copy them, and then remove them.
Then I select the B rows connected to the A rows selected before, copy them, and then remove them.
Last, I select the C rows connected to the B rows selected before, copy them, and then remove them.
Conf:
8GB RAM.
max_connections = 100
shared_buffers = 2GB
effective_cache_size = 6GB
work_mem = 32MB
maintenance_work_mem = 512MB
checkpoint_segments = 32
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 2.0
Try to figure out where the time is spent. Is it the SELECT to find the rows in B and C? Is it the DELETE?
Once you have found the problematic statement, look at the EXPLAIN (ANALYZE) output for it; it will tell you where the time is spent.
Deleting rows from a table does not make it smaller and does not necessarily speed up queries on the table. What may help is VACUUM (FULL), particularly if there are sequential scans. You don't have to run it on all tables in the database; if space is a problem, you can run it on one table after the other.
But first look at the execution plans to see if that will help at all.
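As a sketch of those two steps (the join column and the date filter are assumptions about the schema):
-- Time the problematic SELECT and see whether it uses indexes or seq scans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT b.*
FROM B b
JOIN A a ON a.id = b.a_id
WHERE a.created_at < now() - interval '90 days';
-- If sequential scans dominate, reclaim space one table at a time
-- (VACUUM FULL rewrites the table and needs roughly 2x its disk space).
VACUUM (FULL, ANALYZE) A;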

When is the right time to create Indexes in Oracle?

A brand new application with Oracle as the data store is about to be pushed to production. The database uses the CBO, and I have identified some columns to index. I am expecting the total number of records in a particular table to reach 4 million after 6 months. After that, very few records will be added, and there will not be any updates to the indexed columns; most of the updates will be on non-indexed columns.
Is it advisable to create the indexes now, or should I wait a couple of months?
If the table requires indexes, you will incur a lot of poor performance (full table scans + actual I/O) once the number of rows in the table goes beyond what can reasonably be kept in the cache. Assume that is 20,000 rows; we'll call it the magic number. You'll hit 20,000 rows within a week of production. After that, queries and updates on the table will grow progressively slower, on average, as more rows are added.
You are probably worried about the overhead of inserting new rows with indexed fields. That is a one-time hit. You are trading that against dozens of slow queries and updates if you delay adding the indexes.
The trade-off is largely in favor of adding the indexes right now - especially since we do not know what that magic number (20,000?) really is. It could be larger. Or smaller.
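A sketch of creating one up front (the table and column names are placeholders), plus a statistics refresh so the CBO sees the real row counts as the table grows:
-- Create the index before the data volume passes the in-cache threshold.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
-- Refresh optimizer statistics after bulk loads
-- (EXEC is SQL*Plus shorthand for a PL/SQL call).
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'ORDERS');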

governor limits with reports in SFDC

We have a business requirement to show a cost summary for all our projects in a single table.
In order to tabulate these costs we have to query through all the client tasks, regions, job roles, pay rates, cost tables, deliverables, efforts, and hour records (client tasks are in one table, tasks and regions are in one table, and deliverables, effort, and hours are stored as monthly totals).
Since I have to query all of this before I loop through everything, it hits a large number of script statements very quickly. Computationally it's like O(m * n * o * p), and some of our projects have all four variables growing very quickly. My estimates for how to do this have ranged from 90 million executed script statements to 600 billion.
Using Batch Apex we could break this up by task regions into 200 batches, but that would only reduce the computational profile to (600 B / 200) = 3 billion script statements per batch (well above the Salesforce limit).
We have been playing around with using Informatica to do these massive calculations, but we have several problems, including: (1) our end users cannot wait more than five or so minutes, but just transferring the data (90% of all records, if all the projects were updated at once) would take 15 minutes over Informatica or the web API; (2) we have noticed these massive calculations need to happen in several places (changing a deliverable forecast value, creating an initial forecast, etc.).
Is there a governor-limit workaround that will meet our requirements here (a massive volume of data, with a response in 5 or so minutes)? Is Force.com a good platform for us to use here?
This is the way I've been doing it for a similar calculation:
An ERD would help, but have you considered doing this in smaller pieces and with reports in salesforce instead of custom code?
By smaller pieces I mean, use roll-up summary fields to get some totals higher in your tree of objects.
Or use Apex triggers so that as hours are entered, cost * hours is calculated and placed onto the time record, and then rolled up to the deliverables.
Basically get your values calculated at the time the data is entered instead of having to run your calculations every time.
Then you can simply run a report that says show me all my projects and their total cost or total time because those total costs/times are stored/calculated already.
Roll-up summaries only work with master-detail relationships.
Triggers work anytime, but you'll want to account for insert and update as well as delete and undelete! Aggregate functions will be your friend, assuming that the trigger context has fewer than 50,000 records to aggregate - which I'd hope it does, because if there are more than 50,000 time entries for a single deliverable, that's a BIG deliverable :)
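As a sketch of the aggregate-function approach (the object and field names below are hypothetical), assuming the trigger has already stamped a cost onto each time record:
-- Aggregate SOQL: one query returns per-deliverable totals instead of
-- looping over every time entry in Apex.
SELECT Deliverable__c, SUM(Cost__c) totalCost, SUM(Hours__c) totalHours
FROM Time_Entry__c
GROUP BY Deliverable__c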
Hope that helps a bit?
