Should I run VACUUM ANALYZE together, or VACUUM and ANALYZE separately, after large deletes and updates? - greenplum

Assume large deletes and updates have been done on a GP table.
Should I run VACUUM ANALYZE together, or should I run VACUUM and ANALYZE separately? The GP best practice document says "Do not run vacuum analyze":
http://gpdb.docs.pivotal.io/4300/pdf/GPDB43BestPractices.pdf
In what order do the two steps run when I run VACUUM ANALYZE? Is there a best practice for the order in which to run VACUUM and ANALYZE? Why does Pivotal say not to run them together?

When a table is significantly bloated, it's safer to run ANALYZE before VACUUM, so that VACUUM can correct any inaccurate row count estimate that ANALYZE leaves behind.
Because ANALYZE uses block-level sampling, a table with a high ratio of blocks containing no valid rows can cause ANALYZE to set the reltuples column in the pg_class catalog table to an inaccurate value, leading to poorly optimized query plans. The VACUUM command produces a more accurate count and, when run after ANALYZE, will correct an inaccurate row count estimate.
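To make the sampling problem concrete, here is a rough arithmetic sketch in Python. It is not Greenplum's actual ANALYZE algorithm; it only assumes that the estimate is the average live-row count of the sampled blocks scaled up to the total number of blocks. The function name and all the numbers are made up for illustration.

```python
def estimate_reltuples(sampled_live_rows, total_blocks):
    """Scale the average live rows per sampled block up to the whole table.

    Simplified stand-in for block-level sampling, not Greenplum's real code.
    """
    if not sampled_live_rows:
        raise ValueError("need at least one sampled block")
    density = sum(sampled_live_rows) / len(sampled_live_rows)
    return round(density * total_blocks)


# Hypothetical bloated table: 1000 blocks, only 100 of them still hold
# rows (100 live rows each), so the true row count is 10_000.
TOTAL_BLOCKS = 1000
TRUE_ROWS = 100 * 100

# Three possible 10-block samples drawn from the same table.
lucky = [100] + [0] * 9            # hits 1 live block
over = [100, 100, 100] + [0] * 7   # hits 3 live blocks
under = [0] * 10                   # hits only empty blocks

for name, sample in (("lucky", lucky), ("over", over), ("under", under)):
    print(name, estimate_reltuples(sample, TOTAL_BLOCKS), "vs true", TRUE_ROWS)
```

With so few live blocks, the estimate swings between 0 and three times the real count depending on which blocks the sample happens to hit, which is the kind of error a later VACUUM would correct.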

Related

Oracle 12c full text index maintenance

We are planning to use a context index for full text search in Oracle 12c standard edition.
The data on which the search will run is JSON containing one channel post and its replies from a third-party tool, loaded into our database (basically, all the chats and replies, including other attributes such as timestamp and user, are stored in this table).
We expect about 50k rows of data per year and 100-150 DML operations per day. Our index is currently "SYNC ON COMMIT", so what are the recommendations for optimizing the Oracle Text index?
First, let me preface my response with a disclaimer: I am exploring using Oracle Text as part of a POC currently, and my knowledge is somewhat limited as we're still in the research phase. Additionally, our datasets are in the 10s of millions with 100k DML operations daily.
From what I've read, Oracle Docs suggest scheduling both a FULL and REBUILD optimization for indexes which incur DML, so I currently have implemented the following in our dev environment.
execute ctx_ddl.optimize_index('channel_json_ctx_idx', 'FULL'); --run daily
execute ctx_ddl.optimize_index('channel_json_ctx_idx', 'REBUILD'); --run weekly
I cannot imagine with the dataset you've identified that your index will really become that fragmented and cause performance issues. You could probably get away with less frequent optimizations than what I've mentioned.
You could even forgo scheduling the optimization and benchmark your performance. If you see it start to degrade, note the timespan and perhaps the count of DML operations for reference. Then run a 'FULL' and test performance. If performance improves, create a schedule. If performance does not improve, run a 'REBUILD' and test performance again. Assuming performance improves, you could schedule the 'REBUILD' for that time range and consider adding a more frequent 'FULL'.

Oracle SQL table and index statistics

I've been doing some reading on gathering table and index statistics on Oracle databases but it's left me ... confused.
For the sake of argument, let's assume Oracle 11gR2 as the RDBMS. Regarding gathering table and index statistics, when should it be done, which is the preferred way of doing it, and does Oracle really automatically gather the necessary statistics for us?
Regarding the first point: when should it be done. I've read that, as a rule of thumb, gathering table and index statistics should be done after around 10% of the table's records have been modified (inserted, updated, etc) since the last time the table was analyzed.
Regarding the second point: which is the preferred way of doing it. If we want to calculate both table and index statistics, does executing DBMS_STATS.GATHER_TABLE_STATS with default options, assuming the table is not partitioned, suffice?
Regarding the third point: does Oracle really gather the necessary statistics automatically for us? If this is the case, should I not worry about gathering table statistics (see points 1 and 2)?
Thanks in advance.
EDIT: Following the comment by ammoQ, I realized that the question is not clear about what the use case really is. My question is about tables that aren't "manipulated" via a user's actions, i.e. manually, but rather via procedures typically run by database jobs. Take my example, for instance. My ETL process loads several tables on a daily basis and it does so in approximately 1 hour. Of that 1 hour, about half is spent analyzing the tables themselves. Thus, the tables are analyzed daily, following insertions or updates. This seems overkill, hence the question.
In general, you need to have statistics that are representative (not necessarily accurate) and that give you the right execution plan. By default, Oracle will run a statistics collection job, during the nightly batch window. That may be fine for some applications, but if you have a data warehouse, which presumably includes a regular data load process, then managing the stats should be part of that process. Note that I have said "managing" and not "collecting" statistics. That's just my way of saying that there are other options for statistics in addition to just gathering statistics, although gathering statistics would be where I would start.
There are also things that can be done to optimize statistics gathering, incremental statistics for example. The other thing that is very important is to use the AUTO sample size when gathering stats. Do not specify a percentage, not even 100%. The reason is that auto sample size enables a number of internal optimizations and capabilities that are disabled if you do not use AUTO sample size.
So, taking your specific points:
1) 10% staleness is pretty random, and is just a number used by the auto stats.
2) dbms_stats.gather_table_stats() with default values is the preferred method. One parameter that I may change would be the DEGREE, to enable stats gathering in parallel.
3) In 12c, basic stats are gathered on load into an empty table (or empty partition). Stats are built on indexes when indexes are created. So to reiterate what I said above, stats gathering should be part of your ELT process.
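As a rough illustration of the 10% staleness figure from the first point, the check amounts to comparing the number of modified rows against a percentage of the row count recorded when statistics were last gathered. The sketch below is plain Python, not Oracle's implementation; the function name, the handling of an empty table, and the sample numbers are all assumptions.

```python
def is_stale(rows_at_last_gather, rows_modified, stale_percent=10.0):
    """Return True when modifications exceed stale_percent of the table.

    rows_modified counts inserts + updates + deletes since the last gather.
    """
    if rows_at_last_gather <= 0:
        # Assumption: with no rows recorded, any change counts as stale.
        return rows_modified > 0
    return rows_modified * 100.0 > stale_percent * rows_at_last_gather


# Hypothetical table that had 1,000,000 rows at the last gather.
print(is_stale(1_000_000, 50_000))    # 5% changed
print(is_stale(1_000_000, 150_000))   # 15% changed
```

A load job could use a check along these lines to skip the gather step for tables that barely changed, instead of re-analyzing every table on every run.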
I hope that makes sense and helps.

Hive find expected run time of query

I want to find the expected run time of a query in Hive. Using EXPLAIN gives the execution plan. Is there a way to find the expected time?
I need the Hive equivalent of the SQL query EXPLAIN COSTS.
There is no OOTB feature at this moment that facilitates this. One way to achieve this would be to learn from history. Gather patterns based on similar data and queries you have run previously and try to deduce some insights. You might find tools like Starfish helpful in the process.
I would not recommend deciding anything based on a subset of your data, as running queries on a small dataset and on the actual dataset are very different. This is good for testing functionality but not for any kind of cost approximation. The reason is that a lot of factors are involved, such as system resources (disk, CPU slots, network, etc.), system configuration, and other running jobs. You might find smooth operation on a small dataset, but as the data size increases all these factors start playing a much more important role. Even a small configuration parameter may play an important role. (You might have noticed that a Hive query sometimes runs fast initially but gradually slows down.) Also, execution of a Hive query is much more involved than a simple MR job.
See this JIRA, to get some idea, where they are talking about developing a Cost Based Query optimization for Joins in Hive. You might also find this helpful.
I think that is not possible, because internally MapReduce gets executed for any particular Hive query. Moreover, a MapReduce job's execution time depends on the cluster load and its configuration, so it is tough to predict the execution time. One thing you can do is start a timer before running the query and, once it finishes, calculate the exact time the execution took.
Maybe you could sample a small percentage of records from your table using partitions, buckets, etc., then run the query against the small dataset. Note the execution time and then multiply it by the factor (total_size/sample_size).
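The timer and the scaling factor from the last two answers can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, the timed call is a stand-in for a real Hive client call, and the extrapolation assumes run time grows linearly with data size, which the earlier answer warns often does not hold on a real cluster.

```python
import time


def timed(run_query):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = run_query()
    return result, time.perf_counter() - start


def extrapolate_runtime(sample_seconds, sample_size, total_size):
    """Scale a sample run linearly: sample_seconds * (total_size / sample_size)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return sample_seconds * (total_size / sample_size)


# Stand-in for submitting a Hive query; replace with a real client call.
_, elapsed = timed(lambda: sum(range(100_000)))
print(f"sample run took {elapsed:.4f}s")

# Hypothetical numbers: a 10 GB sample of a 1000 GB table took 42 seconds.
print(extrapolate_runtime(42.0, 10, 1000))
```

Treat the result as a very loose guess rather than a prediction, since cluster load and configuration are not part of the formula.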

Oracle select query performance

I am working on an application. It is in its initial stage so the number of records in the table is not large, but later on it will have around 1 million records in the same table.
I want to know what points I should consider while writing a SELECT query that will fetch a huge amount of data from the table, so that it does not slow down performance.
First rule:
Don't fetch huge amounts of data back to the application.
Unless you are going to display every single one of the items in the huge amount of data, do not fetch it. Communication between the DBMS and the application is (relatively) slow, so avoid it when possible. It isn't so slow that you shouldn't use the DBMS or anything like that, but if you can reduce the amount of data flowing between DBMS and application, the overall performance will usually improve.
Often, one easy way to do this is to list only those columns you actually need in the application, rather than using 'SELECT *' to retrieve all columns when you'll only use 4 of the 24 that exist.
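A small sketch of the column-list point, using Python's built-in sqlite3 module as a stand-in for Oracle (an assumption made only so the example runs anywhere). The table and column names are hypothetical; the idea is the same on any DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical 24-column table with 100 characters in every column.
cols = [f"c{i}" for i in range(24)]
conn.execute(f"CREATE TABLE items ({', '.join(c + ' TEXT' for c in cols)})")
conn.execute(
    f"INSERT INTO items VALUES ({', '.join('?' for _ in cols)})",
    ["x" * 100] * 24,
)

wide = conn.execute("SELECT * FROM items").fetchone()
narrow = conn.execute("SELECT c0, c1, c2, c3 FROM items").fetchone()

# Columns returned and total characters transferred for a single row.
print(len(wide), sum(len(v) for v in wide))
print(len(narrow), sum(len(v) for v in narrow))
```

For this one row, naming the four needed columns moves a sixth of the data that 'SELECT *' does, and the gap is multiplied by every row fetched.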
Second rule:
Try to ensure that the DBMS does not have to look at huge amounts of data.
To the extent possible, minimize the work that the DBMS has to do. It is busy, and typically it is busy on behalf of many people at any given time. If you can reduce the amount of work that the DBMS has to do to process your query, everyone will be happier.
Consider things like ensuring you have appropriate indexes on the table - not too few, not too many. Designed judiciously, indexes can greatly improve the performance of many queries. Always remember, though, that each index has to be maintained, so inserts, deletes and updates are slower when there are more indexes to manage on a given table.
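To see an index change how much data the DBMS has to look at, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is used as a stand-in so the example is runnable; Oracle's plan output looks different, and the table, index, and helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO posts (author, body) VALUES (?, ?)",
    [(f"user{i % 100}", "text") for i in range(1000)],
)

QUERY = "SELECT body FROM posts WHERE author = ?"


def plan(sql, params):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


before = plan(QUERY, ("user7",))
conn.execute("CREATE INDEX idx_posts_author ON posts (author)")
after = plan(QUERY, ("user7",))

print("without index:", before)
print("with index:   ", after)
```

Before the index exists the plan is a scan of the whole table; afterwards the plan names idx_posts_author, so only the matching rows are visited.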
(I should mention: none of this advice is specific to Oracle - you can apply it to any DBMS.)
To get good performance from a database there are a lot of things you need to keep in mind. First is the design, where you should primarily think about normalization and denormalization (split up tables, but not so much that performance-heavy joins are required).
There is often a lot of tuning to do when it comes to performance. However, about 80% of the performance is determined by the SQL code. Below are some links that might help you.
http://www.smart-soft.co.uk/Oracle/oracle-performance-tuning-part7.htm
http://www.orafaq.com/wiki/Oracle_database_Performance_Tuning_FAQ
A few points to remember:
Fetch only the columns you need to use on the client side.
Ensure you set up the correct indexes that are going to help you find records. These can be done later, but it is better to plan for them if you can.
Ensure you have properly accounted for column widths and data sizes. Don't use an INT when a TINYINT will hold all possible values. A row with 100 TINYINT fields will fetch faster than a row with 100 INT fields, and you'll also be able to fetch more rows per read.
Depending on how clean you need the data to be, it may be permissible to do a "dirty read", where the database fetches data while an update is in progress. This can speed things up significantly in some cases, though it means the data you get might not be the absolute latest.
Give your DBA beer. And hugs.
Jason

Oracle: Difference in execution plans between databases

I am comparing queries between my development and production databases.
They are both Oracle 9i, but almost every single query has a completely different execution plan depending on the database.
All tables/indexes are the same, but the dev database has about 1/10th the rows for each table.
On production, the query execution plan it picks for most queries is different from development, and the cost is sometimes 1000x higher. Queries on production also seem not to use the correct indexes in some cases (full table access).
I have run dbms_utility.analyze_schema on both databases recently as well, in the hope that the CBO would figure something out.
Is there some other underlying oracle configuration that could be causing this?
I am a developer mostly so this kind of DBA analysis is fairly confusing at first..
1) The first thing I would check is whether the database parameters are equivalent across Prod and Dev. If one of the parameters that affects the decisions of the Cost Based Optimizer is different, then all bets are off. You can see the parameters in the v$parameter view.
2) Having up to date object statistics is great but keep in mind the large difference you pointed out - Dev has 10% of the rows of Prod. This rowcount is factored into how the CBO decides the best way to execute a query. Given the large difference in row counts I would not expect plans to be the same.
Depending on the circumstances, the optimizer may choose to Full Table Scan a table with 20,000 rows (Dev) where it may decide an index is lower cost on the table that has 200,000 rows (Prod). (Numbers just for demonstration; the CBO uses costing algorithms for determining what to FTS and what to index scan, not absolute values.)
3) System statistics also factor into the explain plans. This is a set of statistics that represent CPU and disk i/o characteristics. If your hardware on both systems is different then I would expect your System Statistics to be different and this can affect the plans. Some good discussion from Jonathan Lewis here
You can view system stats via the sys.aux_stats$ view.
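For the parameter check in point 1, one low-tech approach is to dump name/value pairs from v$parameter on each database and diff them. The sketch below is plain Python under that assumption; the function name and the sample values are hypothetical, and fetching the rows from each database is left out.

```python
def diff_parameters(dev, prod):
    """Return {name: (dev_value, prod_value)} for every parameter that differs.

    A parameter present on only one side is reported with None for the other.
    """
    names = sorted(set(dev) | set(prod))
    return {
        name: (dev.get(name), prod.get(name))
        for name in names
        if dev.get(name) != prod.get(name)
    }


# Hypothetical values, as if fetched with
#   SELECT name, value FROM v$parameter
# on each database.
dev_params = {
    "optimizer_mode": "CHOOSE",
    "optimizer_index_cost_adj": "100",
    "db_file_multiblock_read_count": "16",
}
prod_params = {
    "optimizer_mode": "CHOOSE",
    "optimizer_index_cost_adj": "20",
    "db_file_multiblock_read_count": "8",
}

for name, (dev_value, prod_value) in diff_parameters(dev_params, prod_params).items():
    print(f"{name}: dev={dev_value} prod={prod_value}")
```

Any optimizer-related name that shows up in the diff is a candidate explanation for plans that differ beyond what the row counts alone would cause.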
Now I'm not sure why different plans are a bad thing for you... if stats are up to date and parameters set correctly you should be getting decent performance from either system no matter what the difference in size...
That said, it is possible to export statistics from your Prod system and load them into your Dev system. This makes your Prod statistics available to your Dev database.
Check the Oracle documentation for the DBMS_STATS package, specifically the EXPORT_SCHEMA_STATS, EXPORT_SYSTEM_STATS, IMPORT_SCHEMA_STATS, IMPORT_SYSTEM_STATS procedures. Keep in mind you may need to disable the 10pm nightly statistics jobs on 10g/11g... or you can investigate Locking statistics after import so they are not updated by nightly jobs.
