Cost of a query: dependent on or independent of the amount of data? - performance

Could you please tell me whether the cost of a query depends on the amount of data in the database at the time it runs?
That is, does the cost vary as the amount of data varies?
Thanks,
Savitha

The answer is yes: the data size will influence the query execution plan, which is why you must test your queries with realistic amounts of data (and, if possible, realistic data itself, since the distribution of the data also matters and will influence the query cost).

Every database management system is different in some respect, and what works well for Oracle, MS SQL or PostgreSQL may not work well for MySQL, and the other way around. Even storage engines have very important differences which can affect performance dramatically.
Of course, a mass of data will slow down the process. When you fire a query, it has to traverse and search the database, and with more data that takes more time. The three main issues you should be concerned about when dealing with very large data sets are buffers, indexes and joins.
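As a rough, hedged illustration (the table, column and index names here are invented), you can compare the plan the optimizer picks on a near-empty table with the plan it picks once the table is realistically loaded:

-- MySQL example: inspect the chosen plan, then rerun after loading realistic volumes.
EXPLAIN SELECT customer_id, SUM(total)
FROM orders
WHERE created_at >= '2011-01-01'
GROUP BY customer_id;
-- With a handful of rows the optimizer may happily full-scan the table;
-- with millions of rows (and an index on created_at) it may switch to a range scan.

The plans, and therefore the estimated costs, will usually differ between the two runs, which is exactly why testing against realistic volumes matters.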

Related

How many table partitions is too many in Postgres?

I'm partitioning a very large table that contains temporal data, and considering what granularity to use for the partitions. The Postgres partition documentation claims that "large numbers of partitions are likely to increase query planning time considerably" and recommends that partitioning be used with "up to perhaps a hundred" partitions.
Assuming my table holds ten years of data, if I partitioned by week I would end up with over 500 partitions. Before I rule this out, I'd like to better understand what impact partition quantity has on query planning time. Has anyone benchmarked this, or does anyone have an understanding of how this works internally?
The query planner has to do a linear search of the constraint information for every partition of tables used in the query, to figure out which are actually involved--the ones that can have rows needed for the data requested. The number of query plans the planner considers grows exponentially as you join more tables. So the exact spot where that linear search adds up to enough time to be troubling really depends on query complexity. The more joins, the worse you will get hit by this. The "up to a hundred" figure came from noting that query planning time was adding up to a non-trivial amount of time even on simpler queries around that point. On web applications in particular, where latency of response time is important, that's a problem; thus the warning.
Can you support 500? Sure. But you are going to be searching every one of 500 check constraints for every query plan involving that table considered by the optimizer. If query planning time isn't a concern for you, then maybe you don't care. But most sites end up disliking the proportion of time spent on query planning with that many partitions, which is one reason why monthly partitioning is the standard for most data sets. You can easily store 10 years of data, partitioned monthly, before you start crossing over into where planning overhead starts to be noticeable.
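To make the constraint-checking concrete, here is a hedged sketch of the inheritance-based monthly partitioning the documentation describes (all names and the two example months are invented):

-- Parent table holding the temporal data; illustrative schema only.
CREATE TABLE measurements (
    id         bigserial,
    logged_at  timestamptz NOT NULL,
    value      numeric
);

-- One child table per month, each carrying a CHECK constraint on its time range.
CREATE TABLE measurements_2011_01 (
    CHECK (logged_at >= '2011-01-01' AND logged_at < '2011-02-01')
) INHERITS (measurements);

CREATE TABLE measurements_2011_02 (
    CHECK (logged_at >= '2011-02-01' AND logged_at < '2011-03-01')
) INHERITS (measurements);

-- With constraint exclusion enabled, the planner walks every child's CHECK constraint
-- (the linear search described above) to prune the partitions it does not need.
SET constraint_exclusion = partition;
EXPLAIN SELECT * FROM measurements WHERE logged_at >= '2011-02-10';

With 120 monthly children that walk is cheap; with 500+ weekly children it becomes the planning-time overhead being warned about.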
"large numbers of partitions are likely to increase query planning time considerably" and recommends that partitioning be used with "up to perhaps a hundred" partitions.
Because every extra partition will usually be tied to check constraints, and this will lead the planner to wonder which of the partitions need to be queried against. In a best case scenario, the planner identifies that you're only hitting a single partition and gets rid of the append step altogether.
In terms of rows, and as DNS and Seth have pointed out, your mileage will vary with the hardware. Generally speaking, though, there's no significant difference between querying a 1M row table and a 10M row table -- especially if your hard drives allow for fast random access and if it's clustered (see the cluster statement) using the index that you're most frequently hitting.
Each table partition takes up an inode on the file system. "Very large" is a relative term that depends on the performance characteristics of your file system of choice. If you want explicit performance benchmarks, you could probably look at various performance benchmarks of mail systems from your OS and FS of choice. Generally speaking, I wouldn't worry about it until you get into the tens of thousands to hundreds of thousands of table spaces (using dirhash on FreeBSD's UFS2 would be a win). Also note that this same limitation applies to databases, tables and any other filesystem-backed database object in PostgreSQL.
If you don't want to trust the PostgreSQL developers who wrote the code, then I recommend that you simply try it yourself and run a few example queries with explain analyze and time them using different partition schemes. Your specific hardware and software configuration is likely to dominate any answer in any case.
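For instance (continuing the illustrative schema above, and purely as a sketch), in psql you could time the same query under each candidate partition layout:

\timing
EXPLAIN ANALYZE SELECT count(*) FROM measurements WHERE logged_at >= '2011-02-10';
-- EXPLAIN ANALYZE reports the execution time; the \timing figure also includes
-- planning and the round trip, so comparing the two gives a rough feel for planning overhead.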
I'm assuming that the row optimization cache which the query optimizer uses to determine what joins and restrictions to use is stored with each partition, so it probably needs to load and read parts of each partition to plan the query.

Why does my cost grow when I use Oracle statistics?

I have a big query over four tables and I want to optimize it.
The weird part is that when I get the execution plan without statistics, it says something like 1.2M. However, if I gather statistics for one of the tables involved in the query, my cost drops to 4k. But if I gather statistics on the other tables as well, the cost grows to 50k, so I am not sure what's happening.
Can anyone explain a reason why giving more statistics actually increases query cost?
The Cost Based Optimiser uses as much information as you can give it in order to calculate the cost of a plan. If you update (i.e. change) the statistics it uses, then obviously that will change the calculated cost of the plan.
It's not actually the gathering of stats that causes the cost to grow - it's how those stats have changed (whether up or down) that causes the calculated cost to change.
In the absence of statistics, Oracle may use heuristics, guesswork or a quick sample of the data (depending on the settings in your instance).
Generally, the better (more accurate or representative) the statistics, the more accurate the cost calculation.
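As an illustrative (not authoritative) sequence, with the schema and table names as placeholders, you can gather statistics for one table and then re-read the cost column of the plan:

-- Gather fresh statistics for one table involved in the query.
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'APP', tabname => 'ORDERS', cascade => TRUE);

-- Re-explain the query and compare the COST column with the previous run.
EXPLAIN PLAN FOR
  SELECT o.order_id FROM app.orders o JOIN app.customers c ON c.customer_id = o.customer_id;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

Repeat after gathering statistics on the remaining tables; each run is costed against whatever statistics exist at that moment, which is why the number moves around.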
The cost based optimizer has its challenges. There are rounding errors that can have quite an impact on the decisions it makes. This is one of the reasons that SQL Plan Stability, introduced in 11g, is so nice. Forget about 10g if you can, or prepare for long debugging sessions.
On first use, a plan is generated based on the current statistics and executed. If the SQL is repeated, the statement and its plan are stored in a baseline. In the maintenance window, the most expensive plans are re-evaluated and in many cases a better plan can be provided. This is possible because at runtime the optimizer is limited in the time it is given to search for a plan; in the maintenance window, a lot more time can be spent finding the best plan.
In 11g, bind variable peeking is also fixed, and a single SQL statement can now have multiple plans, based on the values of the bind variables.
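If you want to see what plan management has captured (assuming 11g with baseline capture enabled; the LIKE filter is only an example), a query along these lines shows the stored plans:

-- Illustrative only: list captured plan baselines for statements touching ORDERS.
SELECT sql_handle, plan_name, enabled, accepted
FROM   dba_sql_plan_baselines
WHERE  sql_text LIKE '%ORDERS%';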
The query cost is based on many factors, where IO is a very important factor.
How are your tables filled and where are the high water marks located? A table that is constantly filled and emptied can have its high water mark far beyond the data it currently holds, so full scans still read all those (now empty) blocks.
There are lots of bugs in the optimizer, and lots of options controlled by hidden parameters. You could try to use them to tweak the behaviour. Upgrading to 11g might be a lot smarter, as it solves a lot of performance problems for many applications.

Doing analytical queries on large dynamic sets of data

I have a requirement where I have large sets of incoming data into a system I own.
A single unit of data in this set has a set of immutable attributes + state attached to it. The state is dynamic and can change at any time.
The requirements are as follows -
Large sets of data can experience state changes. Updates need to be fast.
I should be able to aggregate data pivoted on various attributes.
Ideally - there should be a way to correlate individual data units to aggregated results, i.e. I want to drill down into the specific transactions that produced a certain aggregation.
(I am aware of the race conditions here, like the state of a data unit changing after an aggregation is performed ; but this is expected).
All aggregations are time based - i.e. sum of x on pivot y over a day, 2 days, week, month etc.
I am evaluating different technologies to meet these use cases, and would like to hear your suggestions. I have taken a look at Hive/Pig which fit the analytics/aggregation use case. However, I am concerned about the large bursts of updates that can come into the system at any time. I am not sure how this performs on HDFS files when compared to an indexed database (sql or nosql).
You'll probably arrive at the optimal solution only by stress testing actual scenarios in your environment, but here are some suggestions. First, if write speed is a bottleneck, it might make sense to write the changing state to an append-only store, separate from the immutable data, then join the data again for queries. Append-only writing (e.g., like log files) will be faster than updating existing records, primarily because it minimizes disk seeks. This strategy can also help with the problem of data changing underneath you during queries. You can query against a "snapshot" in time. For example, HBase keeps several timestamped updates to a record. (The number is configurable.)
This is a special case of the persistence strategy called Multiversion Concurrency Control - MVCC. Based on your description, MVCC is probably the most important underlying strategy for you to perform queries for a moment in time and get consistent state information returned, even while updates are happening simultaneously.
Of course, doing joins over split data like this will slow down query performance. So, if query performance is more important, then consider writing whole records where the immutable data is repeated along with the changing state. That will consume more space, as a tradeoff.
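A minimal relational sketch of that split (every name here is invented for illustration):

-- Immutable attributes live in one table...
CREATE TABLE unit (
    unit_id  BIGINT PRIMARY KEY,
    attr_a   VARCHAR(50),
    attr_b   VARCHAR(50)
);

-- ...while every state change is appended, never updated in place.
CREATE TABLE unit_state (
    unit_id     BIGINT,
    state       VARCHAR(20),
    changed_at  TIMESTAMP
);

-- Read the state "as of" a chosen moment by taking the latest change per unit,
-- which is what gives you a consistent snapshot while new changes keep arriving.
SELECT u.unit_id, u.attr_a, s.state
FROM unit u
JOIN unit_state s ON s.unit_id = u.unit_id
WHERE s.changed_at = (SELECT MAX(changed_at)
                      FROM unit_state
                      WHERE unit_id = u.unit_id
                        AND changed_at <= '2011-06-01 00:00:00');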
You might consider looking at Flexviews. It supports creating incrementally refreshable materialized views for MySQL. A materialized view is like a snapshot of a query that is updated periodically with the data which has changed. You can use materialized views to summarize on multiple attributes in different summary tables and keep these views transactionally consistent with each other. You can find some slides describing the functionality on slideshare.net
There is also Shard-Query which can be used in combination with InnoDB and MySQL partitioning, as well as supporting spreading data over many machines. This will satisfy both high update rates and will provide query parallelism for fast aggregation.
Of course, you can combine the two together.

Oracle select query performance

I am working on an application. It is in its initial stage, so the number of records in the table is not large, but later on it will have around 1 million records in the same table.
I want to know what points I should consider while writing a select query that will fetch a huge amount of data from the table, so it does not slow down performance.
First rule:
Don't fetch huge amounts of data back to the application.
Unless you are going to display every single one of the items in the huge amount of data, do not fetch it. Communication between the DBMS and the application is (relatively) slow, so avoid it when possible. It isn't so slow that you shouldn't use the DBMS or anything like that, but if you can reduce the amount of data flowing between DBMS and application, the overall performance will usually improve.
Often, one easy way to do this is to list only those columns you actually need in the application, rather than using 'SELECT *' to retrieve all columns when you'll only use 4 of the 24 that exist.
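For example (column names invented), prefer something like

-- Fetch only what the application will actually display...
SELECT order_id, customer_name, order_total, order_date FROM orders;

over SELECT * FROM orders, which drags every column across the wire whether you use it or not.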
Second rule:
Try to ensure that the DBMS does not have to look at huge amounts of data.
To the extent possible, minimize the work that the DBMS has to do. It is busy, and typically it is busy on behalf of many people at any given time. If you can reduce the amount of work that the DBMS has to do to process your query, everyone will be happier.
Consider things like ensuring you have appropriate indexes on the table - not too few, not too many. Designed judiciously, indexes can greatly improve the performance of many queries. Always remember, though, that each index has to be maintained, so inserts, deletes and updates are slower when there are more indexes to manage on a given table.
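As a hedged example (names invented), a single index matching a frequent filter lets the DBMS touch far less data:

-- One well-chosen index supporting a common lookup pattern...
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);
-- ...at the price of slightly slower inserts, updates and deletes on this table.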
(I should mention: none of this advice is specific to Oracle - you can apply it to any DBMS.)
To get good performance from a database there are a lot of things you need to keep in mind. First comes the design, where you should think primarily about normalization and denormalization (split up tables, but not so much that performance-heavy joins are required).
There is often a lot of tuning to be done when it comes to performance. However, 80% of the performance is determined by the SQL code. Below are some links that might help you.
http://www.smart-soft.co.uk/Oracle/oracle-performance-tuning-part7.htm
http://www.orafaq.com/wiki/Oracle_database_Performance_Tuning_FAQ
A few points to remember:
Fetch only the columns you need to use on the client side.
Ensure you set up the correct indexes that are going to help you find records. These can be done later, but it is better to plan for them if you can.
Ensure you have properly accounted for column widths and data sizes. Don't use an INT when a TINYINT will hold all possible values. A row with 100 TINYINT fields will fetch faster than a row with 100 INT fields, and you'll also be able to fetch more rows per read.
Depending on how clean you need the data to be, it may be permissible to do a "dirty read", where the database fetches data while an update is in progress. This can speed things up significantly in some cases, though it means the data you get might not be the absolute latest (see the sketch at the end of this answer).
Give your DBA beer. And hugs.
Jason
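A hedged sketch of asking for the dirty read mentioned above (syntax and support vary by DBMS - Oracle, for one, does not offer READ UNCOMMITTED; names are invented):

-- Request the READ UNCOMMITTED isolation level, allowing reads of uncommitted ("dirty") data.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT order_id, order_total FROM orders WHERE order_date >= '2011-01-01';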

DB Index speed vs caching

We have about 10K rows in a table. We want to have a form where we have a select drop down that contains distinct values of a given column in this table. We have an index on the column in question.
To increase performance I created a little cache table that contains the distinct values so we didn't need to do a select distinct field from table against 10K rows. Surprisingly it seems doing select * from cachetable (10 rows) is no faster than doing the select distinct against 10K rows. Why is this? Is the index doing all the work? At what number of rows in our main table will there be a performance improvement by querying the cache table?
For a DB, 10K rows is nothing. You're not seeing much difference because the actual calculation time is minimal, with most of it consumed by other, constant, overhead.
It's difficult to predict when you'd start noticing a difference, but it would probably be at around a million rows.
If you've already set up caching and it's not detrimental, you may as well leave it in.
10k rows is not much... start caring when you reach 500k ~ 1 million rows.
Indexes do a great job, especially if you have only 10 different values for that indexed column.
This depends on numerous factors - the amount of memory your DB has, the size of the rows in the table, use of a parameterised query and so forth, but generally 10K is not a lot of rows and particularly if the table is well indexed then it's not going to cause any modern RDBMS any sweat at all.
As a rule of thumb I would generally only start paying close attention to performance issues on a table when it passes the 100K rows mark, and 500K doesn't usually cause much of a problem if the table is indexed correctly and accessed via those indexes. Performance usually tends to fall off catastrophically on large tables - you may be fine on 500K rows but crawling on 600K - but you have a long way to go before you are at all likely to hit such problems.
Is the index doing all the work?
You can tell how the query is being executed by viewing the execution plan.
For example, try this:
explain plan for select distinct field from table;
select * from table(dbms_xplan.display);
I notice that you didn't include an ORDER BY on that. If you do not include ORDER BY then the order of the result set may be random, particularly if Oracle uses the HASH algorithm for making a distinct list. You ought to check that.
So I'd look at the execution plans for the original query that you think is using an index, and at the one based on the cache table. Maybe post them and we can comment on what's really going on.
Incidentally, the cache table would usually be implemented as a materialised view, particularly if the master table is generally pretty static.
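If you go the materialised view route, a hedged Oracle-flavoured sketch (the table name big_table and the refresh policy are assumptions) could look like:

-- Precompute the distinct values and refresh the snapshot on demand (or on a schedule).
CREATE MATERIALIZED VIEW mv_distinct_field
  REFRESH COMPLETE ON DEMAND
AS SELECT DISTINCT field FROM big_table;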
Serious premature optimization. Just let the database do its job, maybe with some tweaking to the configuration (especially if it's MySQL, which has several cache types and settings).
Your query on 10K rows most probably uses HASH SORT UNIQUE.
As 10K rows most probably fit into db_buffers and hash_area_size, all operations are performed in memory, and you won't notice any difference.
But if the query will be used as a part of a more complex query, or will be swapped out by other data, you may need disk I/O to access the data, which will slow your query down.
Run your query in a loop in several sessions (as many sessions as there will be users connected), and see how it performs in that case.
For future plans and for scalability, you may want to look into an indexing service that uses pure memory or something faster than the TCP DB round-trip. A lot of people (including myself) use Lucene to achieve this by normalizing the data into flat files.
Lucene has a built-in Ram Drive directory indexer, which can build the index all in memory - removing the dependency on the file system, and greatly increasing speed.
Lately, I've architected systems that have a single RAM drive index wrapped by a web service. Then I have my Ajax-like dropdowns query that web service for high availability and high speed - no db layer, no file system, just pure memory and, if remote, only TCP packet speed.
If you have an index on the column, then all the values are in the index and the DBMS never has to look in the table. It just looks in the index, which has only 10 entries. If this is mostly read-only data, then cache it in memory. Caching helps scalability a lot by relieving the database of work. A query that is quick on a database with no users might perform poorly if 30 queries are going on at the same time.
