I am working on a application. It is in its initial stage so the number of records in table is not large, but later on it will have around 1 million records in the same table.
I want to know what points I should consider while writing select query which will fetch a huge amount of data from table so it does not slow down performance.
First rule:
Don't fetch huge amounts of data back to the application.
Unless you are going to display every single one of the items in the huge amount of data, do not fetch it. Communication between the DBMS and the application is (relatively) slow, so avoid it when possible. It isn't so slow that you shouldn't use the DBMS or anything like that, but if you can reduce the amount of data flowing between DBMS and application, the overall performance will usually improve.
Often, one easy way to do this is to list only those columns you actually need in the application, rather than using 'SELECT *' to retrieve all columns when you'll only use 4 of the 24 that exist.
Second rule:
Try to ensure that the DBMS does not have to look at huge amounts of data.
To the extent possible, minimize the work that the DBMS has to do. It is busy, and typically it is busy on behalf of many people at any given time. If you can reduce the amount of work that the DBMS has to do to process your query, everyone will be happier.
Consider things like ensuring you have appropriate indexes on the table - not too few, not too many. Designed judiciously, indexes can greatly improve the performance of many queries. Always remember, though, that each index has to be maintained, so inserts, deletes and updates are slower when there are more indexes to manage on a given table.
(I should mention: none of this advice is specific to Oracle - you can apply it to any DBMS.)
To get good performance with a database there is a lot of things you need to have in mind. At first, it is the design, and here you should primary think about normalization and denormalization (split up tables but still not as much as performance heavy joins are required).
There are often a big bunch of tuning when it comes to performance. However, 80% of the performance is determined from the SQL-code. Below are some links that might help you.
http://www.smart-soft.co.uk/Oracle/oracle-performance-tuning-part7.htm
http://www.orafaq.com/wiki/Oracle_database_Performance_Tuning_FAQ
A few points to remember:
Fetch only the columns you need to use on the client side.
Ensure you set up the correct indexes that are going to help you find records. These can be done later, but it is better to plan for them if you can.
Ensure you have properly accounted for column widths and data sizes. Don't use an INT when a TINYINT will hold all possible values. A row with 100 TINYINT fields will fetch faster than a row with 100 INT fields, and you'll also be able to fetch more rows per read.
Depending on how clean you need the data to be, it may be permissable to do a "dirty read", where the database fetches data while an update is in progress. This can speed things up significantly in some cases, though it means the data you get might not be the absolute latest.
Give your DBA beer. And hugs.
Jason
Related
I am creating a database that will store 100.000 (and probably more in the future) users. While this obviously happens in a table with 1 row per user, every user can (and will) store hundreds of items. In programming language this would mean the user has 2 arrays (or one 2-dimensional array) of integers: a column for the itemid's and a column for the amounts.
My instincts tell me to create a table to hold all these items, with rows like (userid, itemid, amount). However this would result in a huge table. 200.000 users with 250 items each... that's 50 million entries in one table. This, plus the fact that the table will undergo continuous and rapid change, frightens me. (How rapid? I estimate up to 100 modifications per second.)
Typically there will be anywhere between 100 and 2000 users, all adding and removing items, and modifying amounts. These actions can and will happen in programming code. It would go as follows:
User starts session, program loads all the users items from the database
User modifies the item list
Every few minutes, the changes are saved into the database
When the user ends the session, it is also saved into the database
It is worth noting that there is a maximum to the number of items a user can store.
Are there any alternatives to using a separate table? Perhaps save the values in a formatted text string? Or is this one of the instances where using a MySQL database is actually a Bad Idea™?
Thank you for your time and insights.
My instincts tell me to create a table to hold all these items
Your instincts are right.
1) avoid premature optimisation
2) don't break the rules of normalization unless you've got a very good and real reason to do so
3) why do you suspect that the multi-table approach will be faster?
that's 50 million entries in one table
So what? Even if you only have an index on userid, the difference in performance compared with a single table per user will not be noticeably slower (in practice, with 200,000 users, it will be much, much faster - since the DBMS can comfortably keep an open file handle for each table!).
I estimate up to 100 modifications per second
Should be possible using MySQL and fairly basic hardware, but if it were me, and I wanted a bit of headroom, I'd go with a pair of mirrored SATA disks, tables on one mirror, indexes on the other.
The only issue I'd be concerned about (which applies regardless of which of the 2 models you choose) is supporting 2000 concurrent connections. Do the connections have to be concurrent? Or can each user download a working set (optionally using an optimistic locking strategy) and close off the connection, then push back the changes on a new connection? If not, then you'll probably want a good whack of memory and CPU.
But leaving aside whether to use one big table or lots of little ones, if this is the only use for the data, and access is not concurrent to particular data items, then why bother with a relational database at all? NoSQL or a shared filesystem might work just as well.
Putting data into one field as a array is alwmost always a mistake. It makes querying the data much harder and much more timeconsuming as well as much less likely to use indexes. It is ok, if the values were just text where you would never need to find one or more elements fo the array but it is my experience that this situation is rarely encountered. Modern databases can handle 50 million records without even breaking a sweat. That's a small table in daatbase terms.
It should be OK to do it as you described using two tables. The database should be able to handle millions of records.
The important points to look at:
1- Optimize your queries as much as possible.
2- Create the appropriate index(es) to speed up your queries.
3- Use InnoDB if you have concurrent read/update operations as it supports row-level locking as opposed to MyISAM.
4- Provide good hardware to support the database server.
5- Run the database server on a dedicated server if affordable.
I have an app using an Oracle 11g database. I have a fairly large table (~50k rows) which I query thus:
SELECT omg, ponies FROM table WHERE x = 4
Field x was not indexed, I discovered. This query happens a lot, but the thing is that the performance wasn't too bad. Adding an index on x did make the queries approximately twice as fast, which is far less than I expected. On, say, MySQL, it would've made the query ten times faster, at the very least. (Edit: I did test this on MySQL, and there saw a huge difference.)
I'm suspecting Oracle adds some kind of automatic index when it detects that I query a non-indexed field often. Am I correct? I can find nothing even implying this in the docs.
As has already been indicated, Oracle11g does NOT dynamically build indexes based on prior experience. It is certainly possible and indeed happens often that adding an index under the right conditions will produce the order of magnitude improvement you note.
But as has also already been noted, 50K (seemingly short?) rows is nothing to Oracle. The Oracle database in fact has a great deal of intelligence that allows it to scan data without indexes most efficiently. Every new release of the Oracle RDBMS gets better at moving large amounts of data. I would suggest to you that the reason Oracle was so close to its "best" timing even without the index as compared to MySQL is that Oracle is just a more intelligent database under the covers.
However, the Oracle RDBMS does have many features that touch upon the subject area you have opened. For example:
10g introduced a feature called AUTOMATIC SQL TUNING which is exposed via a gui known as the SQL TUNING ADVISOR. This feature is intended to analyze queries on its own, in depth and includes the ability to do WHAT-IF analysis of alternative query plans. This includes simulation of indexes which do not actually exist. However, this would not explain any performance differences you have seen because the feature needs to be turned on and it does not actually build any indexes, it only makes recommendations for the DBA to make indexes, among other things.
11g includes AUTOMATIC STATISTICS GATHERING which when enabled will automatically collect statistics on database objects as it deems necessary based on activity on those objects.
Thus the Oracle RDBMS is doing what you have suggested, dynamically altering its environment on its own based on its experience with your workload over time in order to improve performance. Creating indexes on the fly is just not one of the things is does yet. As an aside, this has been hinted to by Oracle in private sevearl times so I figure it is in the works for some future release.
Does Oracle 11g automatically index fields frequently used for full table scans?
No.
In regards the MySQL issue, what storage engine you use can make a difference.
"MyISAM relies on the operating system for caching reads and writes to the data rows while InnoDB does this within the engine itself"
Oracle will cache the table/data rows, so it won't need to hit the disk. depending on the OS and hardware, there's a chance that MySQL MyISAM had to physically read the data off the disk each time.
~50K rows, depending greatly on how big each row is, could conceivably be stored in under 1000 blocks, which could be quickly read into the buffer cache by a full table scan (FTS) in under 50 multi-block reads.
Adding appropriate index(es) will allow queries on the table to scale smoothly as the data volume and/or access frequency goes up.
"Adding an index on x did make the
queries approximately twice as fast,
which is far less than I expected. On,
say, MySQL, it would've made the query
ten times faster, at the very least."
How many distinct values of X are there? Are they clustered in one part of the table or spread evenly throughout it?
Indexes are not some voodoo device: they must obey the laws of physics.
edit
"Duplicates could appear, but as it
is, there are none."
If that column has neither a unique constraint nor a unique index the optimizer will choose an execution path on the basis that there could be duplicate values in that column. This is the value of declaring the data model as accuratley as possible: the provision of metadata to the optimizer. Keeping the statistics up to date is also very useful in this regard.
You should have a look at the estimated execution plan for your query, before and after the index has been created. (Also, make sure that the statistics are up-to-date on your table.) That will tell you what exactly is happening and why performance is what it is.
50k rows is not that big of a table, so I wouldn't be surprised if the performance was decent even without the index. Thus adding the index to equation can't really bring much improvement to query execution speed.
Oracle seems to have a reputation for being more scalable than other RDBMSes. After working with it a bit, I can say that it's more complex than other RDBMSes, but I haven't really seen anything that makes it more scalable than other RDBMSes. But then again, I haven't really worked on it in a whole lot of depth.
What features does Oracle have that are more scalable?
Oracle's RAC architecture is what makes it scalable where it can load balance across nodes and parallel queries can be split up and pushed to other nodes for processing.
Some of the tricks like loading blocks from another node's buffer cache instead of going to disc make performance a lot more scalable.
Also, the maintainability of RAC with rolling upgrades help make the operation of a large system more sane.
There is also a different aspect of scalability - storage scalability. ASM makes increasing the storage capacity very straightforward. A well designed ASM based solution, should scale past the 100s of terabyte size without needing to do anything very special.
Whether these make Oracle more scalable than other RDBMSs, I don't know. But I think I would feel less happy about trying to scale up a non-Oracle database.
Cursor sharing is (or was) a big advantage over the competition.
Basically, the same query plan is used for matching queries. An application will have a standard set of queries it issue (eg get the orders for this customer id). The simple way is to treat every query individually, so if you see 'SELECT * FROM ORDERS WHERE CUSTOMER_ID = :b1', you look at whether table ORDERS has an index on CUSTOMER_ID etc. As a result, you can spend as much work looking up meta data to get a query plan as actually retrieving the data. With simple keyed lookups, a query plan is easy. Complex queries with multiple tables joined on skewed columns are harder.
Oracle has a cache of query plans, and older/less used plans are aged out as new ones are required.
If you don't cache query plans, there's a limit to how smart you can make your optimizer as the more smarts you code into it, the bigger impact you have on each query processed. Caching queries means you only incur that overhead the first time you see the query.
The 'downside' is that for cursor sharing to be effective you need to use bind variables. Some programmers don't realise that and write code that doesn't get shared and then complain that Oracle isn't as fast as mySQL.
Another advantage of Oracle is the UNDO log. As a change is done, the 'old version' of the data is written to an undo log. Other database keep old versions of the record in the same place as the record. This requires VACUUM style cleanup operations or you bump into space and organisation issues. This is most relevant in databases with high update or delete activity.
Also Oracle doesn't have a central lock registry. A lock bit is stored on each individual data record. SELECT doesn't take a lock. In databases where SELECT locks, you could have multiple users reading data and locking each other or preventing updates, introducing scalability limits. Other databases would lock a record when a SELECT was done to ensure that no-one else could change that data item (so it would be consistent if the same query or transaction looked at the table again). Oracle uses UNDO for its read consistency model (ie looking up the data as it appeared at a specific point in time).
Tom Kyte's "Expert Oracle Database Architecture" from Apress does a good job of describing Oracle's architecture, with some comparisons with other rDBMSs. Worth reading.
We have about 10K rows in a table. We want to have a form where we have a select drop down that contains distinct values of a given column in this table. We have an index on the column in question.
To increase performance I created a little cache table that contains the distinct values so we didn't need to do a select distinct field from table against 10K rows. Surprisingly it seems doing select * from cachetable (10 rows) is no faster than doing the select distinct against 10K rows. Why is this? Is the index doing all the work? At what number of rows in our main table will there be a performance improvement by querying the cache table?
For a DB, 10K rows is nothing. You're not seeing much difference because the actual calculation time is minimal, with most of it consumed by other, constant, overhead.
It's difficult to predict when you'd start noticing a difference, but it would probably be at around a million rows.
If you've already set up caching and it's not detrimental, you may as well leave it in.
10k rows is not much... start caring when you reach 500k ~ 1 million rows.
Indexes do a great job, specially if you just have 10 different values for that index.
This depends on numerous factors - the amount of memory your DB has, the size of the rows in the table, use of a parameterised query and so forth, but generally 10K is not a lot of rows and particularly if the table is well indexed then it's not going to cause any modern RDBMS any sweat at all.
As a rule of thumb I would generally only start paying close attention to performance issues on a table when it passes the 100K rows mark, and 500K doesn't usually cause much of a problem if indexed correctly and accessed by such. Performance usually tends to fall off catastrophically on large tables - you may be fine on 500K rows but crawling on 600K - but you have a long way to go before you are at all likely to hit such problems.
Is the index doing all the work?
You can tell how the query is being executed by viewing the execution plan.
For example, try this:
explain plan for select distinct field from table;
select * from table(dbms_xplan.display);
I notice that you didn't include an ORDER BY on that. If you do not include ORDER BY then the order of the result set may be random, particularly if oracle uses the HASH algorithm for making a distinct list. You ought to check that.
So I'd look at the execution plans for the original query that you think is using an index, and at the one based on the cache table. Maybe post them and we can comment on what's really going on.
Incidentaly, the cache table would usually be implemented as a materialised view, particularly if the master table is generally pretty static.
Serious premature optimization. Just let the database do its job, maybe with some tweaking to the configuration (especially if it's MySQL, which has several cache types and settings).
Your query in 10K rows most probably uses HASH SORT UNIQUE.
As 10K most probably fit into db_buffers and hash_area_size, all operations are performed in memory, and you won't note any difference.
But if the query will be used as a part of a more complex query, or will be swapped out by other data, you may need disk I/O to access the data, which will slow your query down.
Run your query in a loop in several sessions (as many sessions as there will be users connected), and see how it performs in that case.
For future plans and for scalability, you may want to look into an indexing service that uses pure memory or something faster than the TCP DB round-trip. A lot of people (including myself) use Lucene to achieve this by normalizing the data into flat files.
Lucene has a built-in Ram Drive directory indexer, which can build the index all in memory - removing the dependency on the file system, and greatly increasing speed.
Lately, I've architected systems that have a single Ram drive index wrapped by a Webservice. Then, I have my Ajax-like dropdowns query into that Webservice for high availability and high speed - no db layer, no file system, just pure memory and if remote tcp packet speed.
If you have an index on the column, then all the values are in the index and the dbms never has to look in the table. It just looks in the index which just has 10 entries. If this is mostly read only data, then cache it in memory. Caching helps scalability and a lot by relieving the database of work. A query that is quick on a database with no users, might perform poorly if a 30 queries are going on at the same time.
I have a big doubt.
Let's take as example a database for a whatever company's orders.
Let's say that this company make around 2000 orders per month, so, around 24K order per year, and they don't want to delete any orders, even if it's 5 years old (hey, this is an example, numbers don't mean anything).
In the meaning of have a good database query speed, its better have just one table, or will be faster having a table for every year?
My idea was to create a new table for the orders each year, calling such orders_2008, orders_2009, etc..
Can be a good idea to speed up db queries?
Usually the data that are used are those of the current year, so there are less lines the better is..
Obviously, this would give problems when I search in all the tables of the orders simultaneously, because should I will to run some complex UNION .. but this happens in the normal activities very rare.
I think is better to have an application that for 95% of the query is fast and the remaining somewhat slow, rather than an application that is always slow.
My actual database is on 130 tables, the new version of my application should have about 200-220 tables.. of which about 40% will be replicated annually.
Any suggestion?
EDIT: the RDBMS will be probably Postgresql, maybe (hope not) Mysql
Smaller tables are faster. Period.
If you have history that is rarely used, then getting the history into other tables will be faster.
This is what a data warehouse is about -- separate operational data from historical data.
You can run a periodic extract from operational and a load to historical. All the data is kept, it's just segregated.
Before you worry about query speed, consider the costs.
If you split the code into separate code, you will have to have code that handles it. Every bit of code you write has the chance to be wrong. You are asking for your code to be buggy at the expense of some unmeasured and imagined performance win.
Also consider the cost of machine time vs. programmer time.
If you use indexes properly, you probably need not split it into multiple tables. Most modern DBs will optimize access.
Another option you might consider is to have a table for the current year, and at the end append the data to another table which has data for all the previous years. ?
I would not split tables by year.
Instead I would archive data to a reporting database every year, and use that when needed.
Alternatively you could partition the data, amongst drives, thus maintaining performance, although i'm unsure if this is possible in postgresql.
For the volume of data you're looking at splitting the data seems like a lot of trouble for little gain. Postgres can do partitioning, but the fine manual [1] says that as a rule of thumb you should probably only consider it for tables that exceed the physical memory of the server. In my experience, that's at least a million rows.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
I agree that smaller tables are faster. But it depends on your business logic if it makes sense to split a single entity over multiple tables. If you need a lot of code to manage all the tables than it might not be a good idea.
It also depends on the database what logic you're able to use to tackle this problem. In Oracle a table can be partitioned (on year for example). Data is stored physically in different table spaces which should make it faster to address (as I would assume that all data of a single year is stored together)
An index will speed things up but if the data is scattered across the disk than a load of block reads are required which can make it slow.
Look into partitioning your tables in time slices. Partitioning is good for the log-like table case where no foreign keys point to the tables.