Indexes vs full scans, how do I determine which to use and when - Oracle

Is there a rule of thumb on when to use full scans rather than indexes? I am new to Oracle and am still struggling to wrap my mind around performance tuning. Thanks!

The intent of the optimizer is to make these decisions for you. Granted it won't always be right, but generally I'll let the optimizer do all the heavy lifting for me, and only worry about the issues where it went awry and I had to "intervene" so to speak.
But in terms of full scan versus index, I'd encourage you not to think in these terms, simply because your customer and hence your applications don't care. What customers care about is response time, so the driving factor here should always be response time.
If for query 1, a full scan is faster than an index scan, then the full scan is the better option. That doesn't make a full scan better "always" or somehow "philosophically" or "technically" better than an index scan. It simply means for this query, the full scan was best.
If for another query, the index is better, then by all means, we should use the index.
It's common to see advisories in books and blog posts saying "Look for TABLE ACCESS FULL in the plan - that's bad", and similar. That is BS in my opinion. Whatever execution plan gives you the best performance for your query is the best one... no matter whether it uses index scans, full scans, or any other form of optimizer path.
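To make "response time is the driving factor" concrete, here is a minimal sketch of measuring the same query with and without an index instead of judging the plan by its shape. It uses Python's built-in sqlite3 module purely as a stand-in (it is not Oracle, and its planner behaves differently); the orders table, its columns and the orders_customer_idx index are all hypothetical names made up for the illustration.

```python
import sqlite3
import statistics
import time


def median_seconds(conn, sql, params=(), repeats=5):
    """Run a query several times and return the median wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()  # fetch everything, as a client would
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Hypothetical table: 50,000 orders spread evenly over 100 customers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 0.5) for i in range(50_000)],
)

query = "SELECT count(*), sum(amount) FROM orders WHERE customer_id = ?"

# Without an index the engine has no choice but to scan the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
time_before = median_seconds(conn, query, (42,))

conn.execute("CREATE INDEX orders_customer_idx ON orders (customer_id)")

# With the index available, the planner decides whether to use it.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
time_after = median_seconds(conn, query, (42,))

# The last column of each plan row is the human-readable plan step.
print("before:", plan_before[0][-1], f"{time_before:.6f}s")
print("after: ", plan_after[0][-1], f"{time_after:.6f}s")
```

Whichever timing is lower for your data and your query is the better plan for that query; the numbers will differ from machine to machine, which is rather the point.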
Lucene: which would be better, many queries or one massive OR query?

Problem: I have a large list of keywords, and I want to see whether they are contained in a document or documents. (My users want to know, when a document is published, whether it has any of their saved keywords.)
So I could make many queries; one for each keyword.
Or I could construct a query something like: "coffee OR tea OR milk OR sugar OR beer"
Now let's say there are over 1,000 keywords.
Which one is likely to lead to pain and suffering?
Would one be better over the other when running against one document or many documents?
(I am leaning towards the OR version, but I am worried I will hit some query length (performance) limit if I go too far.)
Once I have enough data I will run some comparisons and report back.
Any hints between now and then would be great though.
Single Giant Query Pro: You get ranking by Lucene's scoring algorithm for all of the keywords.
Single Giant Query Con: You make Lucene use a huge amount of memory, as it needs to remember each subquery's result (or part of it) in order to give you that nice ranking that takes all keywords into account. The bigger the OR query, the more memory Lucene needs to do it, and the slower it does it.
I'd say, if at all possible for your purposes, break it up, since OR queries are The Devil (even though it's sometimes necessary to deal with them); but a benchmark should be better than asking random people for opinions :P
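One middle ground between 1,000 separate queries and one giant OR is to batch the keywords into a handful of medium-sized OR queries. The sketch below only builds the query strings; build_or_queries and max_clauses are hypothetical names, the value 256 is an arbitrary assumption to tune by benchmarking, and the quoting assumes the classic Lucene query parser syntax (phrases in double quotes, backslash as the escape character). As far as I know, Lucene has traditionally capped a boolean query at 1,024 clauses by default, so staying well under that seems prudent.

```python
def quote(keyword):
    """Wrap a keyword in double quotes, escaping backslashes and quotes."""
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_or_queries(keywords, max_clauses=256):
    """Group keywords into OR query strings of at most max_clauses terms each."""
    if max_clauses < 1:
        raise ValueError("max_clauses must be at least 1")
    # Drop blanks and duplicates while keeping the original order.
    unique = list(dict.fromkeys(k.strip() for k in keywords if k.strip()))
    queries = []
    for start in range(0, len(unique), max_clauses):
        batch = unique[start:start + max_clauses]
        queries.append(" OR ".join(quote(k) for k in batch))
    return queries


if __name__ == "__main__":
    for q in build_or_queries(["coffee", "tea", "milk", "sugar", "beer"], max_clauses=2):
        print(q)
```

Each batch still gets scored within itself, but a ranking across batches would have to be merged by the caller, so this trades some of the "Pro" above for a bounded query size.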

Gathering statistics on tables without indexes

Does it make sense to regularly gather statistics on a table without indexes in an Oracle database? I'm asking from an optimization point of view. I assume that a FULL TABLE SCAN would always be performed on that table.
Yes, it's still worth gathering the statistics. Information about the number and size of rows is of use to the optimizer, even though there are no indexes.
In a nutshell, statistics are as important to the optimizer as food is to human beings. If you don't get to eat for a long time, your brain would degrade in its functioning.
The more up to date the optimizer's statistics are, the better the execution plan it can choose.
Let me try to explain with an example:
Let's say you are asked to reach a particular destination on a fine day. However, you are not provided with a map or any location information. Now, there could be any number of ways to reach the destination, but without proper information you might well pick the worst possible one. If you are smart enough, you might ask for directions; this is where you start gathering statistics. Just imagine: if you had the entire route in mind before you started your journey, i.e. if you could gather all the statistics, you could make the best plan.
UPDATE Saw a comment about auto optimizer stats collection.
Benefits and trade offs for improving text search on small data in PostgreSQL

I have 4 text columns of interest.
Each column is up to about 100 characters.
The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
The data is currently about 500 rows. I don't expect this to grow beyond 1000.
A small number of users (under 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.
So all this suggests a pretty small data set to consider.
I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost about how to approach it.
PostgreSQL offers a few options for improving text searching speed. The possible options built into PostgreSQL I've been considering are
1. Don't try to index this at all. Just use ILIKE, LIKE on lower, or similar. (Without an index?)
2. Index with pg_trgm to improve search speed. I would assume that I would need to index the concatenation somehow.
3. Full text searching. I assume this would involve concatenating for the index also.
Unfortunately, I'm not really familiar with the expected performance of any of these or the benefits and trade offs, so it's hard to know what things I should try first and what things I shouldn't even consider. Some things I have read suggest that doing the indexing for 2 and 3 is pretty slow, which conflicts with the fact that I'll be having occasional modifications going on. And the mixed language makes full text search seem unattractive since it appears to be language based, unless it can handle multiple languages simultaneously. Would I expect that for data this small, a simple ILIKE or maybe a LIKE on lower is probably fast enough? Or maybe the indexing is fast enough for the low load of modifications on data this small? Would I be better off looking for something outside the database?
Granted, I would have to actually benchmark all these to really know for sure what's fastest, but unfortunately, I don't have much time for this project. So what are the benefits and trade offs of these methods? Which of these options are not appropriate for solving this type of problem? What are some other types of solutions (including potentially outside the database) worth considering?
(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches turn up Full Text Search for the most part, which I don't even know if it's useful for me.)
I'm on PG 9.2.4, so any goodies pre-9.3 are an option.
Update: I've expanded this answer into a detailed blog post.
Rather than focusing purely on speed, please consider search semantics first. Define your requirements.
For example, do users need to be able to differentiate based on the order of terms? Should a search for "radiata pinus" find "pinus radiata"? Does the same rule apply to words within a column as between columns?
Are spaces always word separators, or are spaces within a column part of the search term?
Do you need wildcards? If so, do you need only left-anchored wildcards (think staph%) or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
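To see why trigrams help with infix and suffix patterns, it may help to look at what a trigram index actually stores. The following is a rough Python approximation of the extraction pg_trgm performs (lowercase, split into words, pad each word with two leading spaces and one trailing space, take every 3-character window). The trigrams function is a hypothetical helper for illustration only, and it handles just ASCII letters and digits, whereas the real extension's behaviour depends on the database locale.

```python
import re


def trigrams(text):
    """Approximate the set of trigrams pg_trgm would extract from text."""
    result = set()
    # Lowercase, then treat every run of non-alphanumerics as a word break.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    # A pattern such as %ccus contains the trigrams "ccu" and "cus"; any row
    # that matches must contain them too, wherever they sit inside the word.
    print({"ccu", "cus"} <= trigrams("Staphylococcus aureus"))
```

Because the trigrams are position-independent, the index can narrow down candidate rows for a pattern anchored anywhere in the word, which a b-tree on the column cannot do.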
If you're mostly searching for discrete words and word-order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.
If you're mostly doing prefix searches of discrete columns then simple LIKE queries on a regular b-tree index per column will be the way to go.
So. Figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
For 1000 rows, I would guess that LIKE together with lower() should be fast enough. After a couple of queries the table will most probably be completely cached.
Regarding the indexing using pg_trgm: you are talking about "occasional" updates/inserts to the table. I would think that the additional costs of using a trigram index would only show up when you update/insert into that table a lot - like several times a second.
If "occasional" only means several times an hour (or even less), then I doubt you'd see the difference in real life. I think somewhere in Depesz's blog there was also an article that compared the insert speed with and without a trigram index, but I can't find it anymore.
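For what the unindexed "LIKE together with lower()" approach could look like across four columns, here is a small sketch. It runs against Python's built-in sqlite3 module as a stand-in so that it is self-contained; the catalog table and its genus, species, family and notes columns are hypothetical names, and in PostgreSQL you could write ILIKE instead of lower(...) LIKE. The one non-obvious part is escaping % and _ in user input so they are matched literally.

```python
import sqlite3


def like_pattern(term):
    """Wrap a search term in %...% and escape LIKE wildcards inside it."""
    escaped = (
        term.lower()
        .replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    return f"%{escaped}%"


# The same named parameter is reused for all four (hypothetical) columns.
SEARCH_SQL = """
    SELECT id FROM catalog
    WHERE lower(genus)   LIKE :p ESCAPE '\\'
       OR lower(species) LIKE :p ESCAPE '\\'
       OR lower(family)  LIKE :p ESCAPE '\\'
       OR lower(notes)   LIKE :p ESCAPE '\\'
    ORDER BY id
"""


def search(conn, term):
    """Return ids of rows where any of the four columns contains term."""
    return [row[0] for row in conn.execute(SEARCH_SQL, {"p": like_pattern(term)})]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalog (id INTEGER PRIMARY KEY, genus TEXT, species TEXT, "
    "family TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Pinus", "radiata", "Pinaceae", "Monterey pine"),
        (2, "Staphylococcus", "aureus", "Staphylococcaceae", "bacterium"),
        (3, "Quercus", "robur", "Fagaceae", "100% hardy oak"),
    ],
)

print(search(conn, "PINUS"))
print(search(conn, "ccus"))
```

With a few hundred rows this is a sequential scan over a table that fits comfortably in memory, so it is a reasonable baseline to measure before adding any index.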

Pitfalls in prototype database design (for performance viability testing)

Following on from my previous question, I'm looking to run some performance tests on various potential schema representations of an object model. However, the catch is that while the model is conceptually complete, it's not actually finalised yet - and so the exact number of tables, and numbers/types of attributes in each table aren't definite.
From my (possibly naive) perspective it seems like it should be possible to put together a representative prototype model for each approach, and test the performance of each of these to determine which is the fastest approach for each case.
And that's where the question comes in. I'm aware that the performance characteristics of databases can be very non-intuitive, such that a small (even "trivial") change can lead to an order of magnitude difference. Thus I'm wondering what common pitfalls there might be when setting up a dummy table structure and populating it with dummy data. Since the environment is likely to make a massive difference here, the target is Oracle running on RHEL 3.
(In particular, I'm looking for examples such as "make sure that one of your tables has a much more selective index than the other"; "make sure you have more than x rows/columns because below this you won't hit page faults and the performance will be different"; "ensure you test with the DATETIME datatype if you're going to use it because it will change the query plan greatly", and so on. I tried Google, expecting there would be lots of pages/blog posts on best practices in this area, but couldn't find the trees for the wood (lots of pages about tuning performance of an existing DB instead).)
As a note, I'm willing to accept an answer along the lines of "it's not feasible to perform a test like this with any degree of confidence in the transitivity of the result", if that is indeed the case.
There are a few things that you can do to position yourself to meet performance objectives. I think they happen in this order:
1. be aware of architectures, best practices and patterns
2. be aware of how the database works
3. spot-test performance to get additional precision or determine impact of wacky design areas
More on each:
Architectures, best practices and patterns: one of the most common reasons for reporting databases to fail to perform is that those who build them are completely unfamiliar with the reporting domain. They may be experts on the transactional database domain - but the techniques from that domain do not translate to the warehouse/reporting domain. So, you need to know your domain well - and if you do you'll be able to quickly identify an appropriate approach that will work almost always - and that you can tweak from there.
How the database works: you need to understand in general what options the optimizer/planner has for your queries. What's the impact to different statements of adding indexes? What's the impact of indexing a 256 byte varchar? Will reporting queries even use your indexes? etc
Now that you've got the right approach, and generally understand how 90% of your model will perform - you're often done forecasting performance with most small to medium size databases. If you've got a huge one, there's a ton at stake, you've got to get more precise (might need to order more hardware), or have a few wacky spots in the design - then focus your tests on just this. Generate reasonable test data - and (important) stats that you'd see in production. And look to see what the database will do with that data. Unless you've got real data and real prod-sized servers you'll still have to extrapolate - but you should at least be able to get reasonably close.
Running performance tests against various putative implementations of a conceptual model is not naive so much as heroically forward thinking. Alas, I suspect it will be a waste of your time.
Let's take one example: data. Presumably you are intending to generate random data to populate your tables. That might give you some feeling for how well a query might perform with large volumes. But often performance problems are a product of skew in the data; a random set of data will give you an averaged distribution of values.
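If you do generate dummy data, one way to avoid the "averaged distribution" trap is to generate deliberately skewed values alongside the uniform ones. The sketch below is only an illustration under assumed numbers: the function names are hypothetical, and the 1/rank weighting is a simple Zipf-like choice, not a claim about what your production data looks like.

```python
import random
from collections import Counter


def uniform_values(n, distinct, seed=0):
    """n values drawn evenly from 0..distinct-1."""
    rng = random.Random(seed)
    return [rng.randrange(distinct) for _ in range(n)]


def skewed_values(n, distinct, seed=0):
    """Zipf-like data: value k is drawn with weight 1/(k+1)."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) for k in range(distinct)]
    return rng.choices(range(distinct), weights=weights, k=n)


def top_share(values):
    """Fraction of rows taken by the single most common value."""
    return Counter(values).most_common(1)[0][1] / len(values)


if __name__ == "__main__":
    print("uniform top share:", top_share(uniform_values(10_000, 100)))
    print("skewed top share: ", top_share(skewed_values(10_000, 100)))
```

A predicate on the most common value of the skewed column touches a far larger slice of the table than the same predicate on the uniform column, which is exactly the kind of case where a plan that looked fine on random data can fall over.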
Another example: code. Most performance problems are due to badly written SQL, especially inappropriate joins. You might be able to apply an index to tune an individual SELECT * FROM my_table WHERE blah query, but that isn't going to help you forestall badly written queries.
The truism about premature optimization applies to databases as well as algorithms. The most important thing is to get the data model complete and correct. If you manage that you are already ahead of the game.
Having read the question which you linked to I more clearly understand where you are coming from. I have a little experience of this Hibernate mapping problem from the database designer perspective. Taking the example you give at the end of the page ...
Animal > Vertebrate > Mammal > Carnivore > Canine > Dog type hierarchy,
... the key thing is to instantiate objects as far down the chain as possible. Instantiating a collection of Animals will be much slower than instantiating separate collections of Dogs, Cats, etc. (presuming you have tables for all or some of those sub-types).
This is more of an application design issue than a database one. What will make a difference is whether you only build tables at the concrete level (CATS, DOGS) or whether you replicate the hierarchy in tables (ANIMALS, VERTEBRATES, etc). Unfortunately there are no simple answers here. For instance, you have to consider not just the performance of data retrieval but also how Hibernate will handle inserts and updates: a design which performs well for queries might be a real nightmare when it comes to persisting data. Also relational integrity has an impact: if you have some entity which applies to all Mammals, it is comforting to be able to enforce a foreign key against a MAMMALS table.
Performance problems with databases do not scale linearly with data volume. A database with a million rows in it might show one hotspot, while a similar database with a billion rows in it might reveal an entirely different hotspot. Beware of tests conducted with sample data.
You need good sound database design practices in order to keep your design simple and sound. Worry about whether your database meets the data requirements, and whether your model is relevant, complete, correct and relational (provided you're building a relational database) before you even start worrying about speed.
Then, once you've got something that's simple, sound, and correct, start worrying about speed. You'd be amazed at how much you can speed things up by just tweaking the physical features of your database, without changing any app code. To do this, you need to learn a lot about your particular DBMS.
They never said database development would be easy. They just said it would be this much fun!

Does having several indices all starting with the same columns negatively affect Sybase optimizer speed or accuracy?

We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have detrimental effect on performance, and thus one of the indices needs to have the first two fields switched.
I would prefer to avoid change if it is not necessary, since they didn't back up their assertion with any facts/reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes - updating five indexes takes noticeably longer than just one index, plus you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of the indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean, you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation, having five indices with the leading columns all the same would not cause a "performance problem", beyond that which you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with the recent version of Sybase, but in general with all SQL servers, the main (and almost the only) performance impact indexes have is with INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table per se (or the clustered index) to be updated, as well as all the indexes.
With regards to SELECT queries, having "too many" indexes may have a minor performance impact for example by introducing competing hard disk pages for cache. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, and assuming a generally monotonic progression of the date value, is a positive thing (with regard to CRUD operations), for it will keep the need for splitting/balancing the index tables to a minimum (since most inserts are at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation with it to assess performance issues in a more systematic fashion could probably be done relatively safely and easily without interfering much with production. (Unless the 10k or so records are very wide or the queries-per-second rate is high, etc.)
