Number VS Varchar(2) Primary Keys - performance

I'm now at the point in my project where I need to design my database (Oracle).
Usually for the status and countries tables I don't use a numeric primary key, for example:
STATUS (max 6)
AC --> Active
DE --> Deleted
COUNTRIES (total 30)
UK --> United Kingdom
IT --> Italy
GR --> Greece
These tables are static, are not updated through the application, and are not expected to change in the future, so there is no risk of update problems in the tables that use these values as foreign keys.
The main table of the application will use status and country (country more than once, e.g. origin country and destination country), and 600000 rows are expected to be added per year.
So my question is: will these VARCHAR(2) keys have an impact on performance when querying the join of these 3 tables?
Will the first be significantly slower than the second?
FROM main m, status s, countries c
WHERE m.status_cd = s.status_cd
AND m.country_cd = c.country_cd
AND m.status_cd = 'AC'
AND m.country_cd = 'UK'
FROM main m, status s, countries c
WHERE m.status_cd = s.status_cd
AND m.country_cd = c.country_cd
AND m.status_cd = 1
AND m.country_cd = 2
Status is not binary ("max 6" next to the table name). The values will probably be:
* active
* deleted
* draft
* send
* replaced
and we need to display the decoded values to the user, so we need the names.

Both the status and country tables are so small that they are going to be memory resident in practice, whether formally stated as such or not. Indeed, except that a foreign key normally requires an index on the referenced primary key field, you might be tempted not to bother with any indexes on the tables.
The performance difference between the joins with different types is going to be negligible, and the numeric code will, if anything, be slower since there's 'more' data to store (but it is all so small that it is negligible, again).
So, go with the natural codes. All else apart, the SQL in the first example is clearer; the 'UK' and 'AC' are much more meaningful than 1 and 2.
In non-Oracle DBMS, you would probably use CHAR(2) for both the status and country code values. Oracle users tend to use VARCHAR2 for everything; I'm not sure whether there is a penalty for using a CHAR(2) column instead, especially since the column values are fixed length. (Under Informix, for instance, a VARCHAR(2) field - a field of up to two characters - would store as 3 bytes, a length (always 2 in your case) and the 2 data bytes. By contrast, a CHAR(2) field would occupy just 2 bytes.)
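As a rough sketch of the natural-key design discussed above: the table names and the `status_cd` / `country_cd` columns follow the question, while the `descr`, `name` and `id` columns and all column sizes are assumptions made for illustration.

```sql
-- Lookup tables keyed by their natural two-character codes.
CREATE TABLE status (
  status_cd CHAR(2)      PRIMARY KEY,
  descr     VARCHAR2(30) NOT NULL      -- hypothetical column
);

CREATE TABLE countries (
  country_cd CHAR(2)      PRIMARY KEY,
  name       VARCHAR2(60) NOT NULL     -- hypothetical column
);

CREATE TABLE main (
  id         NUMBER  PRIMARY KEY,      -- hypothetical surrogate key
  status_cd  CHAR(2) NOT NULL REFERENCES status (status_cd),
  country_cd CHAR(2) NOT NULL REFERENCES countries (country_cd)
);

-- The filter reads naturally with the codes; the joins are only
-- needed to show the decoded names to the user.
SELECT m.id, s.descr, c.name
FROM   main m
JOIN   status s    ON s.status_cd  = m.status_cd
JOIN   countries c ON c.country_cd = m.country_cd
WHERE  m.status_cd  = 'AC'
AND    m.country_cd = 'UK';
```

With natural codes, a query that only filters on status or country does not need to join the lookup tables at all.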

Check out this link. Bottom line is there isn't much performance difference between VARCHAR and NUMBER. So you should go for whichever makes sense for the column. Here, the VARCHAR seems to make more sense.

If 'status' is (and will always be?) a binary active/deleted field, why bother with the table at all? It seems like normalization taken to an impractical extreme.
It would certainly be quicker, not to mention easier, to simply use a tinyint(1) field (NUMBER(1) in Oracle, which has no TINYINT type) and record the active/deleted state as a 1 or 0.
This eliminates one of your joins entirely which has got to be a good thing.

It does not matter which method you choose in this case. The important part is to use the same kind throughout the database and be consistent in your id convention.


When to use RAW datatype column over VARCHAR2 in Oracle?

I've been working on a large-scale project where all the primary keys are stored as RAW type. The ID field is auto-generated as a unique 16-byte UUID. I can't find any particular advantage of using a RAW type column. Can someone help me understand if there is any real advantage of storing primary keys in RAW format instead of VARCHAR2?
A GUID in Oracle is represented as raw(16).
You can get a GUID like this:
select sys_guid() from dual;
That's why you should use raw(16).
Well, in database design size typically matters. A bigger key takes more space in storage and on disk, sorting takes longer, etc.
From this point of view the integer database key is the most compact one (implemented as a NUMBER type with zero scale, typically occupying 2 to 8 bytes).
For various reasons a UUID is used as a key, with motivations that are often independent of database design rules.
Additionally, the UUID is often stored as a formatted string in a VARCHAR2 column.
This is a similar design to storing DATEs as strings (which is not considered best practice).
In any case, a RAW(16) column takes 16 bytes, while the formatted UUID takes 36 bytes.
So in summary, IMO, the recommendations are as follows:
1. Use NUMBER keys
2. If you can't (and have solid arguments for it), use a UUID in RAW(16) format
Note that of course the RAW format is a bit less convenient to handle than a string (e.g. when setting a bind variable). This often leads to the decision to store the UUID as a string, which is what I encountered in the vast majority of cases.
Below is a small example illustrating the difference in sizing:
create table tab
(id INT,
raw_uuid RAW(16));
insert into tab(ID,RAW_UUID) values (1,sys_guid());
insert into tab(ID,RAW_UUID) values (1000000001,sys_guid());
select * from tab;
---------- --------------------------------
1 8135869AECF44FB280A04033888FD518
1000000001 DE04ED07DDD84D1AABE9059F38364C7E
select vsize(id), vsize(raw_uuid) from tab;
---------- ---------------
2 16
6 16
What you can do is to define a virtual column (i.e. column that allocates no space) that presents the formatted UUID:
alter table tab add ( UUID VARCHAR2(36) GENERATED ALWAYS AS
Now the table has the text form UUID as well and you can use the familiar query
select * from tab where uuid = 'cbf7e2e2-a9e9-40fb-badc-18cb9a4fe663';
You can even define an index on the virtual column, but before using a UUID always think about rule 1 above.
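The virtual-column definition above is cut off after `GENERATED ALWAYS AS`. One way it could be completed is sketched below. This is an assumption, not necessarily the expression the answer used: it formats the 32 hex digits returned by `RAWTOHEX` into the usual 8-4-4-4-12 layout, and the index name is made up.

```sql
-- Sketch: expose the RAW(16) key as a lowercase, dash-separated string.
ALTER TABLE tab ADD (
  uuid VARCHAR2(36) GENERATED ALWAYS AS (
    LOWER(
      SUBSTR(RAWTOHEX(raw_uuid),  1,  8) || '-' ||
      SUBSTR(RAWTOHEX(raw_uuid),  9,  4) || '-' ||
      SUBSTR(RAWTOHEX(raw_uuid), 13,  4) || '-' ||
      SUBSTR(RAWTOHEX(raw_uuid), 17,  4) || '-' ||
      SUBSTR(RAWTOHEX(raw_uuid), 21, 12)
    )
  ) VIRTUAL
);

-- Optional: index the text form so lookups by string do not scan the table.
CREATE INDEX tab_uuid_ix ON tab (uuid);   -- hypothetical index name
```

The virtual column itself takes no space in the table, but an index on it stores the 36-character values, so part of the size advantage of RAW(16) is given back.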

Is it OK to create several indexes on a table if they're really needed

I have a table with 7 columns.
It's going to contain lots and lots of data - something like more than 1.7 million records will be added every month.
Of those 7 columns 5 are the ones that I'll be using in the WHERE clause of my queries against this table in different combinations.
Is it OK to create different indexes for those possible combinations ?
I'm asking this question because if I do that, there'll be more than 10 indexes on this table and I'm not sure if this is a good idea.
On the other hand, I'm afraid of querying a table with this big amount of data without indexes.
Here's the table:
Possible queries:
and so on.
Ignoring index skip scans* for the moment, in order for a query to use an index:
The leading index columns must be listed in the query
They must be compared using exact joins (i.e. using =, not <, > or LIKE)
For example, a table with a composite index on (a, b) could use the index in the following queries:
a = :b1 and b >= :b2
a = :b1
but not:
b = :b2
because column b is listed second in the index.
* In some cases, it's possible for the index to be used in this case via an index skip scan. This is where the leading column in the index is skipped. There needs to be relatively few distinct values for the first column however, which doesn't happen often (in my experience).
Note that a "larger" index can be used by queries which only use some of the leading columns from it. So in the example above, an index on just a is redundant because the queries shown can use the index on a, b. An index on just b may be useful however.
The more indexes you add, the slower your inserts/updates/deletes will be, because the indexes have to be maintained at the same time as the table. Therefore you should aim to keep the number of indexes down, unless there are significant query benefits to adding a new one. This is something you'll have to measure in your environment to determine the exact cost/benefit.
Note that having multiple indexes with similar columns can lead to the wrong index being selected. So there is potential downside for selects when you have many similar indexes. There is also a slight overhead in parse times, as Oracle has more options to consider when selecting the execution plan.
Looking at your queries I believe you only need indexes on:
st, departid, period
st, pensionerid, period
You may wish to add amount at the end of these as well, so your queries can be fully answered from the index, saving you a table lookup. You may also need further indexes if these columns are foreign keys to other tables, to prevent locking issues.
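A sketch of the two indexes suggested above, with `amount` appended so that matching queries can be answered from the index alone. The table name `payments` and the index names are assumptions, since the table definition is not shown; the column names are the ones used in this answer.

```sql
-- Hypothetical table and index names; columns as named in the answer.
CREATE INDEX payments_st_dept_period_ix
  ON payments (st, departid, period, amount);

CREATE INDEX payments_st_pens_period_ix
  ON payments (st, pensionerid, period, amount);

-- Can be served by the first index: the leading columns are all
-- compared with "=", and amount is read from the index itself.
SELECT SUM(amount)
FROM   payments
WHERE  st       = :st
AND    departid = :departid
AND    period   = :period;
```

Whether the extra column pays for itself depends on how often such queries run compared with the added insert cost, so it is worth measuring both.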
This decision would greatly depend on expected number of distinct values in each column, and thus selectivity of each possible index.
Things I would consider while making decisions:
Obviously, PAYMENTTYPE & ST fields hold up to 10 19 distinct values each, which is pretty unselective if we keep in mind your expected volume of data (~400M rows), so they won't help you much.
However, they probably could become good candidates for list partitioning instead.
I would also think of switching PERIOD CHAR(6 CHAR) to DATE and making a composite range-list partition on period+st/paymenttype.
DEPARTID - If you have hundreds of departments, then it's probably an indexing candidate, but if only dozens - then probably a full scan would perform way faster.
PENSIONERID seems to be a high-selectivity field, so I would consider creating a separate index on it, and including it in a composite index on PERIOD+PENSIONERID (in that field order).
I think you should create a few combined indexes, like (ST, PERIOD) and (ST, PENSIONERID). That will speed up most of your sample queries...

Real Time issues: Oracle Performance tuning (types / indexes / plsql / queries)

I am looking for a real time solution...
Below are my DB columns. I am using Oracle10g. Please help me in defining table types / indexes and tuned PLSQL / query (both) for the updates and insertion
Insert and Update queries are simple, but here we need to take care of the performance because my system will execute them 200 times per second.
Let me know... should I use procedures or simple queries? It is requested to write tuned plsql and query with proper DB table types / indexes.
I would really like to see the performance of my system after continuous 200 updates per second
DB table (columns) (I can change the structure if required so please let me know...)
Play ID - ID
Type - Song or Message
Count - Summation of total play
Retries - Summation of total play, if failed.
Duration - Total Duration
Last Updated - Last Updated Date Time
Thanks in advance ... let me know in case of any confusion...
You've not really given a lot of detail about WHAT you are updating etc.
As a basis for you to write your update statements, don't use PL/SQL unless you cannot achieve what you want to do in SQL as the context switching alone will hurt your performance before you even get round to processing any records.
If you are able to create indexes specifically for the update then index the columns that will appear in your update statement's WHERE clause so the records can be found quickly before being updated.
As for inserting, look up the benefits of the /*+ append */ hint for inserting records to see if it will benefit your particular case.
Finally, the table structure you will use will depend on many factors that you haven't even begun to touch on with the details you've supplied. I suggest you either do some research on DB structure or ask your DBAs for a 101 class in it.
Best of luck...
In response to:
Play ID - ID (here the id would be the song name, like abc.wav; maybe VARCHAR2, not yet decided. Is it fine if the primary key is of type VARCHAR2? Any suggestions are most welcome.)
Type - Song or Message (VARCHAR2)
Count - Summation of total play (Integer)
Retries - Summation of total play, if failed (Integer)
Duration - Total Duration (Integer)
Last Updated - Last Updated Date Time (DateTime)
There is nothing wrong with having a PRIMARY KEY as a VARCHAR2 data type (though there is often debate about the value of having a non-specific PK, i.e. a sequence). You must, however, ensure your PK is unique; if you can't guarantee this, then it would be worth having a sequence as your PK over having to introduce another column to maintain uniqueness.
As for declaring your table columns as INTEGER, they eventually will be resolved to NUMBER anyway so I'd just create the table column as a number (unless you have a very specific reason for creating them as INTEGER).
Finally, the DATETIME column: you only need to declare it as a DATE datatype unless you need real precision in your time portion, in which case declare it as a TIMESTAMP datatype.
As for helping you with the structure of the table itself (i.e. which columns you want etc.) then that is not something I can help you with as I know nothing of your reporting requirements, application requirements or audit requirements, company best practice, naming conventions etc. I'm afraid that is something for you to decide for yourself.
For performance though, keep indexes to a minimum (i.e. only index columns that will aid your UPDATE WHERE clause search), only update the minimum data possible and, as suggested before, research the APPEND hint for inserts. It may help in your case, but you will have to test it for yourself.
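To make the "plain SQL, few indexes" advice concrete, here is a minimal sketch. Everything in it is an assumption: the table name `play_stats`, the column names and sizes, and the idea that each play event should add to a running total. It uses a single `MERGE` statement so that no PL/SQL is needed, and the only index is the one behind the primary key.

```sql
-- Hypothetical table; names and sizes are illustrative only.
CREATE TABLE play_stats (
  play_id      VARCHAR2(100) PRIMARY KEY,
  play_type    VARCHAR2(10)  NOT NULL,
  play_count   NUMBER        DEFAULT 0 NOT NULL,
  retries      NUMBER        DEFAULT 0 NOT NULL,
  duration     NUMBER        DEFAULT 0 NOT NULL,
  last_updated DATE          DEFAULT SYSDATE NOT NULL
);

-- One statement per play event: update the row if it exists,
-- otherwise insert it. Bind variables keep the statement shareable.
MERGE INTO play_stats t
USING (SELECT :play_id   AS play_id,
              :play_type AS play_type,
              :retries   AS retries,
              :duration  AS duration
       FROM   dual) s
ON (t.play_id = s.play_id)
WHEN MATCHED THEN UPDATE SET
  t.play_count   = t.play_count + 1,
  t.retries      = t.retries + s.retries,
  t.duration     = t.duration + s.duration,
  t.last_updated = SYSDATE
WHEN NOT MATCHED THEN INSERT
  (play_id, play_type, play_count, retries, duration, last_updated)
VALUES
  (s.play_id, s.play_type, 1, s.retries, s.duration, SYSDATE);
```

If two sessions merge the same new `play_id` at the same moment, one of them may fail with a unique-constraint error and would need to retry. Many concurrent updates to the same row also serialize on the row lock, so this is something to test under the expected load.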

Oracle - Understanding the no_index hint

I'm trying to understand how no_index actually speeds up a query and haven't been able to find documentation online to explain it.
For example I have this query that ran extremely slow
select *
from <tablename>
where field1_ like '%someGenericString%' and
field1_ <> 'someSpecificString' and
Action_='_someAction_' and
Timestamp_ >= trunc(sysdate - 2)
And one of our DBAs was able to speed it up significantly by doing this
select /*+ NO_INDEX(TAB_000000000019) */ *
from <tablename>
where field1_ like '%someGenericString%' and
field1_ <> 'someSpecificString' and
Action_='_someAction_' and
Timestamp_ >= trunc(sysdate - 2)
And I can't figure out why? I would like to figure out why this works so I can see if I can apply it to another query (this one a join) to speed it up because it's taking even longer to run.
** Update **
Here's what I know about the table in the example.
It's a 'partitioned table'
TAB_000000000019 is the table not a column in it
field1 is indexed
Oracle's optimizer makes judgements on how best to run a query, and to do this it uses a large number of statistics gathered about the tables and indexes. Based on these stats, it decides whether or not to use an index, or to just do a table scan, for example.
Critically, these stats are not automatically up-to-date, because they can be very expensive to gather. In cases where the stats are not up to date, the optimizer can make the "wrong" decision, and perhaps use an index when it would actually be faster to do a table scan.
If this is known by the DBA/developer, they can give hints (which is what NO_INDEX is) to the optimizer, telling it not to use a given index because it's known to slow things down, often due to out-of-date stats.
In your example, TAB_000000000019 will refer to an index or a table (I'm guessing an index, since it looks like an auto-generated name).
It's a bit of a black art, to be honest, but that's the gist of it, as I understand things.
Disclaimer: I'm not a DBA, but I've dabbled in that area.
Per your update: If field1 is the only indexed field, then the original query was likely doing a fast full scan on that index (i.e. reading through every entry in the index and checking against the filter conditions on field1), then using those results to find the rows in the table and filter on the other conditions. The conditions on field1 are such that an index unique scan or range scan (i.e. looking up specific values or ranges of values in the index) would not be possible.
Likely the optimizer chose this path because there are two filter predicates on field1. The optimizer would calculate estimated selectivity for each of these and then multiply them to determine their combined selectivity. But in many cases this will significantly underestimate the number of rows that will match the condition.
The NO_INDEX hint eliminates this option from the optimizer's consideration, so it essentially goes with the plan it thinks is next best -- possibly in this case using partition elimination based on one of the other filter conditions in the query.
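To see which plan the optimizer actually picks with and without the hint, the two plans can be compared directly. This is only a sketch: `event_log` is a made-up stand-in for the real table, and the predicates are copied from the question.

```sql
-- Plan the optimizer chooses on its own.
EXPLAIN PLAN FOR
SELECT *
FROM   event_log
WHERE  field1_ LIKE '%someGenericString%'
AND    field1_ <> 'someSpecificString'
AND    action_ = '_someAction_'
AND    timestamp_ >= TRUNC(SYSDATE - 2);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Same query with index access ruled out for this table.
EXPLAIN PLAN FOR
SELECT /*+ NO_INDEX(event_log) */ *
FROM   event_log
WHERE  field1_ LIKE '%someGenericString%'
AND    field1_ <> 'someSpecificString'
AND    action_ = '_someAction_'
AND    timestamp_ >= TRUNC(SYSDATE - 2);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- If stale statistics are suspected, refresh them and compare again.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'EVENT_LOG');
END;
/
```

If the unhinted plan switches to the faster access path after the statistics are refreshed, the hint may no longer be needed.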
Using an index degrades query performance if it results in more disk IO compared to querying the table without an index.
This can be demonstrated with a simple table:
create table tq84_ix_test (
a number(15) primary key,
b varchar2(20),
c number(1)
);
The following block inserts 1 million records into this table. Every 250th record gets a rare value in column b, while all the others get a frequent value:
declare
  rows_inserted number := 0;
begin
  while rows_inserted < 1000000 loop
    if mod(rows_inserted, 250) = 0 then
      insert into tq84_ix_test values (
        -1 * rows_inserted,
        'rare value',
        1);  -- value for c is not shown in the source; 1 is a placeholder
      rows_inserted := rows_inserted + 1;
    else
      begin
        insert into tq84_ix_test values (
          trunc(dbms_random.value(1, 1e15)),
          'frequent value',
          1);  -- placeholder for c, as above
        rows_inserted := rows_inserted + 1;
      exception when dup_val_on_index then
        null;  -- random key collided; try again on the next iteration
      end;
    end if;
  end loop;
end;
/
An index is put on column b:
create index tq84_index on tq84_ix_test (b);
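The same query, run once without the index and once with it, differs in performance. Check it out for yourself:
set timing on
select /*+ no_index(tq84_ix_test) */
sum(c)  -- select list and FROM clause are missing in the source; assumed here
from tq84_ix_test
where b = 'frequent value';
select /*+ index(tq84_ix_test tq84_index) */
sum(c)
from tq84_ix_test
where b = 'frequent value';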
Why is it? In the case without the index, all database blocks are read, in sequential order. Usually, this is costly and therefore considered bad. In normal situation, with an index, such a "full table scan" can be reduced to reading say 2 to 5 index database blocks plus reading the one database block that contains the record that the index points to. With the example here, it is different altogether: the entire index is read and for (almost) each entry in the index, a database block is read, too. So, not only is the entire table read, but also the index. Note, that this behaviour would differ if c were also in the index because in that case Oracle could choose to get the value of c from the index instead of going the detour to the table.
So, to generalize the issue: if the index does not pick few records then it might be beneficial to not use it.
Something to note about indexes is that they are precomputed values based on the row order and the data in the field. In this specific case you say that field1 is indexed and you are using it in the query as follows:
where field1_ like '%someGenericString%' and
field1_ <> 'someSpecificString'
In the query snippet above, the filter first does a substring match, since the percent (%) characters surround the string, and then excludes one specific string. Without a hint, the optimizer will try to evaluate both conditions against the indexed field first: it checks whether the pattern occurs anywhere in each value, then checks that the value does not equal the other specific string. Only after the index has been checked are the other columns checked. This is a very slow process if repeated.
The NO_INDEX hint proposed by the DBA removes the optimizer's preference to use an index and will likely allow the optimizer to choose the faster comparisons first and not necessarily force index comparison first and then compare other columns.
The following is slow because it compares the string and its sub-strings:
field1_ like '%someGenericString%'
While the following is faster because it is specific:
field1_ like 'someSpecificString'
So the reason to use the NO_INDEX hint is if you have comparisons on the index that slow things down. If the index field is compared against more specific data then the index comparison is usually faster.
I say usually because when the indexed field contains more redundant data like in the example #Atish mentions above, it will have to go through a long list of comparison negatives before a positive comparison is returned. Hints produce varying results because both the database design and the data in the tables affect how fast a query performs. So in order to apply hints you need to know if the individual comparisons you hint to the optimizer will be faster on your data set. There are no shortcuts in this process. Applying hints should happen after proper SQL queries have been written because hints should be based on the real data.
Check out this hints reference:
To add to what Rene' and Dave have said, this is what I have actually observed in a production situation:
If the condition(s) on the indexed field returns too many matches, Oracle is better off doing a Full Table Scan.
We had a report program querying a very large indexed table. The index was on a region code and the query specified the exact region code, so the Oracle CBO used the index.
Unfortunately, one specific region code accounted for 90% of the table's entries.
As long as the report was run for one of the other (minor) region codes, it completed in less than 30 minutes, but for the major region code it took many hours.
Adding a hint to the SQL to force a full table scan solved the problem.
Hope this helps.
I had read somewhere that using a % in front of query like '%someGenericString%' will lead to Oracle ignoring the INDEX on that field. Maybe that explains why the query is running slow.

How to protect a running column within Oracle/PostgreSQL (kind of MAX-result locking or something)

I'd need advice on following situation with Oracle/PostgreSQL:
I have a db table with a "running counter" and would like to protect it in the following situation with two concurrent transactions:
T1 T2
-- C for new : result + 1
-- C for new : result + 1
So, in both cases, the column value for INSERT is calculated from the old result added by one.
From this, some running counter handled by the db would be fine. But that wouldn't work because
the counter values or existing rows are sometimes changed
sometimes I'd like there to be multiple counter "value groups" (as with the CODE mentioned) : with different values for CODE the counters would be independent.
With some other databases this can be handled with the SERIALIZABLE isolation level, but at least with Oracle and PostgreSQL, phantom reads are prevented yet the table still ends up with two distinct rows with the same counter value. This seems to have to do with predicate locking, i.e. locking "all the possible rows covered by the query"; some other databases end up locking the whole table or something similar.
SELECT ... FOR UPDATE -statements seem to be for other purposes and don't even seem to work with MAX() -function.
Setting a UNIQUE constraint on the column would probably be the solution, but are there other ways to prevent the situation?
b.r. Touko
EDIT: One more option could probably be manual locking even though it doesn't appear nice to me..
Both Oracle and PostgreSQL support what are called sequences, and they are a perfect fit for your problem. You can have a regular int column, but define one sequence per group, and do a single query like
-- PostgreSQL
insert into table (id, ... ) values (nextval('sequence_name_for_group_xx'), ... )
-- Oracle
insert into table (id, ... ) values (sequence_name_for_group_xx.nextval, ... )
Increments in sequences are atomic, so your problem just wouldn't exist. It's only a matter of creating the required sequences, one per group.
the counter values or existing rows are sometimes changed
You should put a unique constraint on that column if this would be a problem for your app. Doing so would guarantee a transaction at SERIALIZABLE isolation level would abort if it tried to use the same id as another transaction.
One more option could probably be manual locking even though it doesn't appear nice to me..
Manual locking in this case is pretty easy: just take a SHARE UPDATE EXCLUSIVE or stronger lock on the table before selecting the maximum. This will kill concurrent performance, though.
sometimes I'd like there to be multiple counter "value groups" (as with the CODE mentioned) : with different values for CODE the counters would be independent.
This leads me to the Right Solution for this problem: sequences. Set up several sequences, one for each "value group" you want to get IDs in their own range. See Section 9.15 of The Manual for the details of sequences and how to use them; it looks like they're a perfect fit for you. Sequences will never give the same value twice, but might skip values: if a transaction gets the value '2' from a sequence and aborts, the next transaction will get the value '3' rather than '2'.
The sequence answer is common, but might not be right. The viability of this solution depends on what you actually need. If what you semantically want is "some guaranteed to be unique number" then that is what a sequence is for. However, if what you want is to make sure that your value increases by exactly one on each insert (as you have asked), then DO NOT USE A SEQUENCE! I have run into this trap before myself. Sequences are not guaranteed to be sequential! They can skip numbers. Depending on what sort of optimizations you have configured, they can skip LOTS of numbers. Even if you have things configured just right so that you shouldn't skip any numbers, that is not guaranteed, and is not what sequences are for. So, you are only asking for trouble if you (mis)use them like that.
One step better solution is to bundle the select into the insert, like so:
INSERT INTO table(code, c, ...)
VALUES ('XX', (SELECT MAX(c) + 1 AS c FROM table WHERE code = 'XX'), ...);
(I haven't test run that query, but I'm pretty sure it should work. My apologies if it doesn't.) But, doing something like that reflects the semantic intent of what you are trying to do. However, this is inefficient, because you have to do a scan for MAX, and the inference I am taking from your sample is that you have a small number of code values relative to the size of the table, so you are going to do an expensive, full table scan on every insert. That isn't good. Also, this doesn't even get you the ACID guarantee you are looking for. The select is not transactionally tied to the insert. You can't "lock" the result of the MAX() function. So, you could still have two transactions running this query and they both do the sub-select and get the same max, both add one, and then both try to insert. It's a much smaller window, but you may still technically have a race condition here.
Ultimately, I would challenge that you probably have the wrong data model if you are trying to increment on insert. You should insert with a unique key, most commonly a sequence value (at least as an easy, surrogate key for any natural key). That gets the data safely inserted. Then, if you need a count of things, then have one table that stores your counts.
CREATE TABLE code_counts (
code VARCHAR(2), --or whatever
count NUMBER
);
If you really want to store the code count of each item as it is inserted, the separate count table also allows you to do so correctly, transactionally, like so:
UPDATE code_counts SET count = count + 1 WHERE code = 'XX' RETURNING count INTO :count;
INSERT INTO table(code, c, ...) VALUES ('XX', :count, ...);
The key is that the update locks the row in the counter table and reserves that value for you. Then your insert uses that value. And all of that is committed as one transactional change. You have to do this in a transaction. Having a separate count table avoids the full table scan of doing SELECT MAX().... In essence, what this does is re-implement a sequence, but it also guarantees you sequential, ordered use.
Without knowing your whole problem domain and data model, it is hard to say, but abstracting your counts out to a separate table like this where you don't have to do a select max to get the right value is probably a good idea. Assuming, of course, that a count is what you really care about. If you are just doing logging or something where you want to make sure things are unique, then use a sequence, and a timestamp to sort by.
Note that I'm saying not to sort by a sequence either. Basically, never trust a sequence to be anything other than unique. Because when you get to caching sequence values on a multi-node system, your application might even consume them out of order.
This is why you should use the Serial datatype, which defers the lookup of C to the time of insert (which uses table locks, I presume). You would then not specify C, but it would be generated automatically. If you need C for some intermediate calculation, you would need to save first, then read C and finally update with the derived values.
Edit: Sorry, I didn't read your whole question. What about solving your other problems with normalization? Just create a second table for each specific type (for each x where A='x'), where you have another auto increment. Manually edited sequences could be another column in the same table, which uses the generated sequence as a base (i.e if pk = 34 you can have another column mypk='34Changed').
You can create a sequential column by using a sequence as its default value:
First, you have to create the sequence counter:
So, you can use it as default value:
Now you don't need to worry about sequence on inserting rows:
INSERT INTO T (column1, column2) VALUES (value1, value2);
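The statements for the first two steps are not shown in this answer. A sketch of what they might look like follows, written in PostgreSQL syntax; the sequence name `seq_c`, the counter column `c` and the column types are assumptions.

```sql
-- Step 1: create the sequence (hypothetical name).
CREATE SEQUENCE seq_c;

-- Step 2: use it as the default of the counter column.
-- In Oracle 12c or later the default would be written seq_c.NEXTVAL.
CREATE TABLE t (
  c       INTEGER DEFAULT nextval('seq_c') PRIMARY KEY,
  column1 VARCHAR(50),
  column2 VARCHAR(50)
);

-- Step 3: inserts that omit c get the next sequence value automatically.
INSERT INTO t (column1, column2) VALUES ('a', 'b');
```

As the earlier answers point out, a sequence guarantees unique values, not gap-free ones, and a single sequence does not give independent counters per CODE group.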
