Yesterday I wanted to add a boolean field to an Oracle table. However, there isn't actually a boolean data type in Oracle. Does anyone here know the best way to simulate a boolean? Googling the subject turned up several approaches:
Use an integer and just don't bother assigning anything other than 0 or 1 to it.
Use a char field with 'Y' or 'N' as the only two values.
Use an enum with a CHECK constraint.
Do experienced Oracle developers know which approach is preferred/canonical?
I found this link useful.
Here is the paragraph highlighting some of the pros/cons of each approach.
The most commonly seen design is to imitate the many Boolean-like
flags that Oracle's data dictionary views use, selecting 'Y' for true
and 'N' for false. However, to interact correctly with host
environments, such as JDBC, OCCI, and other programming environments,
it's better to select 0 for false and 1 for true so it can work
correctly with the getBoolean and setBoolean functions.
Basically they advocate method number 2, for efficiency's sake: use values of 0/1 (because of interoperability with JDBC's getBoolean() etc.) with a check constraint, and a type of CHAR (because it uses less space than NUMBER).
Their example:
create table tbool (bool char check (bool in (0,1)));
insert into tbool values(0);
insert into tbool values(1);
Oracle itself uses Y/N for Boolean values. For completeness it should be noted that PL/SQL has a BOOLEAN type; it is only tables that do not.
If you are using the field to indicate whether the record needs to be processed or not you might consider using Y and NULL as the values. This makes for a very small (read fast) index that takes very little space.
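A minimal sketch of that trick (table, column, and index names here are invented): Oracle B-tree indexes do not store entries whose indexed columns are all NULL, so only the 'Y' rows end up in the index.
alter table work_queue add (
  needs_processing char(1)
  constraint work_queue_np_chk check (needs_processing = 'Y')  -- 'Y' or NULL only
);
create index work_queue_np_idx on work_queue (needs_processing);
-- the index covers only the rows still waiting to be processed
select * from work_queue where needs_processing = 'Y';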
To use the least amount of space you should use a CHAR field constrained to 'Y' or 'N'. Oracle doesn't support BOOLEAN, BIT, or TINYINT data types, so CHAR's one byte is as small as you can get.
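A minimal sketch of that approach (table and column names are made up for illustration):
create table orders (
  id      number primary key,
  is_paid char(1) default 'N' not null
          constraint orders_is_paid_chk check (is_paid in ('Y', 'N'))
);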
The best option is 0 and 1 (as numbers - another answer suggests 0 and 1 as CHAR for space-efficiency but that's a bit too twisted for me), using NOT NULL and a check constraint to limit contents to those values. (If you need the column to be nullable, then it's not a boolean you're dealing with but an enumeration with three values...)
Advantages of 0/1:
Language independent. 'Y' and 'N' would be fine if everyone used it. But they don't. In France they use 'O' and 'N' (I have seen this with my own eyes). I haven't programmed in Finland to see whether they use 'E' and 'K' there - no doubt they're smarter than that, but you can't be sure.
Congruent with practice in widely-used programming languages (C, C++, Perl, Javascript)
Plays better with the application layer e.g. Hibernate
Leads to more succinct SQL: for example, to find out how many bananas are ready to eat, select sum(is_ripe) from bananas instead of select count(*) from bananas where is_ripe = 'Y', or even (yuk) select sum(case is_ripe when 'Y' then 1 else 0 end) from bananas
Advantages of 'Y'/'N':
Takes up less space than 0/1
It's what Oracle suggests, so might be what some people are more used to
Another poster suggested 'Y'/null for performance gains. If you've proven that you need the performance, then fair enough, but otherwise avoid since it makes querying less natural (some_column is null instead of some_column = 0) and in a left join you'll conflate falseness with nonexistent records.
Either 1/0 or Y/N with a check constraint on it. Either way is fine. I personally prefer 1/0 as I do a lot of work in Perl, and it makes it really easy to do Perl Boolean operations on database fields.
If you want a really in-depth discussion of this question with one of Oracle's head honchos, check out what Tom Kyte has to say about it here.
The database I did most of my work on used 'Y' / 'N' as booleans. With that implementation, you can pull off some tricks like:
Count rows that are true:
SELECT SUM(CASE WHEN BOOLEAN_FLAG = 'Y' THEN 1 ELSE 0 END) FROM X
When grouping rows, enforce "If one row is true, then all are true" logic:
SELECT MAX(BOOLEAN_FLAG) FROM Y
Conversely, use MIN to force the grouping false if one row is false.
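For instance, with a hypothetical grouping column group_id on table Y; because 'Y' sorts after 'N', MAX means "at least one row is true" and MIN means "every row is true":
SELECT group_id,
       MAX(BOOLEAN_FLAG) AS any_true,  -- 'Y' if at least one row is 'Y'
       MIN(BOOLEAN_FLAG) AS all_true   -- 'Y' only if every row is 'Y'
FROM   Y
GROUP BY group_id;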
A working example implementing the accepted answer by adding a "Boolean" column to an existing table in an Oracle database (using the NUMBER type):
ALTER TABLE my_table_name ADD (
my_new_boolean_column number(1) DEFAULT 0 NOT NULL
CONSTRAINT my_new_boolean_column CHECK (my_new_boolean_column in (1,0))
);
This creates a new column in my_table_name called my_new_boolean_column with a default value of 0. The column will not accept NULL values and restricts the accepted values to either 0 or 1.
In our databases we use an enum that ensures we pass it either TRUE or FALSE. If you do it either of the first two ways, it is too easy to start adding new meaning to the integer without going through a proper design, or to end up with that char field having Y, y, N, n, T, t, F, f values and having to remember which section of code uses which table and which version of true it is using.
Related
I have a column called "status" in PostgreSQL. At first it was "status_id" of type integer. The values were kept on the client, so there was no table on the server called statuses where I'd keep those statuses and then do an inner join with the first table.
I used to send the ids of the statuses from the client (they had the names on the client). However, at some point I realized I'd better make the server hold those statuses, not in a separate table but in the first one, and make them strings. So the initial table will have a status column of type string (varchar, to be more specific). I read it wouldn't be that slow.
In general, is it a good idea? I suppose it is because doing inner join (in case I'd keep statuses in the separate table) each time is expensive as well as sending ids from the client.
1) The only concern I have is that the column status should be of type char, not varchar. That should make it more efficient, I suppose. Is that so?
2) If the first case is correct, then I'm not sure I'll be able to name all the statuses using exactly the same number of characters, say 5. Some of them might be longer, some shorter. How can I solve this?
UPDATE:
It's not denormalization because I'm talking about one single table. There is not, and has never been, a second table called Statuses with the fields (id, status_name).
What I'm trying to convey is that I could use char(n) for status_name and also add an index on it. Then it should be fast enough. However, it might or might not be possible to name all the statuses with a certain number (n) of characters, and that's the only concern.
I don't think using char or varchar instead of integer is a good idea. It is hard to say how much slower it will be than an integer PK, but this design will be slower; the impact will be worse when you join larger tables. If you can, use ENUM types instead.
http://www.postgresql.org/docs/9.2/static/datatype-enum.html
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
name text,
current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
name | current_mood
------+--------------
Moe | happy
(1 row)
PostgreSQL varchar and char types are very similar. The internal implementation is the same; char can (paradoxically) be a little bit slower due to the padding with spaces.
I'd go one step further. Never use the outdated data type char(n), unless you know you have to (for compatibility or some rare exotic reason). The type is utterly useless in a modern database. Padding strings with blank characters is nonsense, and if you have to do it, you can do it in a cheaper fashion with rpad() on data retrieval.
SELECT rpad('short', 10) AS char_10_string;
varchar is basically the same as text and allows a length specifier: varchar(n). I generally use just text. If I need to limit the length, I use a CHECK constraint. Here's one example, why.
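A minimal sketch of that pattern (the table name and the length limit are made up):
CREATE TABLE orders (
  order_id serial PRIMARY KEY,
  status   text NOT NULL CHECK (length(status) <= 20)
);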
Whenever you can use a simple integer (or enum) instead, that's a bit smaller and faster in every respect. Consider #Pavel's answer for enum.
As for:
because doing inner join (...) each time is expensive
Well, it carries a small cost, but it's generally cheaper than redundantly saving text representation of the status instead of a much cheaper integer in the main table. That kind of rumor is spread by people having problems understanding the concept of database normalization. The enum type is a compromise here - for relatively static sets of values.
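For comparison, a sketch of the normalized layout being defended here (all names are invented); the join only touches the small lookup table through its primary key:
CREATE TABLE statuses (
  status_id integer PRIMARY KEY,
  name      text NOT NULL UNIQUE
);

CREATE TABLE orders (
  order_id  serial PRIMARY KEY,
  status_id integer NOT NULL REFERENCES statuses
);

SELECT o.order_id, s.name
FROM   orders o
JOIN   statuses s USING (status_id);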
I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:
select *
from tableName
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.
I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.
I've tried indexing it:
create index tableName_columnX_idx on tableName (columnX)
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.
Is there any better solution for this problem?
EDIT: Thanks everybody for the feedback. It looks like I've got to this point maybe because of a design mistake (a topic I've posted in a separate question).
If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:
CREATE INDEX
ix_tablename_columnxlist
ON tableName
USING GIN((REGEXP_SPLIT_TO_ARRAY(columnX, '')))
then search against the index:
SELECT *
FROM tableName
WHERE REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']
I agree with #Quassnoi, a GIN index is fastest and simplest - unless write performance or disk space are issues because it occupies a lot of space and eats quite a bit of performance for INSERT, UPDATE and DELETE.
My additional answer is triggered by your statement:
I can't find a better approach than using similar to.
If that is what you found, then your search isn't over yet. SIMILAR TO is a complete waste of time. Literally. PostgreSQL only features it to comply with the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.
Internally every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you, if you are not sure. You won't find this in the manual, PostgreSQL does not promise to do it this way, but I have yet to see an exception.
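For example, the predicate from the question can be written directly as a regular expression match; the two forms should behave the same here, since both just test whether the string contains one of the listed characters:
SELECT *
FROM   tableName
WHERE  columnX ~ '(A|B|C|1|2|3)';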
More details in this related answer on dba.SE.
This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.
If so, I'd want to remodel this table to use one of the following approaches:
Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.
Convert columnX to a hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations (see the sketch below).
Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;
Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.
Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.
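In the meantime, here is a rough sketch of the hstore approach, assuming columnX has already been converted to an hstore column named columnx_set whose keys are the one-character codes:
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE INDEX tablename_columnx_set_gist
    ON tableName USING gist (columnx_set);

-- "does the set contain any of these keys?"
SELECT *
FROM   tableName
WHERE  columnx_set ?| ARRAY['A', 'B', 'C', '1', '2', '3'];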
I'll post this as an answer because it may guide other people in the future: why not have 6 columns, haveA through have3, and do a 6-part OR query? Or use a bitmask?
If there are too many attributes to assign a column each, I might try creating an "attribute" table:
(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')
and let the DBMS worry about the optimization.
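A sketch of that attribute table (names are invented; it assumes tableName has an integer id primary key):
CREATE TABLE tablename_attrs (
  fkey integer NOT NULL REFERENCES tableName (id),
  attr text NOT NULL,
  PRIMARY KEY (fkey, attr)
);

INSERT INTO tablename_attrs (fkey, attr)
VALUES (1, 'A'), (1, 'B'), (2, '3');

-- rows carrying any of the wanted codes
SELECT DISTINCT fkey
FROM   tablename_attrs
WHERE  attr IN ('A', 'B', 'C', '1', '2', '3');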
I have a simple table (with about 8 columns and a LOT of rows) in a SQLite database. There is a single program that runs as a service and performs selects, updates and inserts on the table quite often (approximately every 5 minutes). The selects are used only to determine which rows are to be updated, and they are based on a column that holds boolean values (probably translated to integer internally by SQLite).
There is also a web application that performs selects (always with a GROUP BY clause) whenever a web user wishes to view part of the data.
There are two ways to ask for data through the web application: (a) predefined filters (i.e. the where clause has specific conditions on 3 specific columns) and (b) custom filters (i.e. the user chooses the values for the conditions, but the columns participating in the where clause are the same as in (a)). As mentioned, in both cases there is a GROUP BY operation.
I am wondering whether using a view or a custom function might increase the performance. Currently, a "custom" select may take more than 30 seconds to complete - and that's before any data has been sent back to the user.
EDIT:
Using EXPLAIN QUERY PLAN on a "predefined" select statement yields only one row:
0|0|TABLE mytable
Using EXPLAIN on the same query yields the following:
0|OpenVirtual|1|4|keyinfo(2,-BINARY,BINARY)
1|OpenVirtual|2|3|keyinfo(1,BINARY)
2|MemInt|0|5|
3|MemInt|0|4|
4|Goto|0|27|
5|MemInt|1|5|
6|Return|0|0|
7|IfMemPos|4|9|
8|Return|0|0|
9|AggFinal|0|0|count(0)
10|AggFinal|2|1|sum(1)
11|MemLoad|0|0|
12|MemLoad|1|0|
13|MemLoad|2|0|
14|MakeRecord|3|0|
15|MemLoad|0|0|
16|MemLoad|1|0|
17|Sequence|1|0|
18|Pull|3|0|
19|MakeRecord|4|0|
20|IdxInsert|1|0|
21|Return|0|0|
22|MemNull|1|0|
23|MemNull|3|0|
24|MemNull|0|0|
25|MemNull|2|0|
26|Return|0|0|
27|Gosub|0|22|
28|Goto|0|82|
29|Integer|0|0|
30|OpenRead|0|2|
31|SetNumColumns|0|9|
32|Rewind|0|48|
33|Column|0|8|
34|String8|0|0|123456789
35|Le|356|39|collseq(BINARY)
36|Column|0|3|
37|Integer|180|0|
38|Gt|100|42|collseq(BINARY)
39|Column|0|7|
40|Integer|1|0|
41|Ne|356|47|collseq(BINARY)
42|Column|0|6|
43|Sequence|2|0|
44|Column|0|3|
45|MakeRecord|3|0|
46|IdxInsert|2|0|
47|Next|0|33|
48|Close|0|0|
49|Sort|2|69|
50|Column|2|0|
51|MemStore|7|0|
52|MemLoad|6|0|
53|Eq|512|58|collseq(BINARY)
54|MemMove|6|7|
55|Gosub|0|7|
56|IfMemPos|5|69|
57|Gosub|0|22|
58|AggStep|0|0|count(0)
59|Column|2|2|
60|Integer|30|0|
61|Add|0|0|
62|ToReal|0|0|
63|AggStep|2|1|sum(1)
64|Column|2|0|
65|MemStore|1|1|
66|MemInt|1|4|
67|Next|2|50|
68|Gosub|0|7|
69|OpenPseudo|3|0|
70|SetNumColumns|3|3|
71|Sort|1|80|
72|Integer|1|0|
73|Column|1|3|
74|Insert|3|0|
75|Column|3|0|
76|Column|3|1|
77|Column|3|2|
78|Callback|3|0|
79|Next|1|72|
80|Close|3|0|
81|Halt|0|0|
82|Transaction|0|0|
83|VerifyCookie|0|1|
84|Goto|0|29|
85|Noop|0|0|
The select I used was the following:
SELECT
COUNT(*) as number,
field1,
SUM(CAST(filter2 +30 AS float)) as column2
FROM
mytable
WHERE
(filter1 > '123456789' AND filter2 > 180)
OR filter3=1
GROUP BY
field1
ORDER BY
number DESC, field1;
Whenever you're going to be doing comparisons on a non-primary-key field, it's a good design idea to add an index on the field(s). Too many, however, can cause INSERTs to crawl, so plan accordingly.
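For example (table, column, and index names here are hypothetical):
-- index the flag column the service filters on, and any other columns
-- that regularly appear in WHERE clauses
CREATE INDEX mytable_isprocessed_idx ON mytable (is_processed);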
Also, if you have simple fields such as ones that only hold a boolean value, you may want to consider declaring it as an INTEGER instead of whatever you declared it as. Declaring it as any type not specifically defined by SQLite will cause it to default to a NUMERIC type which will take longer to compare values because it will store it internally as a double and will use the floating-point math processor instead of the integer math processor.
IMO, the GROUP BY sorting directive is sometimes a dead giveaway to an unoptimized query; its methodology involves eliminating redundant data which could have been eliminated beforehand if it hadn't been pulled out of the database to begin with.
EDIT:
I saw your query and saw there are some simple things you can do to optimize it:
SUM(CAST(filter2 +30 AS float)) is inefficient; why are you casting it as a float? Why not just SUM it then add 30 * the COUNT?
filter1 > '123456789' - Why the string comparison? Why not just use integer comparison?
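Putting both suggestions together, the query might look like this (a sketch; it assumes filter2 is never NULL, so SUM(filter2) + 30 * COUNT(*) gives the same total, and that filter1 actually holds numeric data):
SELECT
    COUNT(*) AS number,
    field1,
    SUM(filter2) + 30 * COUNT(*) AS column2
FROM
    mytable
WHERE
    (filter1 > 123456789 AND filter2 > 180)
    OR filter3 = 1
GROUP BY
    field1
ORDER BY
    number DESC, field1;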
My company's database has 20 different string formats for its primary product label. All 20 of them are stored in a separate look-up table:
Type 1: strings starting with 'W'
Type 2: strings starting with 'TAIC'
Type 3: strings starting with 'D'
...
Next to the label attribute is the 'type' attribute, which stores the number related to which prefix the label contains.
I'm tasked with updating one of our modules for better runtime. One of the queries I ran across deals with all labels containing 'TAIC' as the prefix. However, instead of comparing whether the type number is equal to 2, it runs a LIKE operation checking for each label that begins with TAIC.
Now, my question is this -- since my goal is better run time, would it be wise to switch from the LIKE operator to a regular equality operation against the type attribute? It seems that running a regular-expression-ish operation against a string would be a bit more time consuming, but is it enough to significantly alter the run time of the system?
In Oracle, both these operations:
SELECT *
FROM mytable
WHERE pk LIKE 'TAIC%'
and
SELECT *
FROM mytable
WHERE type = 2
are sargable, that is, able to use an index on the appropriate fields.
The numeric index, however, would be more compact and hence require less time to traverse, so using numeric comparison could increase the query performance.
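For example, if the type column is not already indexed (the index name is made up):
CREATE INDEX mytable_type_idx ON mytable (type);  -- small numeric keys, shallow B-tree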
The Situation
I have some trouble with my query execution plan for a medium-sized query over a large amount of data in Oracle 11.2.0.2.0. In order to speed things up, I introduced a range filter that does roughly something like this:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
-- [...]
As you can see, I want to restrict the JOIN of organisations using an optional range of organisation numbers. Client code can call DO_STUFF with (supposed to be fast) or without (very slow) the restriction.
The Trouble
The trouble is, PL/SQL will create bind variables for the above org_from and org_to parameters, which is what I would expect in most cases:
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((:B1 IS NULL) OR (:B1 <= org.no))
AND ((:B2 IS NULL) OR (:B2 >= org.no)))
-- [...]
The Workaround
Only in this case, I measured the query execution plan to be a lot better when I just inline the values, i.e. when the query executed by Oracle is actually something like
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
-- [...]
By "a lot", I mean 5-10x faster. Note that the query is executed very rarely, i.e. once a month. So I don't need to cache the execution plan.
My questions
How can I inline values in PL/SQL? I know about EXECUTE IMMEDIATE, but I would prefer to have PL/SQL compile my query, and not do string concatenation.
Did I just measure something that happened by coincidence or can I assume that inlining variables is indeed better (in this case)? The reason why I ask is because I think that bind variables force Oracle to devise a general execution plan, whereas inlined values would allow for analysing very specific column and index statistics. So I can imagine that this is not just a coincidence.
Am I missing something? Maybe there is an entirely other way to achieve query execution plan improvement, other than variable inlining (note I have tried quite a few hints as well but I'm not an expert on that field)?
In one of your comments you said:
"Also I checked various bind values.
With bind variables I get some FULL
TABLE SCANS, whereas with hard-coded
values, the plan looks a lot better."
There are two paths. If you pass in NULL for the parameters then you are selecting all records. Under those circumstances a Full Table Scan is the most efficient way of retrieving data. If you pass in values then indexed reads may be more efficient, because you're only selecting a small subset of the information.
When you formulate the query using bind variables the optimizer has to take a decision: should it presume that most of the time you'll pass in values or that you'll pass in nulls? Difficult. So look at it another way: is it more inefficient to do a full table scan when you only need to select a sub-set of records, or to do indexed reads when you need to select all records?
It seems as though the optimizer has plumped for full table scans as being the least inefficient operation to cover all eventualities.
Whereas when you hard code the values the Optimizer knows immediately that 10 IS NULL evaluates to FALSE, and so it can weigh the merits of using indexed reads to find the desired sub-set of records.
So, what to do? As you say this query is only run once a month, I think it would only require a small change to business processes to have separate queries: one for all organisations and one for a sub-set of organisations.
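A rough sketch of that split (the cursor names, the customers table name, and the selected column are invented, since the real procedure body is elided in the question); each branch is a statically compiled query and gets its own plan:
PROCEDURE DO_STUFF(
    org_from VARCHAR2 := NULL,
    org_to   VARCHAR2 := NULL)
IS
  CURSOR all_orgs IS
    SELECT cust.id
    FROM   customers cust
    JOIN   organisations org ON cust.org_id = org.id;
  CURSOR some_orgs IS
    SELECT cust.id
    FROM   customers cust
    JOIN   organisations org ON cust.org_id = org.id
    WHERE  org.no BETWEEN org_from AND org_to;
BEGIN
  IF org_from IS NULL AND org_to IS NULL THEN
    FOR r IN all_orgs LOOP
      NULL;  -- process r
    END LOOP;
  ELSE
    FOR r IN some_orgs LOOP
      NULL;  -- process r
    END LOOP;
  END IF;
END DO_STUFF;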
"Btw, removing the :R1 IS NULL clause
doesn't change the execution plan
much, which leaves me with the other
side of the OR condition, :R1 <=
org.no where NULL wouldn't make sense
anyway, as org.no is NOT NULL"
Okay, so the thing is you have a pair of bind variables which specify a range. Depending on the distribution of values, different ranges might suit different execution plans. That is, this range would (probably) suit an indexed range scan...
WHERE org.id BETWEEN 10 AND 11
...whereas this is likely to be more fitted to a full table scan...
WHERE org.id BETWEEN 10 AND 1199999
(depending on distribution of values, of course).
That is where Bind Variable Peeking comes into play.
Since the query plans are actually consistently different, that implies that the optimizer's cardinality estimates are off for some reason. Can you confirm from the query plans that the optimizer expects the conditions to be insufficiently selective when bind variables are used? Since you're using 11.2, Oracle should be using adaptive cursor sharing, so it shouldn't be a bind variable peeking issue (assuming you are calling the version with bind variables many times with different NO values in your testing).
Are the cardinality estimates on the good plan actually correct? I know you said that the statistics on the NO column are accurate but I would be suspicious of a stray histogram that may not be updated by your regular statistics gathering process, for example.
You could always use a hint in the query to force a particular index to be used (though using a stored outline or optimizer plan stability would be preferable from a long-term maintenance perspective). Any of those options would be preferable to resorting to dynamic SQL.
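For instance, a hint forcing an index on organisations.no might look like this (the index name and selected columns are invented):
SELECT /*+ INDEX(org organisations_no_idx) */
       cust.id, org.no
FROM   customers cust
JOIN   organisations org ON cust.org_id = org.id
WHERE  org.no BETWEEN :org_from AND :org_to;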
One additional test to try, however, would be to replace the SQL 99 join syntax with Oracle's old syntax, i.e.
SELECT <<something>>
FROM <<some other table>> cust,
organization org
WHERE cust.org_id = org.id
AND ( ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
That obviously shouldn't change anything, but there have been parser issues with the SQL 99 syntax so that's something to check.
It smells like Bind Peeking, but I am only on Oracle 10, so I can't claim the same issue exists in 11.
This looks a lot like a need for Adaptive Cursor Sharing, combined with SQL plan stability.
I think what is happening is that the optimizer_capture_sql_plan_baselines parameter is true, and the same for optimizer_use_sql_plan_baselines. If this is the case, the following happens:
The first time a query is parsed, it gets a new plan.
The second time, this plan is stored in the sql_plan_baselines as an accepted plan.
All following runs of this query use this plan, regardless of what the bind variables are.
If Adaptive Cursor Sharing is already active, the optimizer will generate a new/better plan and store it in the SQL plan baselines, but it is not able to use it until someone accepts this newer plan as an acceptable alternative plan. Check dba_sql_plan_baselines and see if your query has entries with accepted = 'NO' and verified = NULL.
You can use dbms_spm.evolve to evolve the new plan and have it automatically accepted if the performance of the plan is at least 1.5 times better than without the new plan.
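A sketch of that call (the SQL handle is a placeholder you would look up in dba_sql_plan_baselines first):
DECLARE
  report CLOB;
BEGIN
  report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
              sql_handle => 'SQL_1234567890abcdef',  -- placeholder handle
              verify     => 'YES',   -- only accept the plan if it really performs better
              commit     => 'YES');  -- mark the verified plan as ACCEPTED
  DBMS_OUTPUT.PUT_LINE(report);
END;
/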
I hope this helps.
I added this as a comment, but will offer up here as well. Hope this isn't overly simplistic, and looking at the detailed responses I may be misunderstanding the exact problem, but anyway...
It seems your organisations table has a column no (org.no) that is defined as a NUMBER. In your hardcoded example, you use numbers to do the comparisons.
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
In your procedure, you are passing in varchar2:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
So to compare VARCHAR2 to NUMBER, Oracle will have to do implicit conversions, and this may cause the full scans.
Solution: change the procedure to pass in numbers.
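i.e. something like (assuming org.no really is a NUMBER):
PROCEDURE DO_STUFF(
    org_from NUMBER := NULL,
    org_to   NUMBER := NULL)
-- [...]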