PostgreSQL: Create index on length of all table fields - performance

I have a table called profile, and I want to order the rows by which ones are the most filled out. Each of the columns is either a JSONB column or a TEXT column. I don't need this to a great degree of accuracy, so typically I've ordered as follows:
SELECT * FROM profile ORDER BY LENGTH(CONCAT(profile.*)) DESC;
However, this is slow, so I want to create an index. But this does not work:
CREATE INDEX index_name ON profile (LENGTH(CONCAT(*))
Nor does
CREATE INDEX index_name ON profile (LENGTH(CONCAT(CAST(* AS TEXT))))
Can't say I'm surprised. What is the right way to declare this index?

To measure the size of the row in text representation you can just cast the whole row to text, which is much faster than concatenating individual columns:
SELECT length(profile::text) FROM profile;
But there are 3 (or 4) issues with this expression in an index:
1. The syntax shorthand profile::text is not accepted in CREATE INDEX; you need to add extra parentheses or fall back to the standard syntax cast(profile AS text).
2. Still the same problem that @jjanes already discussed: only IMMUTABLE functions are allowed in index expressions, and casting a row type to text does not meet this requirement. You could build a fake IMMUTABLE wrapper function, like Jeff outlined.
3. There is an inherent ambiguity (which applies to Jeff's answer as well!): if you have a column with the same name as the table (a common case), you cannot reference the row type in CREATE INDEX, since the identifier always resolves to the column name first.
4. A minor difference from your original: this adds column separators, row decorators and possibly escape characters to the text representation. That shouldn't matter much for your use case.
However, I would suggest a more radical alternative as a crude indicator of row size: pg_column_size(). It's even shorter and faster, and it avoids issues 1, 3 and 4:
SELECT pg_column_size(profile) FROM profile;
Issue 2 remains, though: pg_column_size() is also only STABLE. You can create a simple and cheap SQL wrapper function:
CREATE OR REPLACE FUNCTION pg_column_size(profile)
RETURNS int LANGUAGE sql IMMUTABLE AS
'SELECT pg_catalog.pg_column_size($1)';
and then proceed like @jjanes outlined. More details:
Does PostgreSQL support "accent insensitive" collations?
Note that I created the function with the row type profile as parameter. Postgres allows function overloading, which is why we can use the same function name. Now, when we feed the matching row type to pg_column_size() our custom function matches more closely according to function type resolution rules and is picked instead of the polymorphic system function. Alternatively, use a separate name and possibly make the function polymorphic as well ...
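For completeness, a minimal sketch of how this could then be used (the index name is made up; issue 3 above still applies if a column shares the table's name):
CREATE INDEX profile_size_idx ON profile (pg_column_size(profile));
SELECT * FROM profile ORDER BY pg_column_size(profile) DESC;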
Related:
Is there a way to disable function overloading in Postgres

You can declare a function which is falsely marked "immutable" and build an index on that.
CREATE OR REPLACE FUNCTION len_immut(record)
RETURNS int
LANGUAGE plperl
IMMUTABLE
AS $function$
## This function lies about its immutability.
## Use it with care. It is useful for indexing
## entire table rows.
return length(join ",", values %{$_[0]});
$function$;
and then
create index on profile (len_immut(profile));
SELECT * FROM profile ORDER BY len_immut(profile) DESC;
Since the function is falsely labelled as immutable, the index may become out of date if you do things like add or drop columns on the table, or change the types of columns.

Related

Validate Oracle Column Names

In one scenario we are dynamically creating SQL to create temp tables on the fly. There is no issue with the table name, as it is decided by us; however, the column names are provided by sources not under our control.
Usually we would check the column names using the query below:
select ..
where NOT REGEXP_LIKE (Column_Name_String,'^([a-zA-Z])[a-zA-Z0-9_]*$')
OR Column_Name_String is NULL
OR Length(Column_Name_String) > 30
However, is there any built-in function which can do a more extensive check? Any input on the above query is welcome as well.
Thanks in advance.
Final query, based on the answers below:
select ..
where NOT REGEXP_LIKE (Column_Name_String,'^([a-zA-Z])[a-zA-Z0-9_]{0,29}$')
OR Column_Name_String is NULL
OR Upper(Column_Name_String) in (select Upper(RESERVED_WORDS.Keyword) from V$RESERVED_WORDS RESERVED_WORDS)
I'm particularly not happy with characters like $ in column names either, so I won't be using
dbms_assert.simple_sql_name('VALID_NAME')
Instead, with a regexp I can decide my own set of characters to allow.
This answer does not necessarily offer either a performance or logical improvement, but you can actually validate the column names using a single regex:
SELECT ...
WHERE NOT
REGEXP_LIKE (COALESCE(Column_Name_String, ''), '^([a-zA-Z])[a-zA-Z0-9_]{0,29}$')
This works because:
It uses the same pattern to match columns, i.e. starting with a letter and afterwards using only alphanumeric characters and underscore
NULL column names are mapped to empty string, which fails the regex
We use a length quantifier {0,29} to check the column length directly in the regex
" is there any build in function which can do a more extensive check."
Oracle has the DBMS_ASSERT.SIMPLE_SQL_NAME() function. This returns the passed name if it meets the Oracle naming rules ...
select dbms_assert.simple_sql_name('VALID_NAME') from dual;
... and hurls ORA-44003 if the name is invalid.
Valid names permit any characters if the name is double-quoted (yuck, but then so is creating temp tables on the fly). Also, the function doesn't check the length of the name, so you will still need to validate that yourself.
Find out more in the docs.
Also here is a SQL Fiddle.
"creating a table with comment column is not possible as its a invalid identifier"
Fair point. DBMS_ASSERT is primarily aimed at preventing SQL injection. So it verifies that a value conforms to Oracle's naming rules, not that the value is a sensible Oracle name. To catch things like comment you will also need to check the value against V$RESERVED_WORDS, probably where reserved != 'Y'. As this is a V$ view, select on it is not granted by default; if you don't have access you'll need to ask your friendly DBA to help out.
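For illustration, a minimal lookup sketch (the bind variable name is made up):
SELECT keyword, reserved
FROM v$reserved_words
WHERE upper(keyword) = upper(:candidate_name);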
" For validating column names I believe I should check with the entire list"
Up to you. The distinction is that some keywords can legitimately be used as identifiers. For instance TYPE only became a reserved word in Oracle version 8 when they introduced the object-relational stuff. But there were a lot of tables and views in existing systems which used 'TYPE' as a column name (not least the Oracle data dictionary). If Oracle had made TYPE a properly reserved word it would have broken all those systems. So the list of reserved words which cannot be used as identifiers is a sub-set of all the Oracle keywords.
Opinions on the general task:
"we are getting data from external sources (files) and the job of the program/script is to push that data to oracle tables."
There are two parts to this task.
The first is that you should have agreed a standard format for these files with the third parties. There should be no need to discover the files' structure or content. (Or, if there is such a need because the files are randomly sourced from a carousel of third parties, then you probably should not be using a relational database but something else: Endeca? The Python Pandas library?)
The second part is creating tables on the fly. If you have an agreed file structure then you should be loading into standard tables, using either SQL*Loader or external tables according to your circumstances. If you're on 12c, maybe SQL*Loader Express Mode could be of interest.
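As a rough sketch only (the directory object, table, column and file names here are all hypothetical), an external table lets you query a delimited file in place and then INSERT ... SELECT into the real target:
CREATE TABLE ext_incoming (
  id   NUMBER,
  name VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('incoming.csv')
);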

Will an index be used when UPPER() is applied to the variable first?

I may have encountered a full table scan in an Oracle database. I can't execute the EXPLAIN command in the database; simply put, I don't have the permission.
And I'm trying to figure out the following question.
If I have an index on NAME in table
With this query:
select OID
from table
where NAME=UPPER(v1)
and TYPE=v2
and PID=v3
and OID<>v4
and PID = v5
(v1 is a variable)
Will Oracle use the index on NAME to select OID?
I have read some material which says that with a function in the WHERE condition the NAME index won't be used. But upper() is a special function, so I'm not quite sure whether that material applies here.
And here is the second question, following the answer from @mathguy:
If I create an index using create index INDEX_NAME on table(upper(NAME));
will the query:
select OID,PID
from table
where PID=v1
and NAME=UPPER(v2)
use the index INDEX_NAME?
Or will the index be used in the case above, and the query is just not efficient, so it takes a long time to execute?
If you have an index on name, then the optimizer MAY use the index in the example you gave. It may choose not to use it (for example, if it estimates that a relatively large fraction of rows will be returned anyway); but if, say, only 0.1% of rows would be returned, the index will almost certainly be used. (If that still doesn't happen, make sure statistics are up to date.)
What will prevent the use of an index is if you wrapped name within upper(). What happens on the right-hand side - whether you have v1 or upper(v1) or even a much more complicated expression - is irrelevant as long as name doesn't also appear in that complicated expression on the right-hand side.
Perhaps this will help...
In Oracle, you can create an index on a function (a function-based index), so if you created your index on the expression UPPER(NAME) instead of just NAME, Oracle may be more likely to use the index (although it still might choose not to, depending on other factors).
Here's a link that describes function-based indexes.
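To tie this back to the second question, a sketch (using the hypothetical table name mytable): with an index on UPPER(NAME), the WHERE clause must reference the same expression UPPER(NAME) on the column side for the index to be a candidate; the predicate NAME = UPPER(v2) by itself will not use it.
CREATE INDEX index_name ON mytable (UPPER(name));

SELECT oid, pid
FROM mytable
WHERE pid = :v1
  AND UPPER(name) = UPPER(:v2);  -- matches the indexed expression UPPER(name)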

Query performance in PostgreSQL using 'similar to'

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:
select *
from tableName
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.
I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.
I've tried indexing it:
create index tableName_columnX_idx on tableName (columnX)
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.
Is there any better solution for this problem?
EDIT: Thanks everybody for the feedback. It looks like I've ended up at this point because of a design mistake (a topic I've posted about in a separate question).
If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:
CREATE INDEX
ix_tablename_columnxlist
ON tableName
USING GIN((REGEXP_SPLIT_TO_ARRAY(columnX, '')))
then search against the index:
SELECT *
FROM tableName
WHERE REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']
I agree with @Quassnoi, a GIN index is fastest and simplest - unless write performance or disk space are issues, because it occupies a lot of space and eats quite a bit of performance for INSERT, UPDATE and DELETE.
My additional answer is triggered by your statement:
I can't find a better approach than using similar to.
If that is what you found, then your search isn't over yet. SIMILAR TO is a complete waste of time. Literally. PostgreSQL only features it to comply with the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.
Internally, every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you if you are not sure. You won't find this in the manual; PostgreSQL does not promise to do it this way, but I have yet to see an exception.
More details in this related answer on dba.SE.
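As a hand-written illustration (not necessarily the exact pattern the planner produces), the filter from the question can be expressed as a plain regular expression match; since every alternative is a single character, a character class is enough:
SELECT *
FROM tableName
WHERE columnX ~ '[ABC123]';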
This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.
If so, I'd want to remodel this table to use one of the following approaches:
Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.
Convert columnX to an hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations; see the sketch after this list.
Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;
Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.
Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.
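In the meantime, a rough sketch of the hstore option (option 2 above), assuming the hstore extension is available; the added column and index names are made up, and NULL or empty values of columnX would need extra care in a real migration:
CREATE EXTENSION IF NOT EXISTS hstore;

ALTER TABLE tableName ADD COLUMN codes hstore;

-- one hstore key per character of columnX; the values are only dummies
UPDATE tableName
SET codes = hstore(regexp_split_to_array(columnX, ''),
                   regexp_split_to_array(columnX, ''));

CREATE INDEX tablename_codes_gist_idx ON tableName USING gist (codes);

SELECT *
FROM tableName
WHERE codes ?| ARRAY['A', 'B', 'C', '1', '2', '3'];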
I'll post this as an answer because it may guide other people in the future: why not have 6 columns, haveA, haveB, ..., have3, and do a 6-part OR query? Or use a bitmask?
If there are too many attributes to assign a column each, I might try creating an "attribute" table:
(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')
and let the DBMS worry about the optimization.
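A minimal sketch of how that lookup could read (the side-table name and the id key are hypothetical):
SELECT t.*
FROM tableName t
WHERE EXISTS (
    SELECT 1
    FROM tablename_attr a
    WHERE a.fkey = t.id
      AND a.attr IN ('A', 'B', 'C', '1', '2', '3')
);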

Sqlite view vs plain select statement performance

I have a simple table (with about 8 columns and a LOT of rows) in a SQLite database. There is a single program that runs as a service and performs selects, updates and inserts on the table quite often (approximately every 5 minutes). The selects are used only to determine which rows are to be updated, and they are based on a column that holds boolean values (probably translated to integer internally by SQLite).
There is also a web application that performs selects (always with a GROUP BY clause) whenever a web user wishes to view part of the data.
There are two ways to ask for data through the web application: (a) predefined filters (i.e. the where clause has specific conditions on 3 specific columns) and (b) custom filters (i.e. the user chooses the values for the conditions, but the columns participating in the where clause are the same as in (a)). As mentioned, in both cases there is a GROUP BY operation.
I am wondering whether using a view or a custom function might increase the performance. Currently, a "custom" select may take more than 30 seconds to complete - and that's before any data has been sent back to the user.
EDIT:
Using EXPLAIN QUERY PLAN on a "predefined" select statement yields only one row:
0|0|TABLE mytable
Using EXPLAIN on the same query, yields the following:
0|OpenVirtual|1|4|keyinfo(2,-BINARY,BINARY)
1|OpenVirtual|2|3|keyinfo(1,BINARY)
2|MemInt|0|5|
3|MemInt|0|4|
4|Goto|0|27|
5|MemInt|1|5|
6|Return|0|0|
7|IfMemPos|4|9|
8|Return|0|0|
9|AggFinal|0|0|count(0)
10|AggFinal|2|1|sum(1)
11|MemLoad|0|0|
12|MemLoad|1|0|
13|MemLoad|2|0|
14|MakeRecord|3|0|
15|MemLoad|0|0|
16|MemLoad|1|0|
17|Sequence|1|0|
18|Pull|3|0|
19|MakeRecord|4|0|
20|IdxInsert|1|0|
21|Return|0|0|
22|MemNull|1|0|
23|MemNull|3|0|
24|MemNull|0|0|
25|MemNull|2|0|
26|Return|0|0|
27|Gosub|0|22|
28|Goto|0|82|
29|Integer|0|0|
30|OpenRead|0|2|
31|SetNumColumns|0|9|
32|Rewind|0|48|
33|Column|0|8|
34|String8|0|0|123456789
35|Le|356|39|collseq(BINARY)
36|Column|0|3|
37|Integer|180|0|
38|Gt|100|42|collseq(BINARY)
39|Column|0|7|
40|Integer|1|0|
41|Ne|356|47|collseq(BINARY)
42|Column|0|6|
43|Sequence|2|0|
44|Column|0|3|
45|MakeRecord|3|0|
46|IdxInsert|2|0|
47|Next|0|33|
48|Close|0|0|
49|Sort|2|69|
50|Column|2|0|
51|MemStore|7|0|
52|MemLoad|6|0|
53|Eq|512|58|collseq(BINARY)
54|MemMove|6|7|
55|Gosub|0|7|
56|IfMemPos|5|69|
57|Gosub|0|22|
58|AggStep|0|0|count(0)
59|Column|2|2|
60|Integer|30|0|
61|Add|0|0|
62|ToReal|0|0|
63|AggStep|2|1|sum(1)
64|Column|2|0|
65|MemStore|1|1|
66|MemInt|1|4|
67|Next|2|50|
68|Gosub|0|7|
69|OpenPseudo|3|0|
70|SetNumColumns|3|3|
71|Sort|1|80|
72|Integer|1|0|
73|Column|1|3|
74|Insert|3|0|
75|Column|3|0|
76|Column|3|1|
77|Column|3|2|
78|Callback|3|0|
79|Next|1|72|
80|Close|3|0|
81|Halt|0|0|
82|Transaction|0|0|
83|VerifyCookie|0|1|
84|Goto|0|29|
85|Noop|0|0|
The select I used was the following:
SELECT
COUNT(*) as number,
field1,
SUM(CAST(filter2 +30 AS float)) as column2
FROM
mytable
WHERE
(filter1 > '123456789' AND filter2 > 180)
OR filter3=1
GROUP BY
field1
ORDER BY
number DESC, field1;
Whenever you're going to be doing comparisons on a non-primary-key field, it's a good design idea to add an index on the field(s). Too many, however, can cause INSERTs to crawl, so plan accordingly.
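For instance, a sketch only, reusing the column names from the query above (pick the combinations your real filters use):
CREATE INDEX idx_mytable_filter3 ON mytable (filter3);
CREATE INDEX idx_mytable_filter1_filter2 ON mytable (filter1, filter2);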
Also, if you have simple fields such as ones that only hold a boolean value, you may want to consider declaring it as an INTEGER instead of whatever you declared it as. Declaring it as any type not specifically defined by SQLite will cause it to default to a NUMERIC type which will take longer to compare values because it will store it internally as a double and will use the floating-point math processor instead of the integer math processor.
IMO, the GROUP BY sorting directive is sometimes a dead giveaway to an unoptimized query; its methodology involves eliminating redundant data which could have been eliminated beforehand if it hadn't been pulled out of the database to begin with.
EDIT:
I saw your query and saw there are some simple things you can do to optimize it:
SUM(CAST(filter2 + 30 AS float)) is inefficient; why are you casting it as a float? Why not just SUM it and then add 30 * COUNT(*)?
filter1 > '123456789' - Why the string comparison? Why not just use integer comparison?
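Putting both suggestions together, one possible rewrite (assuming filter1 is really numeric and filter2 is never NULL; otherwise the results will differ from the original):
SELECT
  COUNT(*) AS number,
  field1,
  SUM(filter2) + 30 * COUNT(*) AS column2
FROM
  mytable
WHERE
  (filter1 > 123456789 AND filter2 > 180)
  OR filter3 = 1
GROUP BY
  field1
ORDER BY
  number DESC, field1;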

Oracle runtime of comparing numbers versus comparing strings using a LIKE operator

My company's database has 20 different string formats for its primary product label. All 20 of them are stored in a separate look-up table:
1 are strings starting with 'W'
2 are strings starting with 'TAIC'
3 are strings starting with 'D'
...
Next to the label attribute is the 'type' attribute, which stores the number related to which prefix the label contains.
I'm tasked with updating one of our modules for better runtime. One of the queries I ran across deals with all labels containing 'TAIC' as the prefix. However, instead of comparing whether the type number is equal to 2, it runs a LIKE operation checking for each label that begins with TAIC.
Now, my question is this -- since my goal is better runtime, would it be wise to switch from the LIKE operator to a plain equality comparison against the type attribute? It seems that running a regular-expression-ish operation against a string would be a bit more time consuming, but is it enough to significantly alter the runtime of a system?
In Oracle, both these operations:
SELECT *
FROM mytable
WHERE pk LIKE 'TAIC%'
and
SELECT *
FROM mytable
WHERE type = 2
are sargable, that is, able to use an index on the appropriate fields.
The numeric index, however, would be more compact and hence require less time to traverse, so using numeric comparison could increase the query performance.
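A sketch of the supporting index for the numeric comparison (hypothetical name; if pk is the primary key it is already indexed, and LIKE 'TAIC%' has no leading wildcard, so it can use that existing index):
CREATE INDEX mytable_type_idx ON mytable (type);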
