Query performance in PostgreSQL using 'similar to' - performance

I need to retrieve certain rows from a table depending on certain values in a specific column, named columnX in the example:
select *
from tableName
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
So if columnX contains at least one of the values specified (A, B, C, 1, 2, 3), I will keep the row.
I can't find a better approach than using similar to. The problem is that the query takes too long for a table with more than a million rows.
I've tried indexing it:
create index tableName_columnX_idx on tableName (columnX)
where columnX similar to ('%A%|%B%|%C%|%1%|%2%|%3%')
However, if the condition is variable (the values could be other than A, B, C, 1, 2, 3), I would need a different index for each condition.
Is there any better solution for this problem?
EDIT: Thanks everybody for the feedback. It looks like I ended up at this point because of a design mistake (a topic I've posted in a separate question).

If you are only going to search lists of one-character values, then split each string into an array of characters and index the array:
CREATE INDEX
ix_tablename_columnxlist
ON tableName
USING GIN((REGEXP_SPLIT_TO_ARRAY(columnX, '')))
then search against the index:
SELECT *
FROM tableName
WHERE REGEXP_SPLIT_TO_ARRAY(columnX, '') && ARRAY['A', 'B', 'C', '1', '2', '3']
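To make the semantics of that query concrete, here is a minimal Python sketch of what the overlap test does for a single row. It only illustrates the logic; it says nothing about how PostgreSQL evaluates the GIN index, and the helper name overlaps and the sample rows are made up for the example.

```python
def overlaps(column_value, wanted):
    """Return True if column_value contains at least one wanted character."""
    # Splitting the string into characters mirrors
    # REGEXP_SPLIT_TO_ARRAY(columnX, ''); the set intersection mirrors
    # the && (array overlap) operator.
    return bool(set(column_value) & set(wanted))


# Hypothetical sample values for columnX.
rows = ["xAy", "zzz", "q3", ""]
kept = [r for r in rows if overlaps(r, ["A", "B", "C", "1", "2", "3"])]
print(kept)
```

A row is kept as soon as one character matches, which is the same condition the original SIMILAR TO pattern expresses.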

I agree with @Quassnoi: a GIN index is the fastest and simplest option, unless write performance or disk space is an issue, because it occupies a lot of space and adds a noticeable cost to INSERT, UPDATE and DELETE.
My additional answer is triggered by your statement:
I can't find a better approach than using similar to.
If that is what you found, then your search isn't over yet. SIMILAR TO is a complete waste of time. Literally. PostgreSQL only features it to comply with the (weird) SQL standard. Inspect the output of EXPLAIN ANALYZE for your query and you will find that SIMILAR TO has been replaced by a regular expression.
Internally every SIMILAR TO expression is rewritten to a regular expression. Consequently, for each and every SIMILAR TO expression there is at least one regular expression match that is a bit faster. Let EXPLAIN ANALYZE translate it for you, if you are not sure. You won't find this in the manual, PostgreSQL does not promise to do it this way, but I have yet to see an exception.
More details in this related answer on dba.SE.
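For the pattern in the question, the regular expression can be written by hand: "contains any of A, B, C, 1, 2, 3" is a single character class, which in PostgreSQL would be the predicate columnX ~ '[ABC123]'. The Python sketch below only demonstrates that this one class matches the same strings as the six-way alternation; the names CONTAINS_ANY and matches are invented for the example.

```python
import re

# One character class stands in for
#   columnX SIMILAR TO '%A%|%B%|%C%|%1%|%2%|%3%'
# An unanchored search already means "anywhere in the string",
# so no leading or trailing wildcard is needed.
CONTAINS_ANY = re.compile(r"[ABC123]")


def matches(value):
    """Return True if value contains at least one of A, B, C, 1, 2, 3."""
    return CONTAINS_ANY.search(value) is not None


for sample in ["xAy", "zzz", "q3", "abc"]:
    print(sample, matches(sample))
```

Note that the match is case-sensitive, as it is for SIMILAR TO and the ~ operator.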

This strikes me as a data modelling issue. You appear to be using a text field as a set, storing single character codes to identify values present in the set.
If so, I'd want to remodel this table to use one of the following approaches:
Standard relational normalization. Drop columnX, and replace it with a new table with a foreign key reference to tableName(id) and a charcode column that contains one character from the old columnX per row, like CREATE TABLE tablename_columnx_set(tablename_id integer not null references tablename(id), charcode "char", primary key (tablename_id, charcode)). You can then fairly efficiently search for keys in columnX using normal SQL subqueries, joins, etc. If your application can't cope with that change you could always keep columnX and maintain the side table using triggers.
Convert columnX to a hstore of keys with a dummy value. You can then use hstore operators like columnX ?| ARRAY['A','B','C']. A GiST index on the hstore of columnX should provide fairly solid performance for those operations.
Split to an array as recommended by Quassnoi if your table change rate is low and you can pay the costs of the GIN index;
Convert columnX to an array of integers, use intarray and the intarray GiST index. Have a mapping table of codes to integers or convert in the application.
Time permitting I'll follow up with demos of each. Making up the dummy data is a pain, so it'll depend on what else is going on.

I'll post this as an answer because it may guide other people in the future: Why not have 6 columns, haveA, haveB ~ have3 and do a 6-part OR query? Or use a bitmask?
If there are too many attributes to assign a column each, I might try creating an "attribute" table:
(fkey, attr) VALUES (1, 'A'), (1, 'B'), (2, '3')
and let the DBMS worry about the optimization.
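As a rough illustration of the attribute table idea, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The table name item_attr, the index name and the helper keys_with_any are assumptions made for the example, and SQLite is only used because it runs without a server; the same schema and query shape should carry over to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item_attr (
        fkey INTEGER NOT NULL,   -- id of the row in the main table
        attr TEXT    NOT NULL,   -- one attribute code per row
        PRIMARY KEY (fkey, attr)
    );
    -- Index on attr so that lookups by attribute do not scan the table.
    CREATE INDEX item_attr_attr_idx ON item_attr (attr);
    INSERT INTO item_attr (fkey, attr)
    VALUES (1, 'A'), (1, 'B'), (2, '3'), (3, 'Z');
""")


def keys_with_any(conn, attrs):
    """Return the fkeys that have at least one of the given attributes."""
    placeholders = ", ".join("?" for _ in attrs)
    sql = (
        "SELECT DISTINCT fkey FROM item_attr "
        f"WHERE attr IN ({placeholders}) ORDER BY fkey"
    )
    return [row[0] for row in conn.execute(sql, list(attrs))]


print(keys_with_any(conn, ["A", "B", "C", "1", "2", "3"]))
```

Because the list of wanted attributes is just a query parameter, a different set of values needs no new index.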

Related

will index be used when UPPER() the variable first?

I may have encountered a full table scan in an Oracle database. I can't execute the explain command in the database; simply put, I don't have the permission.
And I'm trying to figure out the following question.
If I have an index on NAME in table
With this query:
select OID
from table
where NAME=UPPER(v1)
and TYPE=v2
and PID=v3
and OID<>v4
and PID=v5
(v1 is a variable)
Will Oracle use the index on NAME to select OID?
I have read some material that says the NAME index won't be used when there is a function in the where condition. But upper() is a special function, so I'm not quite sure about the material I saw before.
And here is the second question after the answer of @mathguy:
If I create an index using create index INDEX_NAME on table(upper(NAME));
will the query:
select OID,PID
from table
where PID=v1
and NAME=UPPER(v2)
use the index INDEX_NAME?
Or is the index already being used in the first query above, and the query is simply inefficient, so it takes a long time to execute?
If you have an index on name, then the optimizer MAY use the index in the example you gave. It may choose not to use it (for example if it estimates that a relatively large fraction of rows will be returned anyway); but if say only 0.1% of rows would be returned, by all means the index will be used. (If that still doesn't happen, make sure statistics are up-to-date.)
What will prevent the use of an index is if you wrapped name within upper(). What happens on the right-hand side - whether you have v1 or upper(v1) or even a much more complicated expression - is irrelevant as long as name doesn't also appear in that complicated expression on the right-hand side.
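The left-hand versus right-hand distinction can be observed in any database that exposes its plan. Since the question is about Oracle and no EXPLAIN access is available there, the sketch below uses SQLite (via Python's sqlite3 module) purely as a stand-in. Its planner is not Oracle's, and the table t, the index t_name_idx and the helper plan are invented for the example, so treat it as an illustration of the principle rather than proof of Oracle's behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, pid INTEGER);
    CREATE INDEX t_name_idx ON t (name);
""")


def plan(sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Function applied to the bound value only: the column is left bare,
# so an index lookup on name is possible.
plain_column = plan("SELECT id FROM t WHERE name = upper(?)", ("smith",))

# Function wrapped around the indexed column: the index on name cannot
# be used for a lookup, so every row has to be visited.
wrapped_column = plan("SELECT id FROM t WHERE upper(name) = ?", ("SMITH",))

print(plain_column)
print(wrapped_column)
```

In SQLite's plan output, a line starting with SEARCH means an index lookup and a line starting with SCAN means every row is visited.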
Perhaps this will help...
In Oracle, you can create an index on a function (a function index), so if you created your index on the function UPPER(NAME) instead of just NAME, Oracle may be more likely to use the index (although it still might choose not to depending on other factors.)
Here's a link that describes function indexes

Best datatype to store postal codes in oracle

I'm new to Oracle, I'm using oracle 11g. I'm storing postal codes of UK. Values are like these.
N22 5HF
SW1 4JD
N14 8IT
N22 1JT
E1 5DP
e1 8DS
E3 8TU
I should be able to easily compare first four characters of each postal code.
What is the best data type to store these data ?
As a slight variation on Lalit's answer, since you want the outward code rather than a fixed substring of the first four characters (which could include a space and the start of the inward code), you can create a virtual column based on the first word of the value:
postcode varchar2(8),
outward_code generated always as
(substr(postcode, 1, instr(postcode, ' ', 1, 1) - 1))
And optionally, but probably if you're using this to search, an index on the virtual column.
This assumes the post codes are formatted properly in the first place. It won't work if you don't always have the space between the outward and inward codes. And to answer your original question, the actual post code should be a varchar2(8) column to hold alphanumeric values up to the maximum size and with the standard format.
SQL Fiddle demo.
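For anyone who wants to check the substr/instr logic outside the database, here is a minimal Python equivalent of the virtual column expression. The function name outward_code is made up for the example; returning None when there is no space mirrors the NULL that the Oracle expression produces in that case.

```python
def outward_code(postcode):
    """Return the part of the postcode before the first space.

    Returns None when there is no space (or the value starts with one),
    which corresponds to substr() being asked for a length below 1.
    """
    space_at = postcode.find(" ")
    if space_at <= 0:
        return None
    return postcode[:space_at]


for code in ["N22 5HF", "SW1 4JD", "E1 5DP", "N225HF"]:
    print(code, "->", outward_code(code))
```

The last sample shows the limitation mentioned above: without the separating space there is nothing to split on.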
I should be able to easily compare first four characters of each postal code.
Then keep these first four characters in a separate column. And index this column. You could keep the other characters in different column. Now, if the codes are a mixture of alphanumeric characters, then you are left with VARCHAR2 data type.
Your query predicate would look like -
WHERE post_code_col = substr('N22 5HF', 1, 4)
Thus the indexed column post_code_col would be efficient in performance.
On 11g, you have the option to create a virtual column. However, indexing it would be equivalent to a function-based index. So I would prefer the first way I suggested above.
It is better to normalize the table during the design phase, else the issues would start creeping in later.
In my opinion you should use the varchar2 data type, because this field is not going to be used in mathematical calculations (so it should not be int or decimal) and the values are not big enough to need a text type.

Full-text search in Postgres or CouchDB?

I took geonames.org and imported all their data of German cities with all districts.
If I enter "Hamburg", it lists "Hamburg Center, Hamburg Airport" and so on. The application is in a closed network with no access to the internet, so I can't access the geonames.org web services and have to import the data. :(
The city with all of its districts works as an auto complete. So each key hit results in an XHR request and so on.
Now my customer asked whether it is possible to have all data of the world in it. Finally, about 5.000.000 rows with 45.000.000 alternative names etc.
Postgres needs about 3 seconds per query which makes the auto complete unusable.
Now I thought of CouchDb, have already worked with it. My question:
I would like to post "Ham" and I want CouchDB to get all documents starting with "Ham". If I enter "Hamburg" I want it to return Hamburg and so forth.
Is CouchDB the right database for it? Which other DBs can you recommend that respond with low latency (may be in-memory) and millions of datasets? The dataset doesn't change regularly, it's rather static!
If I understand your problem right, probably all you need is already built in the CouchDB.
To get a range of documents with names beginning with e.g. "Ham". You may use a request with a string range: startkey="Ham"&endkey="Ham\ufff0"
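The reason the startkey/endkey trick works is that view keys are kept in sorted order, so every key with a given prefix sits in one contiguous range. Here is a small Python sketch of the same idea on a sorted list. It uses plain code point ordering, whereas CouchDB collates strings with its own rules, so this is only an approximation of the behaviour; the function name prefix_range and the sample names are made up.

```python
import bisect

# High sentinel character appended to the prefix, as in endkey="Ham\ufff0".
HIGH = "\ufff0"


def prefix_range(sorted_keys, prefix):
    """Return all keys in sorted_keys that start with prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)           # like startkey
    hi = bisect.bisect_right(sorted_keys, prefix + HIGH)   # like endkey
    return sorted_keys[lo:hi]


names = sorted(
    ["Hamburg", "Hamburg Airport", "Hamm", "Hannover", "Berlin", "Halle"]
)
print(prefix_range(names, "Ham"))
print(prefix_range(names, "Hamburg"))
```

Both lookups are binary searches, so the cost depends on the number of matches rather than on the total number of keys.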
If you need a more comprehensive search, you may create a view containing names of other places as keys. So you again can query ranges using the technique above.
Here is a view function to make this:
function(doc) {
for (var name in doc.places) {
emit(name, doc._id);
}
}
Also see the CouchOne blog post about CouchDB typeahead and autocomplete search and this discussion on the mailing list about CouchDB autocomplete.
Optimized search with PostgreSQL
Your search is anchored at the start and no fuzzy search logic is required. This is not the typical use case for full text search.
If it gets more fuzzy or your search is not anchored at the start, look here for more:
Similar UTF-8 strings for autocomplete field
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
In PostgreSQL you can make use of advanced index features that should make the query very fast. In particular look at operator classes and indexes on expressions.
1) text_pattern_ops
Assuming your column is of type text, you would use a special index for text pattern operators like this:
CREATE INDEX name_text_pattern_ops_idx
ON tbl (name text_pattern_ops);
SELECT name
FROM tbl
WHERE name ~~ ('Hambu' || '%');
This is assuming that you operate with a database locale other than C - most likely de_DE.UTF-8 in your case. You could also set up a database with locale 'C'. I quote the manual here:
If you do use the C locale, you do not need the xxx_pattern_ops
operator classes, because an index with the default operator class is
usable for pattern-matching queries in the C locale.
2) Index on expression
I'd imagine you would also want to make that search case insensitive. so let's take another step and make that an index on an expression:
CREATE INDEX lower_name_text_pattern_ops_idx
ON tbl (lower(name) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(name) ~~ (lower('Hambu') || '%');
To make use of the index, the WHERE clause has to match the index expression.
3) Optimize index size and speed
Finally, you might also want to impose a limit on the number of leading characters to minimize the size of your index and speed things up even further:
CREATE INDEX lower_left_name_text_pattern_ops_idx
ON tbl (lower(left(name,10)) text_pattern_ops);
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%');
left() was introduced with Postgres 9.1. Use substring(name, 1,10) in older versions.
4) Cover all possible requests
What about strings with more than 10 characters?
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower(left('Hambu678910',10)) || '%')
AND lower(name) ~~ (lower('Hambu678910') || '%');
This looks redundant, but you need to spell it out this way to actually use the index. Index search will narrow it down to a few entries, the additional clause filters the rest. Experiment to find the sweet spot. Depends on data distribution and typical use cases. 10 characters seem like a good starting point. For more than 10 characters, left() effectively turns into a very fast and simple hashing algorithm that's good enough for many (but not all) use cases.
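The two-step logic (narrow down with the truncated index, then filter with the full predicate) can be sketched in Python as follows. This is only a model of the idea using a sorted list and binary search; it is not how PostgreSQL implements a btree, and build_index, search and the sample names are assumptions made for the example.

```python
import bisect

PREFIX_LEN = 10  # same cut-off as left(name, 10) above


def build_index(names):
    """Sorted (key, name) pairs, keyed like lower(left(name, 10))."""
    return sorted((name[:PREFIX_LEN].lower(), name) for name in names)


def search(index, term):
    """Find names starting with term (case-insensitive)."""
    term = term.lower()
    short = term[:PREFIX_LEN]
    keys = [key for key, _ in index]  # rebuilt per call to keep it short
    # Step 1: range scan on the truncated key, like the first clause.
    lo = bisect.bisect_left(keys, short)
    hi = bisect.bisect_right(keys, short + chr(0x10FFFF))
    # Step 2: recheck the full name, like the second clause.
    return [name for _, name in index[lo:hi] if name.lower().startswith(term)]


idx = build_index(["Hamburg", "Hamburg-Altona", "Hamburg-Altstadt", "Hameln"])
print(search(idx, "Hamburg-Alto"))
print(search(idx, "Ham"))
```

For the 12-character search term, both "Hamburg-Al..." entries share the same truncated key and survive step 1; only the recheck in step 2 separates them.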
5) Optimize disc representation with CLUSTER
So, the predominant access pattern will be to retrieve a bunch of adjacent rows according to our index lower_left_name_text_pattern_ops_idx. And you mostly read and hardly ever write. This is a textbook case for CLUSTER. The manual:
When a table is clustered, it is physically reordered based on the index information.
With a huge table like yours, this can dramatically improve response time because all rows to be fetched are in the same or adjacent blocks on disk.
First call:
CLUSTER tbl USING lower_left_name_text_pattern_ops_idx;
Information which index to use will be saved and successive calls will re-cluster the table:
CLUSTER tbl;
CLUSTER; -- cluster all tables in the db that have previously been clustered.
If you don't want to repeat it:
ALTER TABLE tbl SET WITHOUT CLUSTER;
However, CLUSTER takes an exclusive lock on the table. If that's a problem, look into pg_repack or pg_squeeze, which can do the same without exclusive lock on the table.
6) Prevent too many rows in the result
Demand a minimum of, say, 3 or 4 characters for the search string. I add this for completeness, you probably do it anyway.
And LIMIT the number of rows returned:
SELECT name
FROM tbl
WHERE lower(left(name,10)) ~~ (lower('Hambu') || '%')
LIMIT 501;
If your query returns more than 500 rows, tell the user to narrow down his search.
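The point of asking for 501 rows rather than 500 is that the extra row acts as an overflow flag, so no separate count query is needed. A minimal application-side sketch of that check, with the hypothetical helper fetch_capped standing in for whatever code consumes the result set:

```python
from itertools import islice

MAX_ROWS = 500  # rows we are willing to show; the query asks for one more


def fetch_capped(matches, max_rows=MAX_ROWS):
    """Return (rows, too_many) from an iterable of matching rows.

    At most max_rows + 1 items are consumed; the extra one is never
    shown and only signals that the search term is too broad.
    """
    rows = list(islice(matches, max_rows + 1))
    too_many = len(rows) > max_rows
    return rows[:max_rows], too_many


rows, too_many = fetch_capped(range(10), max_rows=3)
print(rows, too_many)
```

When too_many is set, the interface can ask for a longer search string instead of rendering a truncated list.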
7) Optimize filter method (operators)
If you absolutely must squeeze out every last microsecond, you can utilize operators of the text_pattern_ops family. Like this:
SELECT name
FROM tbl
WHERE lower(left(name, 10)) ~>=~ lower('Hambu')
AND lower(left(name, 10)) ~<=~ (lower('Hambu') || chr(2097151));
You gain very little with this last stunt. Normally, standard operators are the better choice.
If you do all that, search time will be reduced to a matter of milliseconds.
I think a better approach is to keep your data in your database (Postgres or CouchDB) and index it with a full-text search engine, like Lucene, Solr or ElasticSearch.
Having said that, there's a project integrating CouchDB with Lucene.

How to "update" a column using pig latin

Imagine I have the following table available to me:
A: { x: int, y: int, z: int, ...99 other columns... }
I now want to transform this, such that z is set to NULL where x > y, with the resulting dataset to be stored as B.
and I want to do it without having to explicitly mention all the other columns, as this becomes a maintenance nightmare.
Is there a simple solution?
This issue is tracked in this JIRA:
PIG-1693 There needs to be a way in foreach to indicate "and all the rest of the fields"
Currently I don't know anything simpler than doing what you say or not loading Z and adding a new column Z with the star expression.
I was able to drop some of the column bloat by nesting them in single-row bags and flattening afterwards.
Still, it feels like a bit of a hack. So I'm also investigating cascading to see if it's a better fit for my scenario.
A feature to facilitate your scenario was added in Pig 0.9. The new project-range operator (..) allows you to express a range of fields by indicating the starting and/or ending field names as in this example:
result = FOREACH someInput GENERATE field1, field2, null as field3, field4 .. ;
In the example above field1/2/3/4 are actual field names. One of the fields is set to null while the other fields are kept intact.
More details in this "New Apache Pig 0.9 Features – Part 3" article: http://hortonworks.com/blog/new-apache-pig-0-9-features-part-3-additional-features/
To solve your specific problem you probably want to do a FILTER and an UNION to combine the results.
Of course you can select columns by column number, but that can easily become a nightmare if you change anything at all. I have found column names to be much more stable, and therefore I recommend the following solution:
Update mycol when it is between two known columns
You can use .. to indicate leading or trailing columns (or in-between columns). Here is how that would work out if you want to change the value of 'MyCol' to 'updatedvalue'.
aliasAfter = FOREACH aliasBefore GENERATE
.. colBeforeMyCol, updatedvalue, colAfterMyCol ..;

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something' then there is no point in an index. If you think of an index as specifying a particular order for rows, then sorting by x if you're searching for "%something" is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT
*
FROM (
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
sq.*
FROM (
SELECT
c.*,
p.providername,
p.school,
p.website,
p.type
FROM
cpd_COURSES c, cpd_PROVIDERS p
WHERE
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
(
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
)
) sq
) AS temp
WHERE
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid in which case you don't need an index, because it'll already have one.
Additionally, the activatedYN column is a T/F column and (according to my rule above about restricting the possible values by only 50%) a T/F column should not be indexed, either.
Finally, because you are searching with an x LIKE '%accounting%' query, you don't need an index on name, title or keywords either, because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
SQL Server Specific
Because you're using SQL Server, the Management Studio has a number of tools to help you decide where you need to put indexes. If you use the "Index Tuning Wizard", it is usually pretty good at telling you what will give you good performance improvements. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid+keyword?
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order, along with the page in the phonebook on which the person is listed, it would be a lot faster. These would be the nonclustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the Zip code lookup. My experience is that IN statements work great for small lists but the larger they get, you get better performance out of EXISTS as the query engine will stop searching for a specific value the first instance it runs into.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>
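For reference, the distance expression in that query is the spherical law of cosines: 3963 is the Earth's radius in miles and dividing by 57.2958 converts degrees to radians. Here is the same calculation as a standalone Python sketch, which can be handy for checking a radius search against known coordinates. The function name distance_miles is made up, math.radians replaces the division by 57.2958, and the clamp is an addition that guards against rounding pushing the cosine slightly outside [-1, 1].

```python
import math

EARTH_RADIUS_MILES = 3963  # same constant as in the query above


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    cos_angle = (
        math.sin(lat1) * math.sin(lat2)
        + math.cos(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)
    )
    # Rounding can yield values a hair outside acos's domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_MILES * math.acos(cos_angle)


# A quarter of the way around the equator.
print(distance_miles(0.0, 0.0, 0.0, 90.0))
```

Since this expression has to be evaluated for every candidate zip code, it is worth making sure the other predicates cut the candidate set down first.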

Resources