performance of rand()

I have heard that I should avoid using 'order by rand()', but I really need to use it. Contrary to what I have been hearing, the following query comes back very fast.
select
cp1.img_id as left_id,
cp1.img_filename as left_filename,
cp1.facebook_name as left_facebook_name,
cp2.img_id as right_id,
cp2.img_filename as right_filename,
cp2.facebook_name as right_facebook_name
from
challenge_photos as cp1
cross join
challenge_photos as cp2
where
(cp1.img_id < cp2.img_id)
and
(cp1.img_id,cp2.img_id) not in ((0,0))
and
(cp1.img_status = 1 and cp2.img_status = 1)
order by rand() limit 1
Is this query considered 'okay', or should I use one of the queries I can find by searching for "alternative to rand()"?

It's usually a performance thing. You should avoid, as much as possible, per-row functions since they slow down your queries.
That means things like uppercase(name), salary * 1.1 and so on. It also includes rand(). It may not be an immediate problem (at 10,000 rows) but, if you ever want your database to scale, you should keep it in mind.
The two main issues are that you're evaluating a per-row function and that you then have to do a full sort on the output before selecting the first row. The DBMS cannot use an index when you sort on a random value.
But, if you need to do it (and I'm not making judgement calls there), then you need to do it. Pragmatism often overcomes dogmatism in the real world :-)
A possibility, if performance ever becomes an issue, is to get a count of the records with something like:
select count(*) from ...
then choose a random value on the client side and use a:
limit <start>, <count>
clause in another select, adjusting for the syntax used by your particular DBMS. This should remove the sorting issue and the transmission of unneeded data across the wire.
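For example, a minimal sketch of that two-step approach in MySQL, reusing the filters from the query above. The 123 below stands in for the random offset the client picks between 0 and count - 1; it is illustrative, not literal code to ship:
-- step 1: count the eligible pairs once
select count(*)
from challenge_photos as cp1
cross join challenge_photos as cp2
where (cp1.img_id < cp2.img_id)
  and (cp1.img_id, cp2.img_id) not in ((0,0))
  and (cp1.img_status = 1 and cp2.img_status = 1);

-- step 2: fetch exactly one pair at a client-chosen random offset,
-- with no per-row rand() call and no sort on a random value
select cp1.img_id as left_id, cp1.img_filename as left_filename, cp1.facebook_name as left_facebook_name,
       cp2.img_id as right_id, cp2.img_filename as right_filename, cp2.facebook_name as right_facebook_name
from challenge_photos as cp1
cross join challenge_photos as cp2
where (cp1.img_id < cp2.img_id)
  and (cp1.img_id, cp2.img_id) not in ((0,0))
  and (cp1.img_status = 1 and cp2.img_status = 1)
limit 123, 1;   -- 123 = random offset in [0, count-1] picked on the client
MySQL still steps over the skipped rows, so very large offsets are not free, but there is no per-row rand() and no sort on a random value.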

Related

How to take data in portions from Oracle using Mybatis?

In my application I am making a query to Oracle and getting data this way:
<select id="getAll" resultType="com.mappers.MyOracleMapper">
SELECT * FROM "OracleTable"
</select>
This returns all the data, but there is a lot of it and processing everything at once takes too long: the response from the database arrives in 3-4 minutes, which is not convenient.
How can I receive rows in portions without using the id field (it does not exist, I do not know why)? That is, take the first portion of rows, for example the first 50, process them, then take the next portion. Ideally the portion size would be a variable in the properties file.
I can't figure out how to do this in MyBatis. This is new to me. Thanks in advance.
Update: there is such a field and it is unique.
OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY
doesn't work, because the Oracle version is earlier than 12c.
If you want to read millions of rows that's going to take time. It's normal to expect a few minutes to read and receive all the data over the wire.
Now, you have two options:
Use a Cursor
In MyBatis you can read the result of the query using the buffering a cursor gives you. The cursor reads a few hundred rows at a time and your app reads them one by one. Your app doesn't notice that behind the scenes there is buffering. Pretty good. For example, you can do:
Cursor<Client> clients = this.sqlSession.selectCursor("getAll");
for (Client c : clients) {
// process one client
}
Consider that cursors remain open until the end of the transaction. If you close the transaction (or exit the method marked as @Transactional) the cursor won't be usable anymore.
Use Manual Pagination
This solution can work well for the first pages of the result set, but it becomes increasingly inefficient and slooooooow the more you advance in the result set. Use it only as a last resort.
The only case where this strategy can be efficient is when you have the chance of implementing "key set pagination". I assume it's not the case here.
You can modify your query to perform explicit pagination. For example, you can do:
<select id="getPage" resultType="com.mappers.MyOracleMapper">
select * from (
    SELECT rownum rnum, x.*
    FROM (
        SELECT * FROM "OracleTable" ORDER BY id
    ) x
    WHERE rownum &lt;= #{endingRow}
)
where rnum >= #{startingRow}
</select>
You'll need to provide the extra parameters startingRow and endingRow (note that the <= comparison is written as &lt;= because the query lives inside an XML mapper file).
NOTE: It's imperative you include an ORDER BY clause. Otherwise the pagination logic is meaningless. Choose any ordering you want, preferably something that is backed by an existing index.
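Since key set pagination came up above and the question's update says a unique field does exist, here is a hedged sketch of that variant. lastSeenId and pageSize are made-up parameter names, and the statement would sit in a <select> mapping like the ones above (again with <= escaped as &lt;= in the XML):
SELECT *
FROM (
    SELECT *
    FROM "OracleTable"
    WHERE id > #{lastSeenId}   -- last id of the previous page (use a value lower than any id for the first page)
    ORDER BY id
)
WHERE rownum <= #{pageSize}    -- e.g. 50
Each page then seeks straight to where the previous one ended instead of re-counting skipped rows, so it stays fast even on late pages.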

SQLite - Exploiting Sorted Indexes

This is probably simple, but I can't find the answer.
I'm trying to minimise the overhead of selecting records using ORDER BY
My understanding is that in...
SELECT gorilla, chimp FROM apes ORDER BY bananas LIMIT 10;
...the full set of matching records is retrieved so that the ORDER BY can be actioned, even if I only want the top ten records. This makes sense.
Trying to eliminate that overhead, I looked at the possibility of storing the records in a pre-defined order, but that would only work until insertions/deletions took place, upon which I would have to re-build the table. Not viable.
I found an option in SQLite (I assume it also exists in other SQLs) to create a sorted index (https://www.sqlite.org/lang_createindex.html)...
CREATE INDEX index_name ON apes (bananas DESC);
...which I ASSUME to mean that the index (not the table) is sorted in descending order and will remain so after updates.
My question is - how do I exploit this? The SQLite documentation is a bit meh in this regard. Is there some kind of "SELECT FROM index" or equivalent? Or does the fact that a sorted index exists on a column mean that any results from querying that column will be returned in the order of the index rather than the order of the column?
Or am I missing something entirely?
I'm working with SQLite3, queried by PHP 7.1
ORDER BY with LIMIT is a little bit more efficient than a plain ORDER BY because only the first few rows need to be completely sorted.
Anyway, for a single-column index, the sort order (ASC or DESC) is pointless because SQLite can step through an index either forwards or backwards.
Indexes are used automatically when SQLite estimates that they would be useful.
To check what actually happens, run EXPLAIN QUERY PLAN (or set .eqp on in the sqlite3 shell).
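For example, a rough sketch using the names from the question (the index name is arbitrary, and the exact EXPLAIN QUERY PLAN wording varies between SQLite versions):
CREATE INDEX idx_apes_bananas ON apes(bananas);

EXPLAIN QUERY PLAN
SELECT gorilla, chimp FROM apes ORDER BY bananas LIMIT 10;
-- expected: a line like "SCAN apes USING INDEX idx_apes_bananas"
-- and no "USE TEMP B-TREE FOR ORDER BY" step; drop the index and
-- the temporary B-tree sort comes back.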

pl/sql query optimization with function call in where clause

I am trying to optimize a query where I am using a function() call in the where clause.
The function() simply changes the timezone of the date.
When I call the function as part of the SELECT, it executes extremely fast (< 0.09 sec against a table of many hundreds of thousands of rows):
select
id,
fn_change_timezone (date_time, 'UTC', 'US/Central') AS tz_date_time,
value
from a_table_view
where id = 'keyvalue'
and date_time = to_date('01-10-2014','mm-dd-yyyy')
However, this version runs "forever" [meaning I stop it after umpteen minutes]
select id, date_time, value
from a_table_view
where id = 'keyvalue'
and fn_change_timezone (date_time, 'UTC', 'US/Central') = to_date('01-10-2014','mm-dd-yyyy')
(I know I'd have to change the date being compared, it's just an example.)
So my question is two-fold:
If the function is so fast outside of the where clause, why is it so much slower than, say, TRUNC() or other functions? (Obviously trunc() doesn't do a table lookup like my function does - but still, the function is very fast outside the where clause.)
What are alternate ways of accomplishing this outside of the where clause ?
I tried this as an alternative, which did not seem any better, it still ran until I stopped the query:
select
tz.date_time,
v.id,
v.value
from
(select
fn_change_timezone(to_date('01/10/2014-00:00:00', 'mm/dd/yyyy-hh24:mi:ss'), 'UTC', 'US/Central') as date_time
from dual
) tz
inner join
(
select
id,
fn_change_timezone (date_time, 'UTC', 'US/Central') AS v_date_time,
value
from a_table_view
where id = 'keyvalue'
) v ON
v.v_date_time = tz.date_time
Hopefully I am explaining the issue well.
There are at least four potential issues with using functions in the WHERE clause:
Functions may prevent indexes. A function-based index can solve this issue (see the sketch after this list).
Functions may prevent partition pruning. Hard-coding values or maybe virtual column partitioning are possible solutions, although neither is likely helpful in this case.
Functions may run slowly. Even if the function is cheap, it is often very expensive to switch between SQL and PL/SQL. Some possible solutions are DETERMINISTIC, PARALLEL_ENABLE, function result caching, defining the logic purely in SQL, or, in 12c, defining the function in the statement's WITH clause.
Functions may cause bad cardinality estimates. It's hard enough for the optimizer to guess the result of normal conditions, adding procedural code makes it even more difficult. Using ASSOCIATE STATISTICS it is possible to provide some information to the optimizer about the cost and cardinality of the function.
Without more information, such as an explain plan, it is difficult to know what the specific issue is with this query.
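To illustrate the first point, a hedged sketch of a function-based index. It assumes the index can be created on the base table behind a_table_view (called a_table here, a made-up name) and that fn_change_timezone is, or can be declared, DETERMINISTIC, which Oracle requires for function-based indexes:
-- only possible if the function is declared DETERMINISTIC
CREATE INDEX idx_tz_date_time
    ON a_table (fn_change_timezone(date_time, 'UTC', 'US/Central'));

-- after gathering statistics, a predicate written exactly as
--   fn_change_timezone(date_time, 'UTC', 'US/Central') = :some_date
-- can be answered from idx_tz_date_time instead of calling the
-- function for every row.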
Function calls in the WHERE clause are a Bad Thing. The problem is that the function may be called for every row in the table, which may be many more than the selected set. This can be a real performance killer (don't ask me how I know :-).
In the first version with the function call in the SELECT list the function will only be called when a row has been chosen and is being added to the result set - in the second version the function may well be called for every row in the table.
Also, depending on the version of Oracle you're using there may be significant overhead to calling a user function from SQL, but I think this penalty has been largely eliminated in versions since 10g.
Best of luck.
Share and enjoy.
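On the second question - keeping the function out of the WHERE clause - one common rewrite is to apply the conversion to the constant instead of the column, so the bare date_time column stays indexable and the function runs once per execution rather than once per row. This is only a sketch and assumes fn_change_timezone converts the other way when you swap the zone arguments:
select id, date_time, value
from a_table_view
where id = 'keyvalue'
  and date_time = fn_change_timezone(to_date('01-10-2014', 'mm-dd-yyyy'), 'US/Central', 'UTC');

-- for a whole Central-time day, a half-open range works the same way:
--   and date_time >= fn_change_timezone(to_date('01-10-2014', 'mm-dd-yyyy'), 'US/Central', 'UTC')
--   and date_time <  fn_change_timezone(to_date('01-11-2014', 'mm-dd-yyyy'), 'US/Central', 'UTC')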

How to inline a variable in PL/SQL?

The Situation
I have some trouble with my query execution plan for a medium-sized query over a large amount of data in Oracle 11.2.0.2.0. In order to speed things up, I introduced a range filter that does roughly something like this:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
-- [...]
As you can see, I want to restrict the JOIN of organisations using an optional range of organisation numbers. Client code can call DO_STUFF with (supposed to be fast) or without (very slow) the restriction.
The Trouble
The trouble is, PL/SQL will create bind variables for the above org_from and org_to parameters, which is what I would expect in most cases:
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((:B1 IS NULL) OR (:B1 <= org.no))
AND ((:B2 IS NULL) OR (:B2 >= org.no)))
-- [...]
The Workaround
Only in this case, I measured the query execution plan to be a lot better when I just inline the values, i.e. when the query executed by Oracle is actually something like
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
-- [...]
By "a lot", I mean 5-10x faster. Note that the query is executed very rarely, i.e. once a month. So I don't need to cache the execution plan.
My questions
How can I inline values in PL/SQL? I know about EXECUTE IMMEDIATE, but I would prefer to have PL/SQL compile my query, and not do string concatenation.
Did I just measure something that happened by coincidence or can I assume that inlining variables is indeed better (in this case)? I ask because I think that bind variables force Oracle to devise a general execution plan, whereas inlined values would allow it to analyse very specific column and index statistics. So I can imagine that this is not just a coincidence.
Am I missing something? Maybe there is an entirely other way to achieve query execution plan improvement, other than variable inlining (note I have tried quite a few hints as well but I'm not an expert on that field)?
In one of your comments you said:
"Also I checked various bind values. With bind variables I get some FULL TABLE SCANS, whereas with hard-coded values, the plan looks a lot better."
There are two paths. If you pass in NULL for the parameters then you are selecting all records. Under those circumstances a Full Table Scan is the most efficient way of retrieving data. If you pass in values then indexed reads may be more efficient, because you're only selecting a small subset of the information.
When you formulate the query using bind variables the optimizer has to take a decision: should it presume that most of the time you'll pass in values or that you'll pass in nulls? Difficult. So look at it another way: is it more inefficient to do a full table scan when you only need to select a sub-set of records, or to do indexed reads when you need to select all records?
It seems as though the optimizer has plumped for full table scans as being the least inefficient operation to cover all eventualities.
Whereas when you hard-code the values the Optimizer knows immediately that 10 IS NULL evaluates to FALSE, and so it can weigh the merits of using indexed reads to find the desired sub-set of records.
So, what to do? As you say this query is only run once a month I think it would only require a small change to business processes to have separate queries: one for all organisations and one for a sub-set of organisations.
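A rough sketch of that split inside DO_STUFF, with each branch as its own static statement so each gets its own plan. result_cur (a SYS_REFCURSOR) and customer_table are made-up names standing in for whatever the procedure really returns and joins; the column list is elided as in the original fragments:
IF org_from IS NULL AND org_to IS NULL THEN
    -- no restriction: the optimizer is free to pick the full-scan plan
    OPEN result_cur FOR
        SELECT cust.*, org.no   -- [...]
        FROM   customer_table cust
        JOIN   organisations org ON cust.org_id = org.id;
ELSE
    -- restricted range: a plain predicate on org.no, index friendly
    -- (a half-open range, with only one bound supplied, would need NVL defaults or a third branch)
    OPEN result_cur FOR
        SELECT cust.*, org.no   -- [...]
        FROM   customer_table cust
        JOIN   organisations org ON cust.org_id = org.id
        WHERE  org.no BETWEEN org_from AND org_to;
END IF;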
"Btw, removing the :R1 IS NULL clause
doesn't change the execution plan
much, which leaves me with the other
side of the OR condition, :R1 <=
org.no where NULL wouldn't make sense
anyway, as org.no is NOT NULL"
Okay, so the thing is you have a pair of bind variables which specify a range. Depending on the distribution of values, different ranges might suit different execution plans. That is, this range would (probably) suit an indexed range scan...
WHERE org.id BETWEEN 10 AND 11
...whereas this is likely to be more fitted to a full table scan...
WHERE org.id BETWEEN 10 AND 1199999
That is where Bind Variable Peeking comes into play (depending on the distribution of values, of course).
Since the query plans are actually consistently different, that implies that the optimizer's cardinality estimates are off for some reason. Can you confirm from the query plans that the optimizer expects the conditions to be insufficiently selective when bind variables are used? Since you're using 11.2, Oracle should be using adaptive cursor sharing, so it shouldn't be a bind variable peeking issue (assuming you are calling the version with bind variables many times with different NO values in your testing).
Are the cardinality estimates on the good plan actually correct? I know you said that the statistics on the NO column are accurate but I would be suspicious of a stray histogram that may not be updated by your regular statistics gathering process, for example.
You could always use a hint in the query to force a particular index to be used (though using a stored outline or optimizer plan stability would be preferable from a long-term maintenance perspective). Any of those options would be preferable to resorting to dynamic SQL.
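For instance, roughly along these lines - ORG_NO_IDX is a hypothetical name for an index on organisations(no), and customer_table again stands in for the real customer table:
SELECT /*+ INDEX(org ORG_NO_IDX) */
       cust.*, org.no   -- [...]
FROM   customer_table cust
JOIN   organisations org ON cust.org_id = org.id
WHERE  org.no BETWEEN :org_from AND :org_to;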
One additional test to try, however, would be to replace the SQL 99 join syntax with Oracle's old syntax, i.e.
SELECT <<something>>
FROM <<some other table>> cust,
organisations org
WHERE cust.org_id = org.id
AND ( ((org_from IS NULL) OR (org_from <= org.no))
AND ((org_to IS NULL) OR (org_to >= org.no)))
That obviously shouldn't change anything, but there have been parser issues with the SQL 99 syntax so that's something to check.
It smells like Bind Peeking, but I am only on Oracle 10, so I can't claim the same issue exists in 11.
This looks a lot like a need for Adaptive Cursor Sharing, combined with SQLPlan stability.
I think what is happening is that the optimizer_capture_sql_plan_baselines parameter is true, and the same for optimizer_use_sql_plan_baselines. If this is true, the following is happening:
The first time a query is parsed, it gets a new plan.
The second time, this plan is stored in the sql_plan_baselines as an accepted plan.
All following runs of this query use this plan, regardless of what the bind variables are.
If Adaptive Cursor Sharing is already active, the optimizer will generate a new/better plan and store it in the sql_plan_baselines, but it is not able to use it until someone accepts this newer plan as an acceptable alternative plan. Check dba_sql_plan_baselines and see if your query has entries with accepted = 'NO' and verified = null.
You can use dbms_spm.evolve to evolve the new plan and have it automatically accepted if the performance of the plan is at least 1.5 times better than without the new plan.
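A hedged sketch of that check-and-evolve cycle (the sql_text filter and the sql_handle value are illustrative; take the real handle from the first query):
-- see which plans exist and whether any are still waiting to be accepted
SELECT sql_handle, plan_name, enabled, accepted
FROM   dba_sql_plan_baselines
WHERE  sql_text LIKE '%organisations%';

-- verify and, if it performs well enough, accept the new plan
DECLARE
    l_report CLOB;
BEGIN
    l_report := DBMS_SPM.EVOLVE_SQL_PLAN_BASELINE(
                    sql_handle => 'SQL_123456789abcdef0',  -- made-up handle
                    verify     => 'YES',
                    commit     => 'YES');
    DBMS_OUTPUT.PUT_LINE(l_report);
END;
/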
I hope this helps.
I added this as a comment, but will offer up here as well. Hope this isn't overly simplistic, and looking at the detailed responses I may be misunderstanding the exact problem, but anyway...
It seems your organisations table has a column no (org.no) that is defined as a number. In your hard-coded example, you use numbers to do the compares.
JOIN organisations org
ON (cust.org_id = org.id
AND ((10 IS NULL) OR (10 <= org.no))
AND ((20 IS NULL) OR (20 >= org.no)))
In your procedure, you are passing in varchar2:
PROCEDURE DO_STUFF(
org_from VARCHAR2 := NULL,
org_to VARCHAR2 := NULL)
So to compare a varchar2 to a number, Oracle will have to do implicit conversions, and this may cause the full scans.
Solution: change the procedure to pass in numbers.
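A hedged sketch of that change, mirroring the fragment from the question - only the parameter types differ:
PROCEDURE DO_STUFF(
    org_from NUMBER := NULL,
    org_to   NUMBER := NULL)
-- [...]
JOIN organisations org
ON (cust.org_id = org.id
    AND ((org_from IS NULL) OR (org_from <= org.no))
    AND ((org_to   IS NULL) OR (org_to   >= org.no)))
-- [...]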

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" as the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something' then there is no point to an index. If you think of an index as specifying a particular order for rows, then sorting by x when you're searching for "%something" is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT
*
FROM (
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
sq.*
FROM (
SELECT
c.*,
p.providername,
p.school,
p.website,
p.type
FROM
cpd_COURSES c, cpd_PROVIDERS p
WHERE
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
(
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
)
) sq
) AS temp
WHERE
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid in which case you don't need an index, because it'll already have one.
Additionally, the activatedYN column is a T/F column and (according to my rule above about restricting the possible values by only 50%) a T/F column should not be indexed, either.
Finally, because you're searching with an x LIKE '%accounting%' query, you don't need an index on name, title or keywords either - it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid.
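A hedged sketch of that check in T-SQL, with made-up constraint and index names; note that SQL Server does not create an index on the referencing column automatically when you add a foreign key, so it is worth adding one for the join:
-- declare (or verify) the relationship
ALTER TABLE cpd_COURSES
    ADD CONSTRAINT FK_cpd_COURSES_provider
    FOREIGN KEY (providerid) REFERENCES cpd_PROVIDERS (providerid);

-- SQL Server will not build this for you when the FK is added
CREATE NONCLUSTERED INDEX IX_cpd_COURSES_providerid
    ON cpd_COURSES (providerid);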
SQL Server Specific
Because you're using SQL Server, Management Studio has a number of tools to help you decide where you need to put indexes. If you use the "Index Tuning Wizard" it is usually pretty good at telling you what will give you good performance improvements. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid+keyword
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
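A minimal sketch of that keyword table, following the suggestion above (names and types are illustrative):
CREATE TABLE Course_Keywords (
    courseid INT         NOT NULL,   -- references cpd_COURSES
    keyword  VARCHAR(24) NOT NULL,
    CONSTRAINT PK_Course_Keywords PRIMARY KEY CLUSTERED (courseid, keyword)
);

-- exact-match lookups can then replace the leading-wildcard LIKE:
SELECT courseid
FROM Course_Keywords
WHERE keyword = 'accounting';
-- if most lookups are by keyword, a nonclustered index on (keyword, courseid)
-- may be worth adding as well.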
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order along with the page in the phonebook where the person is listed, it would be a lot faster. These would be the non-clustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the Zip code lookup. My experience is that IN statements work great for small lists, but the larger the list gets, the better performance you get out of EXISTS, as the query engine will stop searching for a specific value as soon as it finds the first match.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>

Resources