num_buckets (dba_tab_col_statistics) - oracle

I noticed when looking at dba_tab_col_statistics for my table (m_CURRENT) that the num_buckets value for column TNE is 75 on one database and 254 on another. The DB is Oracle 10g.
This seems to be the main difference between the two tables. Is there a way to get the num_buckets values to match on both databases?
I have a delete statement that is fast on one database and very slow on the other. I know there are several reasons a query plan can differ between two databases. After a lot of analysis, I suspect that getting the slow database to have the same num_buckets setting could make my delete statement do a range scan (fast) rather than a fast full scan (slow in this case) on index TNE_idx.

How do you gather stats on both databases? Do you have a regular stats-gathering script? As a one-off, you can do this to gather histograms on that one column alone:
begin
  dbms_stats.gather_table_stats(user, 'M_CURRENT',
                                method_opt  => 'for columns TNE size 254',
                                cascade     => false,
                                granularity => 'ALL',
                                degree      => 8);
end;
/
The size parameter sets the number of histogram buckets (if the column has fewer distinct values than that, the resulting bucket count will be lower).
The above doesn't specify estimate_pct, so it will sample a subset of the rows rather than 100%. If you want 100 percent, specify that in the estimate_pct parameter*, but if you have a regular stats script, this may get overwritten later on.
* You can check the current sample size by comparing sample_size in dba_tab_col_statistics to num_rows in dba_tables.
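For example, a check along those lines might look like this (a minimal sketch; adjust the table and column names to your environment):
select c.column_name,
       c.num_buckets,
       c.sample_size,
       t.num_rows
from   dba_tab_col_statistics c
join   dba_tables t
       on  t.owner      = c.owner
       and t.table_name = c.table_name
where  c.table_name  = 'M_CURRENT'
and    c.column_name = 'TNE';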

I want to follow up. DazzaL's answer was very informative, but I was wrong: the buckets did not make a difference. I could have saved a lot of time by running
explain plan for delete blah blah blah
then
select * from table(dbms_xplan.display)
and focusing on the right step id.
I zeroed in on step id 5, where an index fast full scan was being used, when I should have started on step 2, where the two database plans were different. There I saw the fast database doing a nested loop anti join and the slow database doing a merge join anti. Once I added the (undocumented?) hint
NL_AJ
with no parameters in the subquery, the desired fast index_range_scan was used.
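For illustration only (the real delete statement was elided above, and the table names here are made up), the hint goes inside the subquery, roughly like this:
delete from m_current m
where  m.tne not in (select /*+ NL_AJ */ x.tne
                     from   some_other_table x);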
As I write this I wonder if the bucket number matching was important, but I don't have the time to retest. It ain't broke, so I will not touch it!

Related

Speed up SQLite 'SELECT EXISTS' query

I've got a 3GB SQLite database file with a single table with 40 million rows and 14 fields (mostly integers and very short strings and one longer string), no indexes or keys or other constraints -- so really nothing fancy. I want to check if there are entries where a specific integer field has a specific value. So of course I'm using
SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField=?)
I haven't got much experience with SQLite and databases in general and on my first test query, I was shocked that this simple query took about 30 seconds. Subsequent tests showed that it is much faster if a matching row occurs at the beginning, which of course makes sense.
Now I'm thinking of doing an initial SELECT DISTINCT barField FROM FooTable at application startup and caching the results in software. But I'm sure there must be a cleaner SQLite way to do this; I mean, that should be part of a DBMS's job, right?
But so far, I've only created primary keys for speeding up queries, which doesn't work here because the field values are non-unique. So how can I speed up this query so that it runs in roughly constant time? (It doesn't have to be lightning fast, I'd be completely fine if it was under one second.)
Thanks in advance for answering!
P.S. Oh, and there will be about 500K new rows every month for an indefinite period of time, and it would be great if that doesn't significantly increase query time.
Adding an index on barField should speed up the subquery inside the EXISTS clause:
CREATE INDEX barIdx ON FooTable (barField);
To satisfy the query, SQLite would only have to seek the index once and detect that there is at least one matching value.
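To double-check that the index is actually picked up, EXPLAIN QUERY PLAN can confirm it (the exact wording of the output varies between SQLite versions, and the literal 42 is only a stand-in for your bound parameter):
EXPLAIN QUERY PLAN
SELECT EXISTS(SELECT 1 FROM FooTable WHERE barField = 42);
-- expected to report something like: SEARCH FooTable USING COVERING INDEX barIdx (barField=?)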

Postgresql 9.4 - FASTEST query to select and update on large dataset (>30M rows) with heavy writes/reads and locks

I want to RANDOMLY select one row from a large dataset (>30 million rows) with heavy writes/reads.
My problem: I can't leave the arbitrary pick to PostgreSQL (that would have been the cheapest/fastest query, just using 'LIMIT 1'), as it behaves erratically and in "obscure ways": see my initial problem here: postgresql 9.4 - prevent app selecting always the latest updated rows
Here is my current query:
UPDATE opportunities s
SET opportunity_available = false
FROM (
SELECT id
FROM opportunities
WHERE deal_id = #{#deal.id}
AND opportunity_available
AND pg_try_advisory_xact_lock(id)
LIMIT 1
FOR UPDATE
) sub
WHERE s.id = sub.id
RETURNING s.prize_id, s.id;
-- inspired by https://stackoverflow.com/questions/33128531/put-pg-try-advisory-xact-lock-in-a-nested-subquery
I asked a first question (postgresql 9.4 - prevent app selecting always the latest updated rows), but I think that, even though there is no clear answer, what is happening is that PostgreSQL is left free to make an arbitrary pick (I only use 'LIMIT 1' because I wanted the cheapest/fastest query), which is VERY DIFFERENT from a RANDOM pick. As a consequence, PostgreSQL often returns the latest rows updated by the administrator (which are always opportunities that have all prizes), instead of really choosing the row randomly.
I think I need to move away from the arbitrary pick to get a RANDOM pick.
In that context, what is the best choice, i.e. the fastest way to select? (Notice the 'FOR UPDATE' and the advisory locks: I need to lock rows while they are being updated to prevent concurrent calls. I'll switch to SKIP LOCKED as soon as PostgreSQL 9.5 goes out of beta.)
Use ORDER BY random(), but it is notorious (read many, many posts about this on Stack Overflow and the DBA Stack Exchange) for being really slow on large datasets => "ORDER BY RAND() is slow because the DBMS has to read all rows, sort them all, just to keep only a few rows. So the performance of this query heavily depends on the number of rows in the table, and decreases as the number of rows increase.", as explained here or here
Use OFFSET, which is also known to be slow for large datasets
Use Sampling like explained/advised here by what seem big experts: https://www.periscopedata.com/blog/how-to-sample-rows-in-sql-273x-faster.html
Use another advanced technique you might suggest (a rough sketch of one variation I'm considering appears below)
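For reference, here is that sketch: a "random jump" into the id range, assuming ids are roughly evenly distributed (the deal_id filter and advisory lock are copied from my query above; if the jump lands past the last matching id, no row is returned and the application has to retry):
UPDATE opportunities s
SET opportunity_available = false
FROM (
    SELECT id
    FROM opportunities
    WHERE deal_id = #{#deal.id}
    AND opportunity_available
    AND id >= (SELECT min(id) + floor(random() * (max(id) - min(id)))::int
               FROM opportunities)
    AND pg_try_advisory_xact_lock(id)
    ORDER BY id
    LIMIT 1
    FOR UPDATE
) sub
WHERE s.id = sub.id
RETURNING s.prize_id, s.id;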

Why does the same select statement have different costs in Oracle?

Recently I used an Oracle 11g database to do my homework. I had 12 tables, like trip_data_11 and trip_data_12.
They have the same structure and almost the same number of records. I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
I did the same for trip_data_12.
Then I used the following select statement to select the taxi numbers per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It takes 38 seconds. When I changed the table name to SYS.TRIP_DATA_12, the problem appeared: it takes more than 2 hours.
What's more, it did not finish. I don't know why.
Today I asked my classmate and he said: clear the cache. So I used the following statements to do it.
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I use the same select statement for SYS.TRIP_DATA_11, I get the same poor performance as with SYS.TRIP_DATA_12. Why?
It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fall between -74.2593 and -73.7011? It might be a lot more than between, say, -71.00 and -68.59, even though that's a broader range. Understanding your data - its volume, its distribution and its skew - is crucial.
As a first step, learn how to use EXPLAIN PLAN. To get better plans, gather statistics on your tables and their indexes using the DBMS_STATS package; the Oracle documentation covers both in detail.
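A minimal sketch of both steps might look like this (the table and column names are taken from the question; the DBMS_STATS parameters shown are just a simple starting point):
explain plan for
  select count(distinct medallion)
  from   trip_data_12
  where  trip_distance > 0;

select * from table(dbms_xplan.display);

begin
  dbms_stats.gather_table_stats(ownname => user,
                                tabname => 'TRIP_DATA_12',
                                cascade => true);  -- cascade => true gathers index statistics too
end;
/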
One tip: Oracle generally uses only one index to access a table, so it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from an index designed to service this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add MEDALLION to the index too.) Experiment with the column order. As latitude has a narrower range than longitude, it should probably go first; maybe the drop-off values should appear before the pick-up ones. You want an index in which the greatest number of related records are clustered together.
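For instance, a fully covering variant might look like this (the index name and column order are only illustrative; PICKUP_DATETIME and MEDALLION are included so the query never has to visit the table at all):
create index add_11_cover on trip_data_11
  ( pickup_latitude
  , pickup_longitude
  , dropoff_latitude
  , dropoff_longitude
  , trip_distance
  , passenger_count
  , pickup_datetime
  , medallion );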
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and @Justin is right: don't use SYS for doing application work. Even for a school assignment you should create a fresh schema and create your tables, etc. in that.

Oracle SQL: Select data from a certain row to another row

I want to select rows from, say, the Nth row to the Mth row in a table. I don't want to use any ORDER BY because the table data is huge: 38 million rows. I found a solution for this which says to use the following query
SELECT *
FROM (select suppliers2.*, rownum rnum from
(select * from suppliers ORDER BY supplier_name) suppliers2
where rownum <= 5 )
WHERE rnum >= 3;
But since it has nested select statements and my table is very big (38 million rows), I wanted to know if there is any other way which is not as taxing on the DB. I can also see that I could use MINUS, but again I see a problem with performance. I basically want to select the first one million rows and put them into a file, then select the second million rows and put them into a file, and so on. Please help.
It's not clear to me why you need to page through the results in the first place. You apparently want to grab an arbitrary 1 million rows, put that data in one file, grab another arbitrary 1 million rows (ensuring that you don't grab the same row twice), put that in a second file, and repeat the process until you've generated 38 separate files. What benefit do you derive from issuing 38 separate SELECT statements rather than issuing a single SELECT statement and letting the caller simply write the first million rows that it fetches to one file and then write the second million rows that it fetches to a second file?
Are you trying to generate the files in parallel from 38 separate worker processes? If so, it seems unlikely that you'll get much benefit from parallelizing the writes at the expense of increasing the amount of work that the database has to do substantially. I guess I could envision a system where writes were slow on the client but easy to parallelize while reads on the server were very fast and there was a ton of memory available for sorting on the database server that it might be quicker to write the files in parallel. But there aren't many systems with those characteristics. If you do want to use parallelism, you'd generally be better served letting the client issue a single SELECT to the database and allowing the database to run that SELECT statement in parallel.
If you are determined to select the results in pages, the query you posted should be the most efficient. The fact that there are nested select statements isn't particularly relevant to the analysis of performance. The query will only hit the table once. It still may be very expensive if it needs to fetch and sort all 38 million rows in order to determine which is the 3rd row and which is the 5th row. And it will likely get steadily slower when you look for subsequent pages of data. Fetching rows 37,000,001 - 38,000,000 will require, at a minimum, reading the entire table. That's one reason that it's unlikely to be all that helpful to write the files in parallel-- pulling the first few pages of data is likely to be so much more efficient than pulling the last page that you're going to be limited by that query and the time required to pull 38 million rows over the network.
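If the goal is simply 38 non-overlapping chunks rather than a particular order, one hedged alternative is a single pass that tags each row with a chunk number and lets the client split the output on that column, instead of issuing 38 paged queries (the MOD/ROWNUM bucketing below is arbitrary, but within one execution every row lands in exactly one chunk):
select s.*, mod(rownum, 38) as chunk_no
from   suppliers s;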

(TSQL) INSERT doubling time of the query

I have a quite complex multi-join TSQL SELECT query that runs for about 8 seconds and returns about 300K records, which is currently acceptable. But I need to reuse the results of that query several times later, so I am inserting them into a temp table. The table is created in advance with columns that match the output of the SELECT query. But as soon as I do INSERT INTO ... SELECT, the execution time more than doubles to over 20 seconds! The execution plan shows that 46% of the query cost goes to "Table Insert" and 38% to Table Spool (Eager Spool).
Any idea why this is happening and how to speed it up?
Thanks!
The "Why" of it hard to say, we'd need a lot more information. (though my SWAG would be that it has to do with logging...)
However, the solution, 9 times out of 10 is to use SELECT INTO to make your temp table.
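A minimal sketch of that pattern, with hypothetical table names standing in for the real multi-join (SELECT INTO creates the temp table itself and is typically minimally logged in tempdb):
SELECT o.OrderID, o.CustomerID, c.Region
INTO   #QueryResults             -- created on the fly by SELECT INTO
FROM   dbo.Orders o
JOIN   dbo.Customers c ON c.CustomerID = o.CustomerID;

-- the results can then be reused several times
SELECT COUNT(*) FROM #QueryResults;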
I would start by looking at standard tuning items. Is the disk performing well? Are there sufficient resources (IO, RAM, CPU, etc.)? Is there a bottleneck in the RDBMS? It does sound like it could be the issue, but what is happening with locking? Does other code give similar results? Is other code performant?
A few things I can suggest based on the information you have provided. If you don't care about dirty reads, you could always change the transaction isolation level (if you're using MS T-SQL):
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select ...
This may speed up your initial query, as locks will not need to be taken on the data you are querying. If you're not using SQL Server, do a Google search for how to do the same thing with the technology you are using.
For the insert portion, you said you are inserting into a temp table. Does your database support adding primary keys or indexes on your temp table? If it does, have a dummy column in there that is an indexed column. Also, have you tried using a regular database table for this? Depending on your setup, it is possible that using one will speed up your insert times.
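If you stay with a pre-created temp table, that advice might look something like this (the table and column names are only examples, SQL Server syntax):
CREATE TABLE #Staged
(
    RowID      INT IDENTITY(1,1),   -- the "dummy" indexed column
    OrderID    INT,
    CustomerID INT
);
CREATE CLUSTERED INDEX IX_Staged_RowID ON #Staged (RowID);

INSERT INTO #Staged (OrderID, CustomerID)
SELECT o.OrderID, o.CustomerID
FROM   dbo.Orders o;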
