I want to select rows from, say, the Nth row to the Mth row in a table. I don't want to use any ORDER BY because the table is huge: 38 million rows. I found a solution for this which says to use the following query
SELECT *
FROM (select suppliers2.*, rownum rnum from
(select * from suppliers ORDER BY supplier_name) suppliers2
where rownum <= 5 )
WHERE rnum >= 3;
But since it has two SELECT statements and my table is very big (38 million rows), I wanted to know if there is any other way that is less taxing on the DB. I could also use MINUS, but I again see a problem with performance. I basically want to select the first one million rows and put them into a file, then select the second million rows and put them into a file, and so on. Please help.
It's not clear to me why you need to page through the results in the first place. You apparently want to grab an arbitrary 1 million rows, put that data in one file, grab another arbitrary 1 million rows (ensuring that you don't grab the same row twice), put that in a second file, and repeat the process until you've generated 38 separate files. What benefit do you derive from issuing 38 separate SELECT statements rather than issuing a single SELECT statement and letting the caller simply write the first million rows that it fetches to one file and then write the second million rows that it fetches to a second file?
Are you trying to generate the files in parallel from 38 separate worker processes? If so, it seems unlikely that you'll get much benefit from parallelizing the writes at the expense of substantially increasing the amount of work that the database has to do. I guess I could envision a system where writes were slow on the client but easy to parallelize while reads on the server were very fast and there was a ton of memory available for sorting on the database server, so that it might be quicker to write the files in parallel. But there aren't many systems with those characteristics. If you do want to use parallelism, you'd generally be better served letting the client issue a single SELECT to the database and allowing the database to run that SELECT statement in parallel.
If you are determined to select the results in pages, the query you posted should be the most efficient. The fact that there are nested select statements isn't particularly relevant to the analysis of performance. The query will only hit the table once. It still may be very expensive if it needs to fetch and sort all 38 million rows in order to determine which is the 3rd row and which is the 5th row. And it will likely get steadily slower when you look for subsequent pages of data. Fetching rows 37,000,001 - 38,000,000 will require, at a minimum, reading the entire table. That's one reason that it's unlikely to be all that helpful to write the files in parallel-- pulling the first few pages of data is likely to be so much more efficient than pulling the last page that you're going to be limited by that query and the time required to pull 38 million rows over the network.
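If it helps to visualise the single-SELECT approach, here is a rough PL/SQL sketch that opens one cursor and starts a new file every million rows using UTL_FILE. The DATA_DIR directory object and the supplier columns are assumptions; a client-side program reading from a single cursor and writing local files would follow the same pattern:
DECLARE
    CURSOR c IS
        SELECT supplier_id, supplier_name   -- assumed columns; adapt to your table
        FROM   suppliers;
    TYPE t_rows IS TABLE OF c%ROWTYPE;
    l_rows          t_rows;
    l_file          UTL_FILE.FILE_TYPE;
    l_file_no       PLS_INTEGER := 0;
    l_written       PLS_INTEGER := 0;
    c_rows_per_file CONSTANT PLS_INTEGER := 1000000;
BEGIN
    OPEN c;
    LOOP
        FETCH c BULK COLLECT INTO l_rows LIMIT 10000;
        EXIT WHEN l_rows.COUNT = 0;
        FOR i IN 1 .. l_rows.COUNT LOOP
            -- start a new file every c_rows_per_file rows
            IF MOD(l_written, c_rows_per_file) = 0 THEN
                IF UTL_FILE.IS_OPEN(l_file) THEN
                    UTL_FILE.FCLOSE(l_file);
                END IF;
                l_file_no := l_file_no + 1;
                l_file := UTL_FILE.FOPEN('DATA_DIR', 'suppliers_' || l_file_no || '.csv', 'w');
            END IF;
            UTL_FILE.PUT_LINE(l_file, l_rows(i).supplier_id || ',' || l_rows(i).supplier_name);
            l_written := l_written + 1;
        END LOOP;
    END LOOP;
    IF UTL_FILE.IS_OPEN(l_file) THEN
        UTL_FILE.FCLOSE(l_file);
    END IF;
    CLOSE c;
END;
/
Note that this writes the files on the database server, so it is only a sketch of the "one pass, many files" idea rather than a drop-in replacement for whatever client you are using.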
Here’s my situation. There’s an imported table that has about a million records. I need to perform a lot of updates, inserts, etc. to this and other tables based on what is in each record.
I could do this using a few set-based SQL statements. However, the DBAs don't want me to use that approach, since many of the actions touch heavily used tables and they don't want me locking a lot of records at once.
However, I don't want to do a row-by-row cursor approach either.
I would like to test out doing a batch of hundred rows at a time as follows:
Add a batch_number field to the imported table and populate it with an incremented integer for every 100 rows.
Then loop from 1 through the max batch_number. In that loop I'll use a set-based ETL approach, with an additional WHERE clause in each statement: WHERE batch_number = loop-number.
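A minimal sketch of what I mean, written as Oracle-style SQL/PL-SQL to match the other examples on this page (the table and column names are invented):
-- Tag every 100 rows with an incrementing batch number (the id column is hypothetical)
MERGE INTO imported_table t
USING (SELECT ROWID AS rid,
              TRUNC((ROW_NUMBER() OVER (ORDER BY id) - 1) / 100) + 1 AS bn
       FROM   imported_table) s
ON (t.ROWID = s.rid)
WHEN MATCHED THEN UPDATE SET t.batch_number = s.bn;
COMMIT;

-- Process one batch per iteration, committing between batches to keep locks short-lived
DECLARE
    l_max_batch PLS_INTEGER;
BEGIN
    SELECT MAX(batch_number) INTO l_max_batch FROM imported_table;
    FOR b IN 1 .. l_max_batch LOOP
        UPDATE target_table tt                -- hypothetical target table and columns
        SET    tt.status = 'PROCESSED'
        WHERE  tt.id IN (SELECT i.target_id
                         FROM   imported_table i
                         WHERE  i.batch_number = b);
        COMMIT;
    END LOOP;
END;
/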
Is this a sound approach, or is there a better alternative?
Recently I used an Oracle 11g database to do my homework. I had 12 tables, like trip_data_11 and trip_data_12.
They have the same structure and the number of records is almost the same. I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
I did the same for trip_data_12.
Then I used the following select statement to select the taxi numbers per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It takes 38 seconds. When I changed the table name to SYS.TRIP_DATA_12, the problem appeared: it took more than 2 hours.
What's more, it did not finish. I don't know why.
Today I asked my classmate and he said to clear the cache. So I used the following statements to do it.
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I use the same select statement for SYS.TRIP_DATA_11 I get the same poor performance as with SYS.TRIP_DATA_12. Why?
It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fall between -74.2593 and -73.7011? It might be a lot more than, say, between -71.00 and -68.59, even though that's a broader range. Understanding your data - its volume, its distribution and its skew - is crucial.
As a first step, learn how to use EXPLAIN PLAN. Find out more. To get better plans, gather statistics on your tables and their indexes using the DBMS_STATS package. Find out more.
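As a rough sketch of both steps (run it in your own schema rather than SYS; the WHERE clause here is just a shortened version of your query):
-- Show the optimizer's plan for a query
EXPLAIN PLAN FOR
    SELECT COUNT(DISTINCT td.medallion)
    FROM   trip_data_11 td
    WHERE  td.pickup_longitude BETWEEN -74.2593 AND -73.7011
    AND    td.pickup_latitude  BETWEEN  40.4770 AND  40.9171;

SELECT * FROM TABLE(dbms_xplan.display);

-- Gather statistics on the table and its indexes
BEGIN
    dbms_stats.gather_table_stats(
        ownname => USER,
        tabname => 'TRIP_DATA_11',
        cascade => TRUE);   -- cascade => TRUE also gathers statistics on the indexes
END;
/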
One tip. Oracle will generally use only one index to access a table. So it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from an index designed to service this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add medallion to the index too.) Experiment with the column order. As latitude has a narrower range than longitude, it should probably go first; maybe the drop-off values should appear before the pick-up ones. You want an index in which the greatest number of related records are clustered together.
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and @Justin's right: don't use SYS for doing application work. Even for a school assignment you should create a fresh schema and create your tables, etc. in that.
I noticed when looking at dba_tab_col_statistics for my table (m_CURRENT) that the num_buckets value for column TNE is 75 on one database and 254 on another. DB is Oracle 10g.
This seems to be the main difference between the two tables. Is there a way to get both databases to match num_bucket values?
I have a delete statement that is fast on one database and very slow on another. I know there are several reasons a query plan can differ on two databases. After a lot of analysis, I suspect that getting the slow query database to have the same num_bucket setting could ensure my delete statement does a range_scan (fast) and not a fast_full_scan (slow in this case) on index TNE_idx.
How do you gather stats on both databases? Do you have a regular stats-gathering script? As a one-off, you can do this to gather histograms on that one column alone:
begin
    dbms_stats.gather_table_stats(user, 'M_CURRENT',
        method_opt  => 'for columns TNE size 254',
        cascade     => false,
        granularity => 'ALL',
        degree      => 8);
end;
/
The size parameter sets the number of buckets (if there are fewer distinct values than that, the result will be fewer).
The above doesn't specify estimate_pct, so it will sample a smaller population of values rather than 100%. If you want 100 percent, specify that in the estimate_pct parameter*, but if you have a regular script, this may get overwritten later on.
* You can check the current sample size by comparing SAMPLE_SIZE in DBA_TAB_COL_STATISTICS to NUM_ROWS in DBA_TABLES.
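For example, something along these lines (a sketch, using the table and column from your question):
select c.sample_size, t.num_rows
from   dba_tab_col_statistics c
join   dba_tables t
       on t.owner = c.owner and t.table_name = c.table_name
where  c.table_name  = 'M_CURRENT'
and    c.column_name = 'TNE';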
I want to follow up. DazzaL's answer was very informative, but I was wrong: the buckets did not make a difference. I could have saved a lot of time by running
explain plan for delete blah blah blah
then
select * from table(dbms_xplan.display)
and focusing on the right step id.
I zeroed in on step id 5, where an index fast full scan was being used, when I should have started on step id 2, where the two database plans were different. I saw the fast database doing a nested loop anti join and the slow database doing a merge join anti. Once I added the (undocumented?) hint
NL_AJ
with no parameters in the subquery, the desired fast index_range_scan was used.
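For anyone wondering where the hint goes: it sits inside the NOT IN subquery of the anti-join. My actual delete was more complicated, but the shape was roughly this (the table and column names here are made up):
delete from m_current m
where  m.tne not in (select /*+ NL_AJ */ k.tne
                     from   keep_list k);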
As I write this I wonder if the bucket number matching was important, but I don't have the time to retest. It ain't broke, so I will not touch it!
I have a product table which has many columns. The primary key is productid. There are 50,000 rows, but when I issue a select statement like select * from products it takes 10 minutes to get the full data. So please advise me what to do so that I can run my query faster.
Is your primary key also the clustering key on that table?
If you do a SELECT * .... you'll basically always get a full table scan. There's really nothing that can speed that query up - you want all rows, all columns - so you get it all and it takes the time it takes.
If you do more "focused" queries like
SELECT col1, col2 FROM dbo.Products WHERE SomeColumn = 42
then you have a chance of speeding this up by using the appropriate indices.
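For example, an index along these lines could let that query avoid the full table scan entirely (the names are just the placeholders from the query above; SQL Server 2000 has no INCLUDE clause, so the extra columns go into the key itself):
-- A covering index for the sample query (hypothetical column names)
CREATE INDEX IX_Products_SomeColumn ON dbo.Products (SomeColumn, col1, col2);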
Buy a better computer.
Seriously.
SQL Server 2000 was retired years ago, so this is an OLD install. 50,000 products is a joke - any table below 1 million rows is nothing.
But when i issue a select statement like select * from products then it is taking 10 minute
to get the full data.
Assuming this is over LAN, not over a slow internet connection, there can be 2 reasons for that:
System is TERRIBLY OVERLOADED. Like SERIOUSLY overloaded. Not that I have not seen that on old setups. Been there, seen that - hard discs so overloaded (hey, they are SCSI, they are fast) that they took more than 2 seconds to answer a request.
System is programmed by incompetents. Could be bad transaction level handling leading to terrible locks for long duration which block you. This is possible, but then you are in for a LOT of rework to get the ridiculous code out of the programming.
A select * from table should not take more than a couple of seconds to transfer all the data over the LAN. Period. Unless the table has tons of binary data (i.e. HUGE amounts of data in some fields).
Ask your local database specialist to make an analysis. Start with hardware load, then move to locking behavior. Consider upgrading to a technology that is more modern; by now you are a LOT of generations behind.
Because there's no criterion (WHERE), the time your query takes is not due to the selection (determining which rows to select) but most likely due to the sheer size of the data.
The only solution is:
Do not use SELECT *, but select only the columns you need.
I am in sort of a DWH project (not quite, but still), and there is this issue we constantly run into which I was wondering if there is a better solution for. It follows:
We receive some big files with records containing all the states a user has been in, like:
UID | State | Date
1 | Active | 20120518
2 | Inactive | 20120517
1 | Inactive | 20120517
...
And we are usually only interested in the latest state of each user. So far so good; with just a little sorting we could get it the way we want it. The only problem is, these files are usually big, like 20-60 GB, and sorting these guys is sometimes a pain since the logic for sorting isn't usually so straightforward.
What we do generally is load everything into our Oracle and use intermediary tables and materialized views to have it done. Still, sometimes performance bites us.
20-60 GB might be big, but not that big. I mean, there should be a somewhat more specialised way to deal with these records, shouldn't there?
I imagine two basic ways of tackling the issue:
1) Programming outside the DBMS, with scripts and compiled things. But maybe this is not very flexible unless a bigger amount of time is invested in developing something. Also, I might have to busy myself administering the box's resources, whereas I wish not to worry about that.
2) Load everything into the DBMS (Oracle in my case) and use whatever tools it provides to sort and clip the data. This would be my case, though I am not sure we are using all the tools, or simply doing it the right way for Oracle 10g.
Question is then:
You have a 60 GB file with millions of historical records like the one above and your user wants a table in the DB with the last state for each user.
How would you guys do it?
thanks!
There are two things you can do to speed up the process.
The first thing is to throw compute power at it. If you have Enterprise Edition and lots of cores you will get significant reductions in load time with parallel query.
The other thing is to avoid loading the records you don't want. This is why you mention pre-processing the file. I'm not sure there's much you can do there, unless you have access to a Hadoop cluster to run some map-reduce jobs on your file (well, reduce mainly, the structure you post is about as mapped as can be already).
But there is an alternative: external tables. External tables are tables which have their data in OS files rather than tablespaces. And they can be parallel enabled (provided your file meets certain criteria). Find out more.
So, you might have an external table like this
CREATE TABLE user_status_external (
uid NUMBER(6),
status VARCHAR2(10),
sdate DATE )
ORGANIZATION EXTERNAL
(TYPE oracle_loader
DEFAULT DIRECTORY data_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY newline
BADFILE 'usrsts.bad'
DISCARDFILE 'usrsts.dis'
LOGFILE 'usrsts.log'
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(
uid INTEGER EXTERNAL(6),
status CHAR(10),
sdate date 'yyyymmdd' )
)
LOCATION ('usrsts.dmp')
)
PARALLEL
REJECT LIMIT UNLIMITED;
Note that you need read and write permissions on the DATA_DIR directory object.
Having created the external table, you can load only the desired data into your target table with an insert statement:
insert into user_status (uid, status, last_status_date)
select sq.uid
, sq.status
, sq.sdate
from (
select /*+ parallel (et,4) */
et.uid
, et.status
, et.sdate
, row_number() over (partition by et.uid order by et.sdate desc) rn
from user_status_external et
) sq
where sq.rn = 1;
Note that as with all performance advice, there are no guarantees. You need to benchmark things in your environment.
Another thing is the use of INSERT: I'm assuming these are all fresh USERIDs, as that is the scenario your post suggests. If you have a more complicated scenario then you probably want to look at MERGE or a different approach altogether.
One last thing: you seem to be assuming this is a common situation, which has some standard approaches. But most data warehouses load all the data they get. They may then filter it for various different uses, data marts, etc. But they almost always maintain a history in the actual warehouse of all the distinct records. So that's why you might not get an industry standard solution.
I'd go with something along the lines of what APC said as a first go. However, I think external tables can only load data in parallel if the data is in multiple files, so you might have to cut the file into several. How are the files generated? A 20-60 GB file is a real pain to deal with - can you get the generation of the files changed so you get X 2 GB files instead, for example?
After getting all the records into the database, you might run into problems attempting to sort 60GB of data - it would be worth having a look at the sort stage of the query you are using to extract the latest status. In the past I helped large sorts by hash partitioning the data on one of the fields to be sorted, in this case user_id. Then Oracle only has to do X smaller sorts, each of which can proceed in parallel.
So, my thoughts would be:
Try and get many smaller files generated instead of 1 big one
Using External tables, see if it is feasible to extract the data you want directly from the external tables
If not, load the entire contents of the files into a hash-partitioned table - at this stage make sure you use a direct-path insert (insert /*+ append */) into a table with the NOLOGGING attribute to minimise undo and redo generation. If your database has force_logging set to true, the NOLOGGING attribute will have no effect.
Run the select on the staged data to extract only the rows you care about and then trash the staged data.
The nologging option is probably critical to you getting good performance - to load 60GB of data, you are going to generate at least 60GB of redo logs, so if that can be avoided, all the better. You would probably need to have a chat with your DBA about that!
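A rough sketch of that staging load, reusing the external table from the earlier answer (the staging table name and the parallel degree are just illustrative, and the hash partitioning mentioned above is omitted for brevity):
-- Staging table: empty copy of the external table's structure, with NOLOGGING
CREATE TABLE user_status_stage
NOLOGGING
AS SELECT * FROM user_status_external WHERE 1 = 0;

-- Direct-path, parallel load; APPEND plus NOLOGGING keeps undo/redo to a minimum
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(s, 4) */ INTO user_status_stage s
SELECT * FROM user_status_external;

COMMIT;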
Assuming you have lots of CPU available, it may also make sense to compress the data as you bulk load it into the staging table. Compression may well halve the size of your data on disk if it has repeating fields - the disk IO saved when writing it usually more than offsets any extra CPU consumed when loading it.
I may be oversimplifying the problem, but why not something like:
create materialized view my_view
tablespace my_tablespace
nologging
build immediate
refresh complete on demand
with primary key
as
select uid,state,date from
(
select /*+ parallel (t,4) */ uid, state, date, row_number() over (partition by uid order by date desc) rnum
from my_table t
)
where rnum = 1;
Then refresh fully when you need to.
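For example (assuming the materialized view is the my_view created above):
begin
    dbms_mview.refresh('MY_VIEW', method => 'C');   -- 'C' requests a complete refresh
end;
/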
Edit: And don't forget to rebuild stats and probably throw a unique index on uid.
I would write a program to iterate over each record and retain only those which are more recent than the record previously seen for that user. At the end, insert the data into the database.
How practical that is would depend on how many users we're talking about - you could end up having to think carefully about your intermediate storage.
In general, this becomes (in pseudo-code):
foreach row in file
    if savedrow is null
        save row
    else
        if row is more desirable than savedrow
            save row
        end
    end
end
send saved rows to database
The point is, you need to define how one row is considered to be more desirable than another. In the simple case, for a given user, the current row's date is later than that of the last row we saved. At the end, you'd have a list of rows, one per user, each of which has the most recent date you saw.
You could generalise the script or program so that the framework is separate from the code that understands each data file.
It'll still take a while, mind :-)