Sampling in Oracle

I'm trying to take a sample from an insurance claims database.
For example, a 20% random sample from 1 million claims records where provider type is '25' and year is '2012'. I access the data through SQL Developer. I am a statistician with basic SQL knowledge.

You can use SAMPLE to get a random set of rows from a table.
SELECT *
FROM claim SAMPLE(20)
WHERE type ='25'
AND year = 2012;

Oracle SQL has a SAMPLE clause built in. Example:
SELECT * FROM emp SAMPLE(25)
This means each row in emp has a 25% chance of being included in the result set. Note: this does not mean that exactly 25% of the rows are necessarily selected.
This blog was a quick read with more details on sampling.
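To see why a per-row inclusion chance does not give an exact percentage, here is a minimal Python sketch of that kind of sampling. It is only a simulation of the idea described above, not Oracle's implementation; the function name `bernoulli_sample` and the integer stand-ins for claim rows are made up for illustration.

```python
import random

def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent/100."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < percent / 100.0]

# Hypothetical stand-in for 1 million claim rows.
claims = list(range(1_000_000))

# Each run lands near 200,000 rows, but not on exactly 200,000 by design.
for seed in (1, 2, 3):
    sample = bernoulli_sample(claims, 20, seed=seed)
    print(seed, len(sample))
```

If an exact sample size matters for the analysis, the count should be checked after sampling rather than assumed.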

This returns a single row taken from a random sample:
SELECT * FROM TABLE# SAMPLE(10)
FETCH NEXT 1 ROWS ONLY

Related

What's the best practice to filter out specific year in query in Netezza?

I am a SQL Server guy and have just started working on Netezza. One thing that has come up is a daily query to find the size of a table filtered by year: 2016, 2015, 2014, ...
What I am using now is something like the query below. It works for me, but I wonder if there is a better way to do it:
select count(1)
from table
where extract(year from datacolumn) = 2016
extract is a built-in function, and to my knowledge applying a function to a table with 10 billion+ rows would be unthinkable in SQL Server.
Thank you for your advice.
The only problem I see with the query is the where clause, which executes a function on the 'variable' (column) side. That effectively disables zonemaps and thus forces Netezza to scan all data pages, not only those with data from that year.
Instead write something like:
select count(1)
from table
where datecolumn between '2016-01-01' and '2016-12-31'
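The effect described above can be sketched with a toy model in Python. This is a simplified illustration of the zonemap idea (a stored min and max per page), not Netezza's actual implementation; the page layout and both function names are assumptions made for the example.

```python
from datetime import date

# Toy zonemap: each page records only the min and max date it contains.
pages = [
    {"min": date(2014, 1, 1), "max": date(2014, 12, 31)},
    {"min": date(2015, 1, 1), "max": date(2015, 12, 31)},
    {"min": date(2016, 1, 1), "max": date(2016, 12, 31)},
]

def pages_for_range_predicate(pages, lo, hi):
    """Range predicate on the raw column: skip pages whose [min, max] cannot overlap [lo, hi]."""
    return [p for p in pages if p["max"] >= lo and p["min"] <= hi]

def pages_for_function_predicate(pages):
    """Predicate on f(column): the stored min/max describe the raw column only,
    so in this model no page can be ruled out and all of them are read."""
    return list(pages)

print(len(pages_for_range_predicate(pages, date(2016, 1, 1), date(2016, 12, 31))))
print(len(pages_for_function_predicate(pages)))
```

In this model the range predicate touches one page out of three, while the function predicate touches all of them, which is the reason for rewriting the filter as a plain date range.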
A more generic alternative is to create a 'date dimension table' with one row per day covered by your tables (plus a couple of years into the future).
This is an example for Postgres: https://medium.com/#duffn/creating-a-date-dimension-table-in-postgresql-af3f8e2941ac
This enables you to write code like this:
Select count(1)
From table t join d_date d on t.datecolumn=d.date_actual
Where year_actual=2016
You may not have the generate_series() function on your system, but a 'select row_number()...' can do the same trick. A download is available here: https://www.ibm.com/developerworks/community/wikis/basic/anonymous/api/wiki/76c5f285-8577-4848-b1f3-167b8225e847/page/44d502dd-5a70-4db8-b8ee-6bbffcb32f00/attachment/6cb02340-a342-42e6-8953-aa01cbb10275/media/generate_series.tgz
A couple of further notes on 'date interval' where clauses:
Those columns are the most likely candidates for zonemap optimization. Add an 'organize on (datecolumn)' at the bottom of your table DDL and organize your table. That will cause Netezza to move records to pages with similar dates, and query times will improve.
Furthermore, you should ensure that the 'distribute on' clause for the table results in an even distribution across data slices if the table is big. The execution of the query will never be faster than the slowest data slice.
I hope this helps

Get data from database when you only know the row number and column name

How do I get data from an MS Access database when I only know the column name and row number?
example
select empID
from table
where row no is x
One way to get the 15th record is to use TOP twice. First, get the top 15 records ordered by id ascending. Then, from those, take the top 1 record ordered by id descending. That assumes you know a field in the table that you can order by.
SELECT TOP 1 * FROM
(
SELECT top 15 *
FROM [Order Details] d
ORDER BY d.[Order Id] asc
) q
ORDER BY q.[Order Id] desc
The above query works fine in MS Access 2007.
I will see if there is some indicator in the MS Access system tables that are not well documented.
While there is an MSysObjects hidden table with an LvProp (L-Value Property) column, that column is a long binary data type.
It looks to me that MS Access is storing the data in a binary format. However, there is no DBCC PAGE to view the internal record structure.
In short, I think the solution above using TOP and/or COUNT is the only way to go.
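The TOP-twice idea can be tried out with a small Python sketch using the standard sqlite3 module. SQLite has no TOP, so LIMIT stands in for it here; the table `order_details`, its columns, and the helper `nth_row` are hypothetical names for illustration, not part of the Access database in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_details (order_id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany(
    "INSERT INTO order_details VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 31)],
)

def nth_row(conn, n):
    """Return the n-th row by order_id: take the first n ascending,
    then the last of those by sorting them descending."""
    return conn.execute(
        """
        SELECT * FROM (
            SELECT * FROM order_details ORDER BY order_id ASC LIMIT ?
        ) AS q
        ORDER BY q.order_id DESC
        LIMIT 1
        """,
        (n,),
    ).fetchone()

print(nth_row(conn, 15))
```

As in the Access version, the result is only well defined because the rows are ordered by a known column.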

Oracle scalar function in WHERE clause leads to poor performance

I've written a scalar function (DYNAMIC_DATE) that converts a text value to a date/time. For example, DYNAMIC_DATE('T-1') (T-1 = today minus 1 = 'yesterday') returns 08-AUG-2012 00:00:00. It also accepts date strings: DYNAMIC_DATE('10/10/1974').
The function makes use of CASE statements to parse the sole parameter and calculate a date relative to sysdate.
While it doesn't make use of any table in its schema, it does make use of TABLE type to store date-format strings:
TYPE VARCHAR_TABLE IS TABLE OF VARCHAR2(10);
formats VARCHAR_TABLE := VARCHAR_TABLE ('mm/dd/rrrr','mm-dd-rrrr','rrrr/mm/dd','rrrr-mm-dd');
When I use the function in the SELECT clause, the query returns in < 1 second:
SELECT DYNAMIC_DATE('MB-1') START_DATE, DYNAMIC_DATE('ME-1') END_DATE
FROM DUAL
If I use it against our date dimension table (91311 total records), the query completes in < 1 second:
SELECT count(1)
from date_dimension
where calendar_dt between DYNAMIC_DATE('MB-1') and DYNAMIC_DATE('ME-1')
Others, however, are having problems with the function if it is used against a larger table (26,301,317 records):
/*
cost: 148,840
records: 151,885
time: ~20 minutes
*/
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between DYNAMIC_DATE('mb-1') and DYNAMIC_DATE('me-1')
However, the same query, using 'hard coded' dates, returns fairly rapidly:
/*
cost: 144,257
records: 151,885
time: 62 seconds
*/
SELECT count(1)
FROM ORDERS ord
WHERE trunc(ord.ordering_date) between to_date('01-JUL-2012','dd-mon-yyyy') AND to_date('31-JUL-2012','dd-mon-yyyy')
The vendor's vanilla installation doesn't include an index on the ORDERING_DATE field.
The explain plans for both queries (with the function and with hard-coded dates) are similar.
Is the DYNAMIC_DATE function being called repeatedly in the WHERE clause?
What else might explain the disparity?
** edit **
A NONUNIQUE index was added to ORDERS table. Both queries execute in < 1 second. Both plans are the same (approach), but the one with the function is lower cost.
I removed the DETERMINISTIC keyword from the function; the query executed in < 1 second.
Is the issue really with the function or was it related to the table?
3 years from now, when this table is even larger, and if I don't include the DETERMINISTIC keyword, will query performance suffer?
Will the DETERMINISTIC keyword have any effect on the function's results? If I run DYNAMIC_DATE('T-1') tomorrow, will I get the same results as if I ran it today (08/09/2012)? If so, this approach won't work.
If the steps of the plan are identical, then the total amount of work being done should be identical. If you trace the session (something simple like set autotrace on in SQL*Plus or something more sophisticated like an event 10046 trace), or if you look at DBA_HIST_SQLSTAT assuming you have licensed access to the AWR tables, are you seeing (roughly) the same amount of logical I/O and CPU consumption for the two queries? Is it possible that the difference in runtime you are seeing is the result of the data being cached when you run the second query?
I am guessing that the problem isn't with your function. Try creating a function-based index on trunc(ordering_date) and compare the explain plans.
CREATE INDEX ord_date_index ON orders (trunc(ordering_date));
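On the DETERMINISTIC question raised in the edit: the keyword is a promise that the same input always gives the same output, which a function relative to the current date cannot keep across days. The Python sketch below only illustrates that general risk of reusing results for such a function. It makes no claim about when, or whether, Oracle actually reuses results; `make_dynamic_date`, the injected clock, and the 'T-n' parsing are all assumptions for the example.

```python
from datetime import date, timedelta

def make_dynamic_date(today_fn, cache_results):
    """Build a toy DYNAMIC_DATE that understands only 'T-n' / 'T+n' expressions.
    With cache_results=True, a result is reused for the same input string."""
    cache = {}

    def dynamic_date(expr):
        if cache_results and expr in cache:
            return cache[expr]
        offset = int(expr[1:])  # 'T-1' -> -1
        result = today_fn() + timedelta(days=offset)
        cache[expr] = result
        return result

    return dynamic_date

clock = [date(2012, 8, 9)]  # "today" in the question
cached = make_dynamic_date(lambda: clock[0], cache_results=True)
fresh = make_dynamic_date(lambda: clock[0], cache_results=False)

print(cached("T-1"), fresh("T-1"))  # both give 2012-08-08

clock[0] = date(2012, 8, 10)        # the next day
print(cached("T-1"), fresh("T-1"))  # the cached value is now a day behind
```

The point of the sketch is only that a date-relative function is not deterministic in the sense the keyword promises, so declaring it as such is a risk to weigh against any performance gain.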

Oracle query returning different results for same set of data

I am facing a weird problem where the same query is returning different results.
My query is:
SELECT * FROM TX_HISTORY WHERE acct = 7 AND ROWNUM <= 100 ORDER BY acct,processing_date
I know that this account has more than 100 records in tx_history. I want to get the first 100 records based on the processing date.
For this account I have records from 2004 through 2011.
The problem is that sometimes it correctly shows the 100 records starting from 2004, but sometimes it shows 100 records starting from 2005.
I read that this can be solved by:
SELECT * FROM (select * from TX_HISTORY WHERE acct = 7 ORDER BY acct,processing_date)
where rownum <= 100
So, regarding my earlier query:
My understanding is that the ORDER BY is applied after the rownum <= 100 filter, and that Oracle returns the rows in an arbitrary order on which rownum then filters.
What I do not understand is why the results would vary.
Thanks,
~akila
If you do not specify any ordering (and in this case, as you already found out, you do not order the data being retrieved, you only sort afterwards), it is up to the database to return the rows in any order it sees fit.
It could for example just start reading the rows in the order they are stored, which changes as the data gets updated. It also does not have to start from the top of the table, it could start with the blocks already in the buffer cache.
Since you did not specify the order, the DB will choose (what it thinks to be) the least expensive way available to it at this particular moment.
Try this:
select top(100) from ...........
It gives the top 100 rows that you want.
If you include AND RowNum <= 100, Oracle will pull any 100 records it likes. If you put it in
SELECT *
FROM TX_HISTORY
WHERE acct = 7
AND ROWNUM <= 100
ORDER BY acct,processing_date
it is performed on all records there are.
However, if you have
SELECT *
FROM (select *
from TX_HISTORY
WHERE acct = 7
ORDER BY acct,processing_date)
where rownum <= 100
it is performed on the records returned by the sub-select (the SELECT within the parentheses). In other words, Oracle uses a different set of records to perform AND RowNum <= 100 on.
The ordering is performed on the records returned by the query, so it happens after the WHERE-clause. So you will probably still get varying results.
I hope I could make it clearer.
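The difference between the two query shapes can be mimicked in a few lines of Python. This is a toy model under stated assumptions: a shuffled list stands in for the arbitrary order in which the database hands back rows, and the row layout (acct, processing year) and both function names are invented for the example.

```python
import random

def rownum_then_sort(rows, n, key):
    """Mimics WHERE ROWNUM <= n ... ORDER BY: cut off first, sort afterwards."""
    return sorted(rows[:n], key=key)

def sort_then_rownum(rows, n, key):
    """Mimics the sub-select form: sort everything first, then keep the first n."""
    return sorted(rows, key=key)[:n]

# Hypothetical rows for acct 7: 50 per year from 2004 through 2011.
rows = [(7, year) for year in range(2004, 2012) for _ in range(50)]
random.Random(0).shuffle(rows)  # arbitrary retrieval order

by_acct_and_date = lambda r: (r[0], r[1])
first = rownum_then_sort(rows, 100, by_acct_and_date)
second = sort_then_rownum(rows, 100, by_acct_and_date)

print(sorted({year for _, year in first}))   # whatever years happened to come first
print(sorted({year for _, year in second}))  # only the earliest years
```

In the first form the 100 rows depend on the retrieval order, so the set of rows changes whenever that order changes; in the second form the cut-off is applied to already sorted rows and is stable.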

where rownum=1 query taking time in Oracle

I am trying to execute a query like
select * from tableName where rownum=1
This query is basically to fetch the column names of the table. There are more than a million records in the table. When I add the above condition, it takes a long time to fetch the first row. Is there any alternative way to get the first row?
This question has already been answered, I will just provide an explanation as to why sometimes a filter ROWNUM=1 or ROWNUM <= 1 may result in a long response time.
When encountering a ROWNUM filter (on a single table), the optimizer will produce a FULL SCAN with COUNT STOPKEY. This means that Oracle will read rows until it has found the first N rows (here N=1). A full scan reads blocks from the first extent up to the high water mark. Oracle has no way to determine beforehand which blocks contain rows and which don't, so blocks are read in order until N rows are found. If the first blocks are empty, this can result in many reads.
Consider the following:
SQL> /* rows will take a lot of space because of the CHAR column */
SQL> create table example (id number, fill char(2000));
Table created
SQL> insert into example
2 select rownum, 'x' from all_objects where rownum <= 100000;
100000 rows inserted
SQL> commit;
Commit complete
SQL> delete from example where id <= 99000;
99000 rows deleted
SQL> set timing on
SQL> set autotrace traceonly
SQL> select * from example where rownum = 1;
Elapsed: 00:00:05.01
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=7 Card=1 Bytes=2015)
1 0 COUNT (STOPKEY)
2 1 TABLE ACCESS (FULL) OF 'EXAMPLE' (TABLE) (Cost=7 Card=1588 [..])
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
33211 consistent gets
25901 physical reads
0 redo size
2237 bytes sent via SQL*Net to client
278 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
As you can see the number of consistent gets is extremely high (for a single row). This situation could be encountered in some cases where for example, you insert rows with the /*+APPEND*/ hint (thus above high water mark), and you also delete the oldest rows periodically, resulting in a lot of empty space at the beginning of the segment.
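The behaviour in the session above can be approximated with a small Python model. It is a rough sketch, not a description of Oracle's storage engine: the figure of 4 rows per block is an assumption (roughly 2000-byte rows in an 8 KB block), and `blocks_read_until_first_row` is a made-up helper.

```python
def blocks_read_until_first_row(blocks):
    """Scan blocks in order and stop at the first one that holds a row
    (the COUNT STOPKEY idea with N=1). Returns the number of blocks read."""
    reads = 0
    for block in blocks:
        reads += 1
        if block:  # block holds at least one row
            return reads
    return reads

ROWS_PER_BLOCK = 4  # assumption for illustration only
ids = list(range(1, 100_001))
blocks = [ids[i:i + ROWS_PER_BLOCK] for i in range(0, len(ids), ROWS_PER_BLOCK)]

print(blocks_read_until_first_row(blocks))  # full table: the first block already has a row

# Delete ids <= 99000: the rows disappear, but the emptied blocks
# stay in place below the high water mark and are still scanned.
after_delete = [[i for i in block if i > 99_000] for block in blocks]
print(blocks_read_until_first_row(after_delete))
```

Under these assumptions the scan goes from one block read to tens of thousands, which is the same order of magnitude as the physical reads in the trace.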
Try this:
select * from tableName where rownum<=1
There are some weird ROWNUM bugs; sometimes changing the query very slightly will fix it. I've seen this happen before, but I can't reproduce it.
Here are some discussions of similar issues: http://jonathanlewis.wordpress.com/2008/03/09/cursor_sharing/ and http://forums.oracle.com/forums/thread.jspa?threadID=946740&tstart=1
Surely Oracle has meta-data tables that you can use to get column names, like the sysibm.syscolumns table in DB2?
And, after a quick web search, that appears to be the case: see ALL_TAB_COLUMNS.
I'd use those rather than go to the actual table, something like (untested):
SELECT COLUMN_NAME
FROM ALL_TAB_COLUMNS
WHERE TABLE_NAME = 'MYTABLE'
ORDER BY COLUMN_NAME;
If you are hell-bent on finding out why your query is slow, you should revert to the standard method: asking your DBMS to explain the execution plan of the query for you. For Oracle, see section 9 of this document.
There's a conversation over at Ask Tom - Oracle that seems to suggest the row numbers are created after the select phase, which may mean the query is retrieving all rows anyway. The explain will probably help establish that. If it contains FULL without COUNT STOPKEY, then that may explain the performance.
Beyond that, my knowledge of Oracle specifics diminishes and you will have to analyse the explain further.
Your query is doing a full table scan and then returning the first row.
Try
SELECT * FROM table WHERE primary_key = primary_key_value;
The first row, particularly as it pertains to ROWNUM, is arbitrarily decided by Oracle. It may not be the same from query to query, unless you provide an ORDER BY clause.
So, picking a primary key value to filter by is as good a method as any to get a single row.
I think you're slightly missing the concept of ROWNUM - according to Oracle docs: "ROWNUM is a pseudo-column that returns a row's position in a result set. ROWNUM is evaluated AFTER records are selected from the database and BEFORE the execution of ORDER BY clause."
So it returns ANY row that it considers #1 in the result set, which in your case will contain 1M rows.
You may want to check out a ROWID pseudo-column: http://psoug.org/reference/pseudocols.html
I've recently had the same problem you're describing: I want one row from the very large table as a quick, dirty, simple introspection, and "where rownum=1" alone behaves very poorly. Below is a remedy which worked for me.
Select the max() of the first term of some index, and then use it to choose some small fraction of all rows with "rownum=1". Suppose my table has some index on numerical "group-id", and compare this:
select * from my_table where rownum = 1;
-- Elapsed: 00:00:23.69
with this:
select * from my_table where rownum = 1
and group_id = (select max(group_id) from my_table);
-- Elapsed: 00:00:00.01
