I am trying to pull a random sample of a population from a PeopleSoft database. My searches online have led me to think that the SAMPLE clause of the SELECT statement may be a viable option for us to use, however I am having trouble understanding how the SAMPLE clause determines the number of rows returned. I have looked at the Oracle documentation found here:
http://docs.oracle.com/cd/E11882_01/server.112/e26088/statements_10002.htm#i2065953
But the above reference only talks about the syntax used to create the sample. The reason for my question is that I need to understand how the sample percent determines the sample size returned. It seems like it applies a random number to the percent you ask for and then uses a seed number to count every "n" records. Our requirement is that we pull an exact number of samples (100, for example), that they are randomly selected, and that they are representative of the entire table (or at least the grouping of data we choose with filters).
In a population of 10200 items if I need a sample of approximately 100 items, I could use this statement:
SELECT * FROM PS_LEDGER SAMPLE(1) --1 % of my total population
WHERE DEPTID = '700064'
However, we need to pull an exact number of samples (in this case 100), so I could pick a sample size that almost always returns more than the number I need and then trim it down, i.e.:
SELECT Count(*) FROM PS_LEDGER SAMPLE(2.5) --this percent must always give > 100 items
WHERE DEPTID = '700064' and rownum < 101
My concern with doing that is that my sample would not uniformly represent the entire population. For example, if the sample function just pulls every Nth record after it creates its own randomly generated seed, then choosing rownum < 101 will cut off all of the records chosen from the bottom of the table. What I am looking for is a way to pull out exactly 100 records from the table, which are randomly selected and fairly representative of the entire table. Please help!!
Borrowing jonearles' example table, I see exactly the same thing (in 11gR2 on an OEL developer image), usually getting values of A heavily skewed towards 1; with small sample sizes I can sometimes see none at all. With the extra randomisation/restriction step I mentioned in a comment:
select a, count(*) from (
  select * from test1 sample (1)
  order by dbms_random.value
)
where rownum < 101
group by a;
... with three runs I got:
         A   COUNT(*)
---------- ----------
         1         71
         2         29

         A   COUNT(*)
---------- ----------
         1        100

         A   COUNT(*)
---------- ----------
         1         64
         2         36
Yes, 100% really came back as 1 on the second run. The skewing itself seems to be rather random. I tried with the block modifier which seemed to make little difference, perhaps surprisingly - I might have thought it would get worse in this situation.
This is likely to be slower, certainly for small sample sizes, as it has to hit the entire table; but does give me pretty even splits fairly consistently:
select a, count(*) from (
  select a, b from (
    select a, b, row_number() over (order by dbms_random.value) as rn
    from test1
  )
  where rn < 101
)
group by a;
With three runs I got:
         A   COUNT(*)
---------- ----------
         1         48
         2         52

         A   COUNT(*)
---------- ----------
         1         57
         2         43

         A   COUNT(*)
---------- ----------
         1         49
         2         51
... which looks a bit healthier. YMMV of course.
This Oracle article covers some sampling techniques, and you might want to evaluate the ora_hash approach as well, and the stratified version if your data spread and your requirements for 'representativeness' demand it.
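As a rough sketch of what the ora_hash route could look like (the bucket count of 100 and hashing on rowid are assumptions to adjust, not something the article mandates): hashing each rowid into 100 buckets and keeping one bucket gives an approximately 1% slice that is repeatable for a given table state.
select *
from ps_ledger
where deptid = '700064'
and ora_hash(rowid, 99) = 0;   -- buckets 0-99; keeping one bucket is roughly 1% of rows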
You can't trust SAMPLE to return a truly random set of rows from a table. The algorithm appears to be based on the physical properties of the table.
create table test1(a number, b char(2000));
--Insert 10K fat records. A is always 1.
insert into test1 select 1, level from dual connect by level <= 10000;
--Insert 10K skinny records. A is always 2.
insert into test1 select 2, null from dual connect by level <= 10000;
--Select about 10 rows.
select * from test1 sample (0.1) order by a;
Run the last query multiple times and you will almost never see any 2s. This may be an accurate sample if you measure by bytes, but not by rows.
This is an extreme example of skewed data, but I think it's enough to show that SAMPLE doesn't work the way the manual implies it should. As others have suggested, you'll probably want to ORDER BY DBMS_RANDOM.VALUE.
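Applied back to the original question, a minimal sketch of that suggestion (using the PS_LEDGER table and DEPTID filter from the question) might be:
select *
from (
  select *
  from ps_ledger
  where deptid = '700064'
  order by dbms_random.value   -- shuffle the filtered rows
)
where rownum <= 100;           -- then keep exactly 100 of them
This guarantees exactly 100 rows (assuming at least 100 match the filter), at the cost of sorting every matching row.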
I've been fiddling about with a similar question. First I set up what the sample size will be for the different strata. In your case there is only one ('700064'). So in a WITH clause or a temp table I did this:
Select DEPTID, Count(*) SAMPLE_ONE
FROM PS_LEDGER Sample(1)
WHERE DEPTID = '700064'
Group By DEPTID
This tells you how many records to expect in a 1% sample. Let's call that TABLE_1.
Then I did this:
Select
  Ceil( Rank() over (Partition by DEPTID Order by DBMS_RANDOM.VALUE)
        / (Select SAMPLE_ONE From TABLE_1) ) STRATUM_GROUP,
  A.*
FROM PS_LEDGER A
Make that another table. What you get then are random sample sets, each approximately 1% of the table in size.
So if your original table held 1000 records you would get 100 random sample sets with 10 items in each set.
You can then select one of these sets at random to test.
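For example, assuming the result above was stored in a table called SAMPLE_SETS (the name is made up here) with the computed STRATUM_GROUP column, one set could be picked at random like this:
select *
from sample_sets
where stratum_group = (
  select stratum_group
  from ( select distinct stratum_group
         from sample_sets
         order by dbms_random.value )
  where rownum = 1   -- one randomly chosen set
);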
Not sure if I've explained this very well, but it worked for me. I had 168 strata set up on a table with over 10 million records and it worked quite well.
If you want more explanation or can improve this please don't hesitate.
Regards
Related
I maintain a table in Oracle that contains several hundred thousand rows, including a priority column, which indicates for each row its importance according to the needs of the system.
ID  BRAND  COLOR  VALUE  SIZE  PRIORITY  EFFECTIVE_DATE_FROM  EFFECTIVE_DATE_TO
1   BL     BLUE   58345  12    1         10/07/2022           NULL
2   TK     BLACK  4455   1     1         10/07/2022           NULL
3   TK     RED    16358  88    2         11/01/2022           NULL
4   WRA    RED    98     10    6         18/07/2022           NULL
5   BL     BLUE   20942  18    7         02/06/2022           NULL
At any given moment thousands more rows may enter the table, and it is necessary to SELECT from it the 1000 rows with the highest priority.
Although the naive solution is to SELECT using ORDER BY PRIORITY ASC, we find that the process takes a long time when the table contains a very large number of rows (say over 2 million records).
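(For concreteness, the naive version being described would look roughly like this; the table name ITEMS is a placeholder for the real one, and FETCH FIRST requires Oracle 12c or later.)
select *
from items
order by priority asc
fetch first 1000 rows only;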
The solutions proposed so far are to divide the table into 2 different tables, so that in advance the records with priority 1 will be entered into Table A, and the rest of the records will be entered in Table B, and then we will SELECT using UNION between the two tables.
This way we save the ORDER BY process since Table A always contains priority 1, and it remains to sort only the data in Table B, which is expected to be smaller than the overall large table.
On the other hand, it was also suggested to leave the large table in place, and perform the SELECT using PARTITION BY on the priority column.
I searched the web for differences in speed or efficiency between the two options but did not find any, and I am debating how to do it. So which of the options is preferable if we focus on efficiency and time complexity?
Given a set of bit sequences, what's the quickest way to find the ones which contain a given number of 0's? Is there a bitwise operation/mask operation I can use for this?
Details:
All the sequences of bits have the same length (287).
When I say "quickest" I mean performance-wise, not the quickest to write/maintain.
These sequences will actually be stored in individual records in Oracle and SQL Server, and this operation will probably be executed in a query, but I think I can implement it, once I understand the logic.
Further details:
This is a way to find availability in a booking system with performance constraints. The solution I thought of was to store the availability as strings of bits, or as a number composed of 1's and 0's, with each bit representing a 5-minute interval.
When I have to find a slot of 30 minutes, I need to find 6 consecutive 0's. If you have better ideas, I would be very interested in exploring them.
Apart from my ranting as a comment, see whether a simple INSTR helps. Sample data (didn't feel like typing 287 bits) from lines #1 - 4; query itself begins at line #5.
SQL> with test (id, col) as
2 (select 1, '1011100001101001' from dual union all
3 select 2, '1010101000000110' from dual
4 )
5 select *
6 from test
7 where instr(col, '000000') > 0;
ID COL
---------- ----------------
2 1010101000000110
SQL>
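To generalize that to any slot length, the search string can be built from the number of 5-minute intervals required; a small variation on the query above, where :slots is a made-up bind variable (6 for a 30-minute slot):
with test (id, col) as
  (select 1, '1011100001101001' from dual union all
   select 2, '1010101000000110' from dual
  )
select *
from test
where instr(col, rpad('0', :slots, '0')) > 0;   -- rpad builds the run of zeros to search for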
I have a query similar to this
select *
from small_table A
inner join huge_table B on A.DATE =B.DATE
The huge_table is partitioned by DATE, and the PK is DATE, some_id and some_other_id (so the join is not done via the PK index).
small_table just contains a few dates.
The total execution time of the SQL is 48 minutes.
For some reason the explain plan gives me a "PARTITION RANGE (ALL)" with a high cardinality. It looks like it accesses the full table, not just the partitions indicated by small_table.DATE.
If I put the SQL inside a loop and do
for o in (select date from small_table)
loop
  for r in (select *
            from small_table A
            inner join huge_table B on A.DATE = B.DATE
            where B.DATE = o.date)
  loop
    null; -- process each row
  end loop;
end loop;
Only takes 2 minutes 40 seconds (the full loop).
There is any way to force the partition pruning on Oracle 12c?
Additional info:
small_table has 37 records for 13 different dates. huge_table has 8,000 million records across 179 dates/partitions. The SQL needs one field from small_table, but I can tweak the SQL to not use it.
Update:
With the use_nl hint, the cardinality shown in the execution plan is now more accurate and the execution time drops from 48 minutes to 4 minutes.
select /*+ use_nl(B) */ *
from small_table A
inner join huge_table B on A.DATE =B.DATE
This seems like the problem:
"small_table have 37 registries for 13 different dates. huge_table has 8.000 millions of registries with 179 dates/partitions....
The SQL need one field from small_table, but I can tweak the SQL to not use it "
According to the SQL you posted you're joining the two tables on just their DATE columns with no additional conditions. If that's really the case you are generating a cross join in which each partition of huge_table is joined to small_table 2-3 times. So your result set may be much larger than you're expecting, which means more database effort, which means more time.
The other thing to notice is that the cardinality of small_table to huge_table partitions is about 1:4; the optimizer doesn't know that there are really only thirteen distinct huge_table partitions in play.
Optimization ought to be a science and this is more guesswork than anything but try this:
select B.*
from ( select /*+ cardinality(t 13) */
distinct t.date
from small_table t ) A
inner join huge_table B
on A.DATE =B.DATE
This should communicate to the optimizer that only a small percentage of the huge_table partitions are required, which may make it choose partition pruning. Also it removes that Cartesian product, which should improve performance too. Obviously you will need to apply that tweak you mentioned, to remove the need to query anything else from small_table.
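One way to verify whether the rewrite actually achieves pruning (a standard check, not specific to this answer) is to look at the Pstart/Pstop columns in the plan output; KEY there indicates pruning resolved at run time:
explain plan for
select B.*
from ( select /*+ cardinality(t 13) */ distinct t.date
       from small_table t ) A
inner join huge_table B on A.DATE = B.DATE;

select * from table(dbms_xplan.display);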
I have a huge table (more than 1 billion rows) in Impala. I need to sample ~ 100,000 rows several times. What is the best to query sample rows?
As Jeff mentioned, what you've asked for exactly isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns the samples, comma-delimited as a single row. There is no way to change the number of samples yet. If there are fewer than 200,000 rows, all will be returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.
There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.
For example, sampling trivially from a table with 8 rows:
> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id) |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s
(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)
Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow its development.
In retrospect, knowing that TABLESAMPLE is unavailable, one could add a field "RVAL" (a random 32-bit integer, for instance) to each record, and sample repeatedly by adding "where RVAL > x and RVAL < y" for appropriate values of x and y. Non-overlapping intervals [x1,y1], [x2,y2], ... will be independent. You can also select using "where RVAL % 10000 = 1", "= 2", etc. for separate populations of independent subsets.
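A minimal sketch of that idea, assuming RVAL was populated at load time with something like cast(rand() * 1000000000 as bigint) and the table holds roughly 1 billion rows, so each predicate below returns on the order of 100,000 rows:
-- non-overlapping ranges give independent samples
select * from huge_table where rval >= 0      and rval < 100000;
select * from huge_table where rval >= 100000 and rval < 200000;

-- or disjoint modulo classes
select * from huge_table where rval % 10000 = 1;
select * from huge_table where rval % 10000 = 2;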
TABLESAMPLE mentioned in other answers is now available in newer versions of impala (>=2.9.0), see documentation.
Here's an example of how you could use it to sample 1% of your data:
SELECT foo FROM huge_table TABLESAMPLE SYSTEM(1)
or
SELECT bar FROM huge_table TABLESAMPLE SYSTEM(1) WHERE name='john'
It looks like the percentage argument must be an integer, so the smallest sample you can take is limited to 1%.
Keep in mind that the proportion of sampled data from the table is not guaranteed and may be greater than the specified percentage (in this case more than 1%). This is explained in greater detail in Impala's documentation.
If you are looking for a sample over certain column(s), you can check the answer below.
Say you have global data and you want to randomly pick 10% of it to create your dataset. You can use any combination of columns too, like city, zip code and state.
select * from
(
  select
    row_number() over (partition by country order by country, random()) rn,
    count(*) over (partition by country) cntpartition,
    tab.*
  from dat.mytable tab
) rs
where rs.rn between 1 and rs.cntpartition * 10/100 -- this is for 10% of the data
Link - Randomly sampling n rows in impala using random() or tablesample system()
I have a column defined as NUMBER(10,3) and this column is indexed. I was wondering, if I convert this column to an integer, will the index perform better on it? I will have to multiply by 10^7 and divide by 10^7 in my code for it. But I don't know if it is necessary?
Thanks,
It's almost certainly not going to be an appreciable difference.
The index may be slightly more compact because the integer representations may be slightly smaller than the fixed point representation. For example, the number 1234 requires 3 bytes of storage while the number 1.234 requires 4 bytes of storage. Occasionally, the reverse will be true and the fixed point value will require less storage, but it's 100 times more likely that the integer representation will be smaller than the reverse. You can see that yourself by populating a table with the first 1 million integers and the first million integers divided by 1000.
SQL> create table int_test( int_col number(38,0), fixed_col number(10,3) );
Table created.
SQL> insert into int_test
2 select level, level/1000
3 from dual
4 connect by level <= 1000000;
1000000 rows created.
SQL> select sum(vsize(int_col)) int_col_total_size,
2 sum(vsize(fixed_col)) fixed_col_total_size
3 from int_test;
INT_COL_TOTAL_SIZE FIXED_COL_TOTAL_SIZE
------------------ --------------------
3979802 4797983
SQL> ed
Wrote file afiedt.buf
1 select count(*) int_larger_than_fixed
2 from int_test
3* where vsize(int_col) > vsize(fixed_col)
SQL> /
INT_LARGER_THAN_FIXED
---------------------
8262
SQL> ed
Wrote file afiedt.buf
1 select count(*) fixed_larger_than_int
2 from int_test
3* where vsize(int_col) < vsize(fixed_col)
SQL> /
FIXED_LARGER_THAN_INT
---------------------
826443
While the index is going to be slightly more compact, that would only come into play if you're doing some extensive range scans or fast full scans on the index structure. It is very unlikely that there would be fewer levels in the index on the integer values, so single-row lookups would require just as much I/O. And it is pretty rare that you'd want to do large-scale range scans on an index. The fact that the data is more compact may also tend to increase contention on certain blocks.
My guess, therefore, is that the index would use slightly less space on disk but that you'd be hard-pressed to notice a performance difference. And if you're doing additional multiplications and divisions every time, the extra CPU that will consume is likely to cancel whatever marginal I/O benefit you might get. If your application happens to do a lot more index fast full scans than the average, you might see some reduced I/O.