Performance of query without using OR clause - performance

I am facing a problem. I have one query
Select * from tabA
where (a) in (a,b,c)
OR b in (a,b,c)
I want to facing performance issue due to this query as I need to remove the or condition , so I tried with the following query:
Select * from tabA
where (a,b) in (a,b,c)
but this query seems not to work, please help. I dont want to use 'or' condition.

If you logically need the OR condition, then that is what you need. There is nothing wrong with using OR. If both columns are indexed then the query is likely to take no longer than running these 2 queries independently:
select * from tabA
where a in (a,b,c);
select * from tabA
where b in (a,b,c);
The optimizer may well do that and concatenate the results like this:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=4 Card=2 Bytes=256)
1 0 CONCATENATION
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'TABA' (Cost=2 Card=1 Bytes=128)
3 2 INDEX (RANGE SCAN) OF 'TABA_A_IDX' (NON-UNIQUE) (Cost=1 Card=1)
4 1 TABLE ACCESS (BY INDEX ROWID) OF 'TABA' (Cost=2 Card=1 Bytes=128)
5 4 INDEX (UNIQUE SCAN) OF 'TABA_B_IDX' (NON-UNIQUE) (Cost=1 Card=1)

if the logic remains the same - you may try a UNION
Select * from tabA
where (a) in (a,b,c)
union
Select * from tabA
where b in (a,b,c)
also, check your indexes and explain plan results - indexing may solve the original OR issues.

You use wrong syntax, if you want pair compare values you should use smth like this:
select * from tabA
where (a,b) in ((a,b), (a,c), (b,c) etc.
Anyway in condition is transformed to multiple or conditions during query execution.
Provided you show table structure and execution plan people will be able to help you more effectively.

Related

How to Improve Cross Join Performance in Hive TEZ?

I have a hive table with 5 billion records. I want each of these 5 billion records to be joined with a hardcoded 52 records.
For achieving this I am doing a cross join like
select *
from table1 join table 2
ON 1 = 1;
This is taking 5 hours to run with the highest possible memory parameters.
Is there any other short or easier way to achieve this in less time ?
Turn on map-join:
set hive.auto.convert.join=true;
select *
from table1 cross join table2;
The table is small (52 records) and should fit into memory. Map-join operator will load small table into the distributed cache and each reducer container will use it to process data in memory, much faster than common-join.
Your query is slow because a cross-join(Cartesian product) is processed by ONE single reducer. The cure is to enforce higher parallelism. One way is to turn the query into an inner-join, so as to utilize map-side join optimization.
with t1 as (
selct col1, col2,..., 0 as k from table1
)
,t2 as (
selct col3, col4,..., 0 as k from table2
)
selct
*
from t1 join t2
on t1.k = t2.k
Now each table (CTE) has a fake column called k with identical value 0. So it works just like a cross-join while only a map-side join operation takes place.

Oracle SELECT * FROM LARGE_TABLE - takes minutes to respond

So I have a simple table with 5 or so columns, one of which is a clob containing some JSON data.
I am running
SELECT * FROM BIG_TABLE
SELECT * FROM BIG_TABLE WHERE ROWNUM < 2
SELECT * FROM BIG_TABLE WHERE ROWNUM = 1
SELECT * FROM BIG_TABLE WHERE ID=x
I expect that any fractionally intelligent relational database would return the data immediately. We are not imposing order by/group by clauses, so why not return the data as and when you find it?
Of all the forms of SELECT statements above, only 4. returned in a sub-second manner. This is unexpected for 1-3 which are returning between 1 and 10 minutes before the query shows any responses in SQL Developer. SQL Developer has the standard SQL Array Fetch Size of 50 (JDBC Fetch size of 50 rows) so at a minimum, it is taking 1-10 minutes to return 50 rows from a simple table with no joins on a super high-performance RAC cluster backed by fancy 4-tiered EMC disk subsystem.
Explain plans show a table scan. Fine, but why should I wait 1-10 minutes for the results with rownum in the WHERE clause?
What is going on here?
OK - I found the issue. ROWNUM does not operate like I thought it did and in the code above it never stops the full table scan.
This is because:
RowNum is assigned during the predicate operation (where clause evaluation) and incremented afterwards, i.e.: your row makes it into the result set and then gets rownum assigned.
In order to filter by rownum you need to already have it exist, something like ...
SELECT * FROM (SELECT * FROM BIG_TABLE) WHERE ROWNUM < 1
In effect what this means is that there is no way to filter out the top 5 rows from a table without having first selected the entire table if no other filter criteria are involved.
I solved my problem like this...
SELECT * FROM (SELECT * FROM BIG_TABLE WHERE
DATE_COL BETWEEN :Date1 AND :Date2) WHERE ROWNUM < :x;

Comparing Similar Hive Tables

I have two hive tables (t1 and t2) that I would like to compare. The second table has 5 additional columns that are not in the first table. Other than the five disjoint fields, the two tables should be identical. I am trying to write a query to check this. Here is what I have so far:
SELECT * FROM t1
UNION ALL
select * from t2
GROUP BY some_value
HAVING count(*) == 2
If the tables are identical, this should return 0 records. However, since the second table contains 5 extra fields, I need to change the second select statement to reflect this. There are almost 60 column names so I would really hate to write it like this:
SELECT * FROM t1
UNION ALL
select field1, field2, field3,...,fieldn from t2
GROUP BY some_value
HAVING count(*) == 2
I have looked around and I know there is no select * EXCEPT syntax, but is there a way to do this query without having to explicity name each column that I want included in the final result?
You should have used UNION DISTINCT for the logic you are applying.
However, the number and names of columns returned by each select_statement have to be the same otherwise a schema error is thrown.
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
To skip the 5 extra fields, you could use the "--ignore-columns" option.

How to get records randomly from the oracle database?

I need to select rows randomly from an Oracle DB.
Ex: Assume a table with 100 rows, how I can randomly return 20 of those records from the entire 100 rows.
SELECT *
FROM (
SELECT *
FROM table
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum < 21;
SAMPLE() is not guaranteed to give you exactly 20 rows, but might be suitable (and may perform significantly better than a full query + sort-by-random for large tables):
SELECT *
FROM table SAMPLE(20);
Note: the 20 here is an approximate percentage, not the number of rows desired. In this case, since you have 100 rows, to get approximately 20 rows you ask for a 20% sample.
SELECT * FROM table SAMPLE(10) WHERE ROWNUM <= 20;
This is more efficient as it doesn't need to sort the Table.
SELECT column FROM
( SELECT column, dbms_random.value FROM table ORDER BY 2 )
where rownum <= 20;
In summary, two ways were introduced
1) using order by DBMS_RANDOM.VALUE clause
2) using sample([%]) function
The first way has advantage in 'CORRECTNESS' which means you will never fail get result if it actually exists, while in the second way you may get no result even though it has cases satisfying the query condition since information is reduced during sampling.
The second way has advantage in 'EFFICIENT' which mean you will get result faster and give light load to your database.
I was given an warning from DBA that my query using the first way gives loads to the database
You can choose one of two ways according to your interest!
In case of huge tables standard way with sorting by dbms_random.value is not effective because you need to scan whole table and dbms_random.value is pretty slow function and requires context switches. For such cases, there are 3 additional methods:
1: Use sample clause:
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/SELECT.html#GUID-CFA006CA-6FF1-4972-821E-6996142A51C6
for example:
select *
from s1 sample block(1)
order by dbms_random.value
fetch first 1 rows only
ie get 1% of all blocks, then sort them randomly and return just 1 row.
2: if you have an index/primary key on the column with normal distribution, you can get min and max values, get random value in this range and get first row with a value greater or equal than that randomly generated value.
Example:
--big table with 1 mln rows with primary key on ID with normal distribution:
Create table s1(id primary key,padding) as
select level, rpad('x',100,'x')
from dual
connect by level<=1e6;
select *
from s1
where id>=(select
dbms_random.value(
(select min(id) from s1),
(select max(id) from s1)
)
from dual)
order by id
fetch first 1 rows only;
3: get random table block, generate rowid and get row from the table by this rowid:
select *
from s1
where rowid = (
select
DBMS_ROWID.ROWID_CREATE (
1,
objd,
file#,
block#,
1)
from
(
select/*+ rule */ file#,block#,objd
from v$bh b
where b.objd in (select o.data_object_id from user_objects o where object_name='S1' /* table_name */)
order by dbms_random.value
fetch first 1 rows only
)
);
To randomly select 20 rows I think you'd be better off selecting the lot of them randomly ordered and selecting the first 20 of that set.
Something like:
Select *
from (select *
from table
order by dbms_random.value) -- you can also use DBMS_RANDOM.RANDOM
where rownum < 21;
Best used for small tables to avoid selecting large chunks of data only to discard most of it.
Here's how to pick a random sample out of each group:
SELECT GROUPING_COLUMN,
MIN (COLUMN_NAME) KEEP (DENSE_RANK FIRST ORDER BY DBMS_RANDOM.VALUE)
AS RANDOM_SAMPLE
FROM TABLE_NAME
GROUP BY GROUPING_COLUMN
ORDER BY GROUPING_COLUMN;
I'm not sure how efficient it is, but if you have a lot of categories and sub-categories, this seems to do the job nicely.
-- Q. How to find Random 50% records from table ?
when we want percent wise randomly data
SELECT *
FROM (
SELECT *
FROM table_name
ORDER BY DBMS_RANDOM.RANDOM)
WHERE rownum <= (select count(*) from table_name) * 50/100;

SQL/Oracle: when indexes on multiple columns can be used

If I create an index on columns (A, B, C), in that order, my understanding is that the database will be able to use it even if I search only on (A), or (A and B), or (A and B and C), but not if I search only on (B), or (C), or (B and C). Is this correct?
There are actually three index-based access methods that Oracle can use when a predicate is placed on a non-leading column of an index.
i) Index skip-scan: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm#PFGRF10105
ii) Fast full index scan: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm#i52044
iii) Index full scan: http://download.oracle.com/docs/cd/B19306_01/server.102/b14211/optimops.htm#i82107
I've most often seen the fast full index scan "in the wild", but all are possible.
That is not correct. Always best to come up with a test case that represents your data and see for yourself. If you want to really understand the Oracle SQL Optimizer google Jonathan Lewis, read his books, read his blog, check out his website, the guy is amazing, and he always generates test cases.
create table mytab nologging as (
select mod(rownum, 3) x, rownum y, mod(rownum, 3) z from all_objects, (select 'x' from user_tables where rownum < 4)
);
create index i on mytab (x, y, z);
exec dbms_stats.gather_table_stats(ownname=>'DBADMIN',tabname=>'MYTAB', cascade=>true);
set autot trace exp
select * from mytab where y=5000;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=1 Card=1 Bytes=10)
1 0 INDEX (SKIP SCAN) OF 'I' (INDEX) (Cost=1 Card=1 Bytes=10)
Up to version Oracle 8 an index will never be used unless the first column is included in the SQL.
In Oracle 9i the Skip Scan Index Access feature was introduced, which lets the Oracle CBO attempt to use indexes even when the prefix column is not available.
Good overview of how skip scan works here: http://www.quest-pipelines.com/newsletter-v5/1004_C.htm

Resources