I have two tables that use ORC compression, and I am using Tez as the execution engine. Table_a contains more than 900k records and table_b contains 17 million records. The query below runs for a very long time: I waited for two days and it still had not finished. What am I doing wrong in this query?
select min(up.id) as comp002uniqueid, min(cp.product_id) as p_id
from
(select * from table_a where u_id is null) up , table_b cp
where cp.title like concat('% ',up.productname,' %')
group by up.productname;
Database: Oracle 11g
Scenario:
TABLE_A has around 50 million records
TABLE_A has COLUMN_A, COLUMN_B, COLUMN_C, COLUMN_D, COLUMN_E
COLUMN_A is the primary key of TABLE_A
We need to delete around 30 million records from TABLE_A
So, we created another table, TABLE_B
TABLE_B has COLUMN_A, containing all the GUIDs that qualify for deletion from TABLE_A (matched on TABLE_A.COLUMN_A)
TABLE_B has another column, QUALIFIER, which is populated with a sequence running from 1 up to the total record count, roughly 30 million
TABLE_B is also range-partitioned on the QUALIFIER column; each partition covers 3 million records
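For reference, a minimal sketch of what such a table definition might look like; the RAW(16) datatype for the GUID column is an assumption, and the partition names are taken from Approach-III below:
CREATE TABLE TABLE_B (
  COLUMN_A   RAW(16)  NOT NULL,  -- GUIDs of the rows to delete from TABLE_A
  QUALIFIER  NUMBER   NOT NULL   -- sequence 1 .. ~30 million
)
PARTITION BY RANGE (QUALIFIER) (
  PARTITION TABLE_B_PARTITION_1  VALUES LESS THAN (3000001),
  PARTITION TABLE_B_PARTITION_2  VALUES LESS THAN (6000001),
  -- partitions 3 through 9 follow the same 3-million-record pattern
  PARTITION TABLE_B_PARTITION_10 VALUES LESS THAN (MAXVALUE)
);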
Which of the approaches below would be the most efficient way of deleting the records in this scenario? We are planning to perform this task over a weekend with minimal downtime, and we also want to avoid any segment space issues caused by the bulk delete:
Approach-I: Use a single direct delete statement, without splitting the data, as follows:
Delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B)
Also, can we use parallel hints to improve the performance, for example:
Delete /*+ parallel first_rows*/ from TABLE_A where COLUMN_A in (select /*+ parallel first_rows*/ COLUMN_A from TABLE_B);
Approach-II: Delete the records from TABLE_A by splitting the data into ranges on the QUALIFIER column, to avoid any segment space issues; this also lets the records be deleted in iterations.
Delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B where QUALIFIER between 1 and 3000000);
Delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B where QUALIFIER between 3000001 and 6000000);
Delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B where QUALIFIER between 6000001 and 9000000);
and so on, until
Delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B where QUALIFIER between 27000001 and 30000000);
Also, can we use parallel hints in the above delete statements to improve performance?
Approach-III: Delete the records from TABLE_A by splitting the data based on the partitions of TABLE_B:
delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B PARTITION (TABLE_B_PARTITION_1));
delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B PARTITION (TABLE_B_PARTITION_2));
delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B PARTITION (TABLE_B_PARTITION_3));
and so on, until
delete from TABLE_A where COLUMN_A in (select COLUMN_A from TABLE_B PARTITION (TABLE_B_PARTITION_10));
Also, can we use parallel hints in the above delete statements to improve performance?
Is there any other, better approach to follow for the scenario described above?
I need to query many tables in an Oracle database in a single query for a typeahead function, which means I need to run the query on every keypress on a homepage.
select * from
(select * from table1
union all
select * from table2
union all
select * from table3
union all
select * from table4
)
where column_name like '%xxx%'
The union of all the tables results in a dataset of around 300,000 records.
Is there any way this query can be optimized, so that Oracle somehow keeps the combined dataset in a cache or temporary table that can be reused for the next keypress?
As you said, you could create a global temporary table with the ON COMMIT PRESERVE ROWS option, store the result in it, and use that data for the rest of the session.
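A minimal sketch of that approach, reusing the placeholder table and column names from the question (the temporary table name typeahead_data is made up here, and it assumes table1..table4 are union-compatible, as in the original query):
-- one-time DDL: rows survive commits and vanish when the session ends
create global temporary table typeahead_data
  on commit preserve rows
as select * from table1 where 1 = 0;   -- copies the structure only, no rows

-- once per session: load the combined dataset
insert into typeahead_data
select * from table1
union all
select * from table2
union all
select * from table3
union all
select * from table4;

-- every keypress then searches only the preloaded temporary table
select * from typeahead_data where column_name like '%xxx%';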
I need to select the rows in one table (table_A, fairly small, < 10K rows) that have no corresponding rows in another table (table_B, over 500K rows), based on column_X (table_B has a btree index on that column).
If I use the following query:
select a.column1,
a.column2,
a.column3,
a.column_X,
b.column_X
from table_A a
left outer join table_B b on a.column_X = b.column_X
where a.column_X <> 0
and b.column_X is null
the query (168 resulting rows) executes in about 600 ms.
If, on the other hand, I try a different query:
select column1,
column2,
column3,
column_X
from table_A
where column_X not in (
select column_X
from table_B
where column_X is not null
)
and column_X <> 0
it takes around 8 minutes to retrieve the same 168 rows. column_X is of type bigint and casting seems to make no difference (in the second query the index is never used).
Any idea?
A NOT IN subselect is optimized much worse than the other forms. Because of its different semantics (with respect to NULLs), PostgreSQL cannot use an anti-join for it. If you can, don't use this pattern; use NOT EXISTS or an outer join instead.
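For example, the slow query above could be rewritten with NOT EXISTS, which PostgreSQL can plan as an anti-join and satisfy with the index on table_B.column_X (a sketch using the tables and columns from the question):
select a.column1,
       a.column2,
       a.column3,
       a.column_X
from table_A a
where a.column_X <> 0
  and not exists (
        select 1
        from table_B b
        where b.column_X = a.column_X
      );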
I need to convert a query from Oracle SQL to Postgres.
select count(*) from table1 group by column1 having max(rownum) = 4
If I replace "rownum" with "row_number() over()", I get an error message: "window functions are not allowed in HAVING".
Could you help me to get the same result in Postgres, as in Oracle?
The query below will do what your Oracle query is doing.
select count(*) from
(select column1, row_number() over () as x from table1) as t
group by column1 having max(t.x) = 4;
However
Neither Oracle nor Postgres will guarantee the order in which records are read unless you specify an ORDER BY clause, so running the query multiple times will give inconsistent results depending on how the database decides to process it. Certainly in Postgres, any updates will change the underlying row order.
In the example below I've added an extra column, seq, which is used to provide a consistent sort.
CREATE TABLE table1 (column1 int, seq int);
insert into table1 values (0,1),(0,2),(0,3),(1,4),(0,5),(1,6);
And a revised query which forces the order to be consistent:
select count(*) from
(select column1, row_number() over (order by seq) as x from table1) as t
group by column1 having max(t.x) = 6;
I run a simple query:
select * from tableA where fk in (select pk from tableB where column='somevalue')
It returns 0 rows and takes around 20 seconds.
But when I run:
delete from tableA where fk in (select pk from tableB where column='somevalue')
It runs for around 3 minutes and then fails with an error: cannot extend temp segment!
But this makes no sense: the delete query deletes nothing!
How can I work around this?