Dynamically load partitions in Hive with predicate pushdown - hadoop

I have a very large table in Hive, from which we need to load a subset of partitions. It looks something like this:
CREATE EXTERNAL TABLE table1 (
col1 STRING
) PARTITIONED BY (p_key STRING);
I can load specific partitions like this:
SELECT * FROM table1 WHERE p_key = 'x';
with p_key being the key on which table1 is partitioned. If I hardcode it directly in the WHERE clause, it's all good. However, I have another query which calculates which partitions I need. It's more complicated than this, but let's define it simply as:
SELECT DISTINCT p_key FROM table2;
So now I should be able to construct a dirty query like this:
SELECT * FROM table1
WHERE p_key IN (SELECT DISTINCT p_key FROM table2);
Or written as an inner join:
SELECT t1.* FROM table1 t1
JOIN (SELECT DISTINCT p_key FROM table2) t2 ON t1.p_key = t2.p_key
However, when I run this, it takes long enough to make me believe it's doing a full table scan. In the explain output for the above queries, I can also see that the results of the DISTINCT operation are used in the reducer, not the mapper, meaning the mapper has no way of knowing which partitions should be loaded. Granted, I'm not fully familiar with Hive explain output, so I may be overlooking something.
I found this page: MapJoin and Partition Pruning on the Hive wiki, and the corresponding ticket indicates it was released in version 0.11.0, so I should have it.
Is it possible to do this? If so, how?

I'm not sure how to help with MapJoin, but in the worst case you could dynamically create a second query with something like:
SELECT concat("SELECT * FROM table1 WHERE p_key IN ('",
              concat_ws("','", collect_set(p_key)),
              "')")
FROM table2;
and then execute the obtained result. The extra quoting ensures that the string partition values come out quoted in the generated IN list. With this, the query processor should be able to prune the unneeded partitions.
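To illustrate, if table2 contained the (hypothetical) partition values 'a', 'b' and 'c', the statement above would return a second query like this, where the literal values allow the planner to prune partitions:
SELECT * FROM table1 WHERE p_key IN ('a','b','c');
You would then capture that output, for example from a wrapper script, and run it as a second step.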

Related

Why is reading from one table in Oracle slower than from another table in the same database

I am doing a simple Select col1, col2, col22 from Table1 order by col1 and the same Select statement on Table2: Select col1, col2, col22 from Table2 order by col1.
I use the Pentaho ETL tool to replicate data from Oracle 19c to SQL Server. Reading from Table1 is much, much slower than reading from Table2. Both have almost the same number of columns and almost the same number of rows. Both exist in the same schema. Table1 is being read at 10 rows per second while Table2 is being read at 1000 rows per second.
What can cause this slowness?
Are the indexes the same on the two tables? It's possible Oracle is using a fast full index scan (like a skinny version of the table) if an index covers all the relevant columns in one table, or may be using a full index scan to pre-sort by COL1. Check the execution plans to make sure the statements are using the same access methods:
explain plan for select ...;
select * from table(dbms_xplan.display);
Are the table segment sizes the same? Although the data could be the same, occasionally a table can have a lot of wasted space. For example, the table may once have contained a billion rows, then 99.9% of the rows were deleted, but the table was never rebuilt. Compare the segment sizes with a query like this:
select segment_name, sum(bytes)/1024/1024 mb
from all_segments
where segment_name in ('TABLE1', 'TABLE2')
group by segment_name;
It depends on many factors.
The first things I would check are the table indexes:
select
uic.table_name,
uic.index_name,
utc.column_name
from USER_TAB_COLUMNS UTC,
USER_IND_COLUMNS UIC
where utc.table_name = uic.table_name
and utc.column_name = uic.column_name
and utc.table_name in ('TABLE1', 'TABLE2')
order by 1, 2, 3;

How to make insert statement re-runnable?

Need to add two following insert statements:
insert into table1(schema, table_name, table_alias)
values ('ref_owner','test_table_1','tb1');
insert into table1(schema, table_name, table_alias)
values ('dba_owner','test_table_2','tb2');
The question is: how can I make these two insert statements re-runnable? Meaning, if they are run again, they should throw a "row exists" error or something along those lines.
Additional notes:
1. I've seen examples of MERGE in Oracle, however, that's only when you're using two tables to match records. In this case I'm only using a single table.
2. The table does not have any primary, unique or foreign keys - only check constraints on one of the columns.
Any help is highly appreciated.
You can use a MERGE statement, as follows:
MERGE into table1 t1
USING (SELECT 'ref_owner' AS SCHEMA_NAME, 'test_table_1' AS TABLE_NAME, 'tb1' AS ALIAS_NAME FROM DUAL
UNION ALL
SELECT 'dba_owner', 'test_table_2', 'tb2' FROM DUAL) d
ON (t1.SCHEMA = d.SCHEMA_NAME AND
t1.TABLE_NAME = d.TABLE_NAME)
WHEN NOT MATCHED THEN
INSERT (SCHEMA, TABLE_NAME, TABLE_ALIAS)
VALUES (d.SCHEMA_NAME, d.TABLE_NAME, d.ALIAS_NAME)
Best of luck.
You should have a primary key, especially when you want to check for duplicate records and data integrity.
Provide a primary key for your table, or, if you somehow do not want to do that, create a unique constraint for all of the columns in the table, so no duplicate rows are possible.
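A minimal sketch of the constraint approach, assuming the combination of schema and table_name should identify a row (the constraint name is made up):
ALTER TABLE table1 ADD CONSTRAINT table1_uk UNIQUE (schema, table_name);
Re-running either insert afterwards raises ORA-00001 (unique constraint violated) instead of silently creating a duplicate row.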

Hive: Fast way to concatenate two tables into one?

I have two Hive tables of the same structure (schema). What would be an efficient SQL request to concatenate them into a single table with the same structure?
Update, this works quite fast in my case:
CREATE TABLE xy AS SELECT *
FROM (
SELECT *
FROM x
UNION ALL
SELECT *
FROM y
) tmp;
If you are trying to merge table_A and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
"union all" is a right solution but might be expensive, resource/time wise. I'd recommend creating a table with two partitions, one for table A and another for Table B. This way, no need to merge (or union all). The merged table is available as soon as both partitions get populated.

Sybase: insert from another database on the same server

I am trying to get all the extra data from one database and insert it into another.
I want to omit the column names and hard-code only the table name. However, some fields in a table are system-generated, like an id, which is not important data but will still create an integrity issue. How can I insert just the wanted details while omitting those columns, given that the names of the columns to omit also change? I can't do a total insert, just the addition of some extra data.
So far I have come up with this:
while 1=1
begin
    if exists (select 1 from db1.table1 where ... not in (select * from db2.table1))
    begin
        insert into db2.table1 (columns) select (columns) from db1.table1
    end
    if (@@rowcount = 0)
        break
end
Please advise how I can optimize this with the least possible hard-coding. I have left the PK part out intentionally, as the query is big.
If you want to do something like:
insert into TAB
select * from TAB2
or
insert into TAB
select col1,col2 from TAB2
or
insert into TAB (col1,col2)
select * from TAB2
where TAB and TAB2 have a different number or type of columns, it's not possible, because it will generate an error.
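A minimal sketch of an insert that skips a system-generated column; the column names (id, col1, col2) and the key used for the duplicate check are assumptions, since the real ones vary per table:
insert into db2.table1 (col1, col2)
select src.col1, src.col2
from db1.table1 src
where not exists (select 1 from db2.table1 dst where dst.col1 = src.col1)
Listing the columns explicitly is what lets you leave the id column out, so the target database generates its own value.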

Hive: Create New Table from Existing Partitioned Table

I'm using Amazon's Elastic MapReduce and I have a hive table created based on a series of log files stored in Amazon S3 and split in folders by day like so:
data/day=2011-09-01/log_file.tsv
data/day=2011-09-02/log_file.tsv
I am currently trying to create an additional table which filters out some unwanted activity in these log files but I can't figure out how to do this and keep getting errors such as:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
If my initial table create statement looks something like this:
CREATE EXTERNAL TABLE IF NOT EXISTS table1 (
... fields ...
)
PARTITIONED BY ( DAY STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/data/';
That initial table works fine and I've been able to query it with no problems.
How then should I create a new table that shares the structure of the previous one but simply filters out data? This doesn't seem to work.
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
FROM table1
INSERT OVERWRITE TABLE table2
SELECT * WHERE
col1 = '%somecriteria%' AND
more criteria...
;
As I've stated above, this returns:
FAILED: Error in semantic analysis: need to specify partition columns because the destination table is partitioned.
Thanks!
This always works for me:
CREATE EXTERNAL TABLE IF NOT EXISTS table2 LIKE table1;
INSERT OVERWRITE TABLE table2 PARTITION (day) SELECT col1, col2, ..., day FROM table1;
ALTER TABLE table2 RECOVER PARTITIONS;
Notice that I've added 'day' as a column in the SELECT statement. Also notice that there is an ALTER TABLE line which is necessary for Hive to become aware of the partitions that were newly created in table2.
I have never used the LIKE option... so thanks for showing me that. Will that actually create all of the partitions that the first table has as well? If not, that could be the issue. You could try using dynamic partitions:
create external table if not exists table2 like table1;
insert overwrite table table2 partition (day) select col1, col2, day from table1;
Might not be the best solution, as I think you have to specify your columns in the select clause (as well as the partition column in the partition clause).
And, you must turn on dynamic partitioning.
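The settings usually needed before a dynamic-partition insert are:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;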
I hope this helps.
