Overwrite multiple partitions at once in Hive (Hadoop)

I have a partitioned external Hive table that I have to overwrite with some records.
There are a lot of dates that we need to reload, and the queries are fairly heavy.
What we want to know: is it possible to load two or more different partitions simultaneously?
For example, 3 (or more) processes running in parallel like:
Process 1
insert overwrite table table_prod partition (data_date)
select * from table_old where data_date=20221110;
Process 2
insert overwrite table table_prod partition (data_date)
select * from table_old where data_date=20221111;
Process 3
insert overwrite table table_prod partition (data_date)
select * from table_old where data_date=20221112;

The short answer is yes, you can.
The real question is how, because you have to consider the large volume of data.
Option 1 - yes, you can use a shell script or a scheduler tool. But the query you're using will be slow; static partitioning, where you name the partition value explicitly, is much faster:
insert overwrite table table_prod partition (data_date=20221110) -- note the explicit partition value
select
col1, col2... -- exclude the data_date column from the select list
from table_old where data_date=20221110;
Option 2 - you can also use dynamic partitioning to load all the partitions at once. This is a performance-intensive operation, but you don't have to create any shell script or other process.
insert overwrite table table_prod partition (data_date)
select * from table_old;
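Note that a dynamic partition insert like this usually needs dynamic partitioning enabled in the session first. A minimal sketch, assuming the standard Hive configuration properties:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict; -- allow every partition column to be dynamic
insert overwrite table table_prod partition (data_date)
select * from table_old; -- the partition column must be last in the select list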

Related

Oracle `partition by` in select clauses, does it create these partitions permanently?

I only have a superficial understanding of partitions in Oracle, but I know you can create persistent partitions in Oracle, for example within a create table statement. But what about partition by clauses within a select statement? Will Oracle create a persistent partition, for caching reasons or whatever, or will the partition be "temporary" in some sense (e.g., removed at the end of the session, the query, or after some time)?
For example, for a query like
SELECT col1, first_value(col2)
over (partition by col3 order by col2 nulls last) as colx
FROM tbl
If I execute that query, will Oracle create a partition to speed up the execution if I execute it again, tomorrow or three months later? I'm worried about that because I don't know if it could cause memory exhaustion if I abuse that feature.
partition by here is part of a window function: it computes an aggregated result grouped by the columns mentioned in the partition by clause. It behaves like group by, but it can attach the grouped result to each row without actually collapsing the final output.
It has nothing to do with table/index partitioning.
The scope of this partition by is just the query; it has no impact on the table structure and nothing persists between executions.
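For illustration, a sketch against a hypothetical emp table:
-- group by collapses the rows: one result row per department
select dept, max(salary) as max_sal from emp group by dept;
-- partition by keeps every row and attaches the per-group value to each of them
select emp_id, dept, salary,
       max(salary) over (partition by dept) as dept_max_sal
from emp;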

I have n (large) small-sized txt files in Hive

I have n (large) small-sized txt files which I want to merge into k (small) files.
If you have a Hive table on top of these txt files, then use:
insert overwrite table <db>.<existing_table> select * from <db>.<existing_table> order by <col_name>;
Hive supports selecting from and overwriting the same table. The order by clause forces the query to run with 1 reducer, which results in only 1 file being created in the directory.
However, if you have a lot of data, order by will not perform well; use a sort by (or clustered by) clause instead to initiate more than 1 reducer:
insert overwrite table <db>.<existing_table> select * from <db>.<existing_table> sort by <col_name>;
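To aim for roughly k output files, you can pin the reducer count before the insert. A sketch, assuming a target of 10 files (on older versions the property is mapred.reduce.tasks):
set mapreduce.job.reduces=10; -- ~10 reducers, hence ~10 output files
insert overwrite table <db>.<existing_table>
select * from <db>.<existing_table> sort by <col_name>;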

Combine Multiple Hive Tables as single table in Hadoop

Hi, I have around 15-20 Hive tables, all with a common schema. I need to combine all of them into a single table. The single table will be queried from a reporting tool, so performance also needs to be taken care of.
I tried this:
create table new as
select * from table_a
union all
select * from table_b
Is there any other, more efficient way to combine all the tables? Any help will be appreciated.
Hive will process the unioned queries in parallel if you set hive.exec.parallel to true. With hive.exec.parallel.thread.number you can specify the number of parallel threads. This increases the overall efficiency.
If you are trying to merge table_a and table_b into a single one, the easiest way is to use the UNION ALL operator. You can find the syntax and use cases here - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union
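Putting the two suggestions together, a sketch (the thread count of 8 is an arbitrary choice, and combined is a hypothetical table name):
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
create table combined as
select * from table_a
union all
select * from table_b;
-- add one union all branch per remaining table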

Oracle Performance: Partition Split vs Insert Append

I am currently refactoring a data loading process in Oracle 12c and am exploring partitioning as a solution. My data is categorised by date and I have ~500k records per date which seems to fit the "minimum optimal" for partitioning, or so I am told. The original plan was to use a staging table to load the data, then add a dummy partition to the main table and perform a partition exchange. However, my data load contains data from several days rather than from one day. Preliminary research suggests there are two methods to solve this problem:
Option 1: Perform the partition exchange, then split the large partition in a loop
ALTER TABLE MAIN_TABLE ADD PARTITION DUMMY_PARTITION VALUES LESS THAN (TO_DATE('1-1-9999', 'DD-MM-YYYY'));
ALTER TABLE MAIN_TABLE
EXCHANGE PARTITION DUMMY_PARTITION
WITH TABLE STAGING_TABLE
WITHOUT VALIDATION UPDATE GLOBAL INDEXES;
BEGIN
  -- rec rather than row as the loop variable: ROW is an Oracle reserved word
  FOR rec IN (SELECT DISTINCT TO_CHAR(DATE_FIELD + 1, 'YYYYMMDD') AS DATE_FIELD
              FROM PARTITIONED_TABLE
              ORDER BY DATE_FIELD)
  LOOP
    EXECUTE IMMEDIATE 'ALTER TABLE MAIN_TABLE SPLIT PARTITION DUMMY_PARTITION AT (TO_DATE('''||rec.DATE_FIELD||''', ''YYYYMMDD'')) INTO (PARTITION p'||rec.DATE_FIELD||', PARTITION DUMMY_PARTITION) UPDATE GLOBAL INDEXES';
  END LOOP;
END;
/
Option 2: Perform an insert append
INSERT /*+ append */ INTO MAIN_TABLE SELECT * FROM STAGING_TABLE;
Somehow, it seems like splitting the partition is a slower process than doing the insert. Is this expected behaviour or am I missing something?
There is a fast split optimization, depending on the circumstances. However, from your description, I would simply do the INSERT /*+ APPEND */.
You may want to employ some parallelism too, if you have the resources and are looking to speed up the inserts.
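A sketch of the parallel direct-path variant on the question's tables (the degree of 4 is an arbitrary choice):
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND PARALLEL(MAIN_TABLE, 4) */ INTO MAIN_TABLE
SELECT * FROM STAGING_TABLE;
COMMIT; -- a direct-path insert must be committed before the table can be queried again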

Data loading in Oracle

I am facing a problem loading data. I have to copy 800,000 rows from one table to another in an Oracle database.
I tried 10,000 rows first, but the time it took was not satisfactory. I tried using BULK COLLECT and an "INSERT INTO SELECT" clause, but in both cases the response time was around 35 minutes. This is not the response time I'm looking for.
Does anyone have any suggestions?
Anirban,
Using an "INSERT INTO SELECT" is the fastest way to populate your table. You may want to extend it with one or two of these hints:
APPEND: to use direct path loading, circumventing the buffer cache
PARALLEL: to use parallel processing if your system has multiple cpu's and this is a one-time operation or an operation that takes place at a time when it doesn't matter that one "selfish" process consumes more resources.
Just using the append hint on my laptop copies 800,000 very small rows in under 5 seconds:
SQL> create table one_table (id,name)
2 as
3 select level, 'name' || to_char(level)
4 from dual
5 connect by level <= 800000
6 /
Table created.
SQL> create table another_table as select * from one_table where 1=0
2 /
Table created.
SQL> select count(*) from another_table
2 /
COUNT(*)
----------
0
1 row selected.
SQL> set timing on
SQL> insert /*+ append */ into another_table select * from one_table
2 /
800000 rows created.
Elapsed: 00:00:04.76
You mention that this operation takes 35 minutes in your case. Can you post some more details, so we can see what exactly is taking 35 minutes?
Regards,
Rob.
I would agree with Rob. INSERT INTO ... SELECT is the fastest way to do this.
What exactly do you need to do? If you're trying to do a table rename by copying to a new table and then deleting the old one, you might be better off doing a table rename directly:
alter table sometable rename to someothertable;
INSERT INTO SELECT is the fastest way to do it.
If possible/necessary, disable all indexes on the target table first.
If you have no existing data in the target table, you can also try CREATE TABLE ... AS SELECT.
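A sketch of that variant, combining CREATE TABLE ... AS SELECT with nologging and parallelism (table names are hypothetical, the degree is arbitrary):
CREATE TABLE target_table NOLOGGING PARALLEL 4
AS SELECT * FROM source_table;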
As with the above, I would recommend INSERT INTO ... SELECT ... or CREATE TABLE ... AS SELECT ... as the fastest way to copy a large volume of data between two tables.
You want to look up the direct-load insert in your Oracle documentation. This adds two items to your statements: parallel and nologging. Repeat the tests, but do the following:
CREATE TABLE Table2 AS SELECT * FROM Table1 WHERE 1=2;
ALTER TABLE Table2 NOLOGGING;
ALTER TABLE Table2 PARALLEL (10);
ALTER TABLE Table1 PARALLEL (10);
ALTER SESSION ENABLE PARALLEL DML;
-- the APPEND hint makes this a direct-path (direct-load) insert
INSERT /*+ APPEND */ INTO Table2 SELECT * FROM Table1;
COMMIT;
ALTER TABLE Table2 LOGGING;
The NOLOGGING setting turns off redo logging for the direct-path inserts into the table. If the system crashes, that data cannot be recovered from the redo logs, so take a fresh backup after the load. The PARALLEL clause uses N worker threads to copy the data in blocks. You'll have to experiment with the number of parallel worker threads to get the best results on your system.
Is the table you are copying to the same structure as the other table? Does it have data, or are you creating a new one? Can you use exp/imp? exp can be given a query to limit what it exports, which can then be imported into the db. What is the total size of the table you are copying from? If you are copying most of the data from one table to a second, could you instead copy the full table using exp/imp and then remove the unwanted rows, which would be less work than copying?
Try dropping all indexes/constraints on your destination table and re-creating them after the data load.
Use the NOLOGGING attribute together with a direct-path insert (/*+ APPEND */) if you run in NOARCHIVELOG mode, or consider doing a backup right after the operation.
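A sketch of the drop-and-recreate approach, with hypothetical index and constraint names:
ALTER TABLE target_table DISABLE CONSTRAINT target_fk; -- hypothetical FK constraint
DROP INDEX target_idx; -- hypothetical index
INSERT /*+ APPEND */ INTO target_table SELECT * FROM source_table;
COMMIT;
CREATE INDEX target_idx ON target_table (col1);
ALTER TABLE target_table ENABLE CONSTRAINT target_fk;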
