Incremental load in Greenplum - ETL

I have external and internal tables in Greenplum. The external table points to a CSV file in HDFS. This CSV file in HDFS is reloaded with the full data of a table every hour.
What is the best way to load data incrementally into the internal table in Greenplum?

Create a dimension (control) table in Greenplum that stores how far the previous load got, such as a timestamp or any other data point.
Using that dimension table, you can write a UDF or procedure so that every hour, whenever a new file arrives, it is loaded into the stage/external table, and then, using the last-loaded parameters from the dimension table, only the relevant/new records are picked up for further processing.
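A rough sketch of this pattern in Greenplum SQL (etl_control, ext_stage, target_internal and updated_at are placeholder names):
-- Control table holding the high-water mark of the last successful load
-- (assumes a row for 'target_internal' is seeded once up front).
CREATE TABLE etl_control (table_name text PRIMARY KEY, last_loaded timestamp);
-- Pick up only the rows that are newer than the stored high-water mark.
INSERT INTO target_internal
SELECT s.*
FROM ext_stage s
WHERE s.updated_at > (SELECT last_loaded FROM etl_control WHERE table_name = 'target_internal');
-- Advance the high-water mark after a successful load.
UPDATE etl_control
SET last_loaded = (SELECT max(updated_at) FROM ext_stage)
WHERE table_name = 'target_internal';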
Thanks,
shobha

Related

Impact of Repeatedly Creating and Deleting Hive Table

I have a use case which requires around 200 Hive Parquet tables.
I need to load these Parquet tables from flat text files, but we cannot load a Parquet table directly from a flat text file.
So I am using the following approach (a HiveQL sketch follows the steps):
Created a temporary managed text table.
Loaded the temp table with text data.
Created an external Parquet table.
Loaded the Parquet table from the text table using a select query.
Deleted the text files for the temporary text table (but kept the table in the metastore).
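In HiveQL the steps look roughly like this (table names, columns and paths below are just placeholders):
-- 1. Temporary managed text table.
CREATE TABLE tmp_text_stage (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
-- 2. Load the flat text files into the staging table.
LOAD DATA INPATH '/landing/mytable/' INTO TABLE tmp_text_stage;
-- 3. External Parquet table.
CREATE EXTERNAL TABLE mytable_parquet (id INT, name STRING)
STORED AS PARQUET LOCATION '/warehouse/mytable_parquet';
-- 4. Populate the Parquet table from the text table.
INSERT OVERWRITE TABLE mytable_parquet SELECT * FROM tmp_text_stage;
-- 5. Either keep tmp_text_stage in the metastore, or DROP TABLE it (second approach).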
Since this approach keeps temporary metadata (for 200 tables) in the metastore, I have a second approach: drop the temporary text tables too, along with the text files from HDFS, and re-create the temporary tables next time, deleting them once the Parquet tables are created.
Now, as I need to follow the above steps for all 200 tables every 2 hours, will creating and deleting tables in the metastore impact anything in the cluster during production?
Which approach can impact production: keeping temporary metadata in the metastore, or creating and deleting tables (metadata) in the Hive metastore?
No, there is no impact; the backend of the Hive metastore should be able to handle 200 * n changes per hour easily. If you're unsure, start with 50 tables and monitor the backend DB performance.

Measure the time to load tables with data in Hive (is it possible?)

I created a table in Hive from data stored in HDFS with this command:
create external table users
(ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/data/tpch/users';
The data for this users table stored in HDFS is 10 GB, and the create table statement took just 1 second to create the table and load the data. So either this is strange, or it is really fast. My doubt is: can I measure the time to load tables with data in Hive using the command above with LOCATION, or does that command just create a reference to data stored in HDFS?
So what is the correct way to measure the time to load data into Hive tables?
Because 1 second seems really fast; MySQL or another relational database would probably need 30 minutes or more to load 10 GB of data into a table.
Your create table statement is pointing to external storage for the table, so Hive is not copying the data over. The documentation explains external tables like this:
External Tables
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
This is not 100% explicit, but the idea is that Hive is pointing to the table contents rather than managing them directly.
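If you want to time an actual data load, one option is to copy the external data into a managed table and measure that statement; a rough sketch (users_managed is a hypothetical table name):
-- Managed table: its data lives under hive.metastore.warehouse.dir.
CREATE TABLE users_managed
(ID INT, NAME STRING, ADRESS STRING, EMAIL STRING)
STORED AS TEXTFILE;
-- This statement actually reads the 10 GB and writes it into the managed
-- table, so its runtime reflects a real data load.
INSERT OVERWRITE TABLE users_managed SELECT * FROM users;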

HQL - How to copy/move data in a few partitions from one table to another

I have a table (main_table) which is partitioned and stores a history of records with a flag to indicate whether the record is deleted or not. I have another table (del_table), which has the same schema as main_table, but stores only the records deleted on a given day (delete_flag='Y').
As a process, I need to move records available in del_table to main_table on a daily basis. I am trying to write a LOAD DATA INPATH command which could move data available in the respective partitions of del_table to the corresponding partitions of main_table, but none of my attempts seem to work. Please let me know if it is possible to achieve this with a LOAD DATA INPATH command, without specifying individual partitions.
I am trying the steps below, but it fails at the 2nd step:
set the nonstrict Hive property:
LOAD DATA INPATH '...../del_table/' into table main_table partition(partition_col_name)
Do you have to use LOAD DATA INPATH?
If not, you could do the following: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
This would regenerate the whole table daily, though.
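Roughly, such a dynamic partition insert would look like this (assuming the partition column is named part_col; with SELECT * the partition column comes last, which is what dynamic partitioning expects):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Copy every partition of del_table into the matching partition of main_table.
INSERT INTO TABLE main_table PARTITION (part_col)
SELECT * FROM del_table;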
Try setting this property first:
set hive.exec.dynamic.partition.mode=nonstrict;
then run your command:
LOAD DATA INPATH '...../del_table/' into table main_table partition(partition_col_name)
For more info you can refer to this link: Partition

Updating Hive external table with HDFS changes

Let's say I created a Hive external table "myTable" from the file myFile.csv (located in HDFS).
myFile.csv changes every day, so I'm interested in updating "myTable" once a day too.
Is there any HiveQL query that tells it to update the table every day?
Thank you.
P.S.
I would like to know if it works the same way with directories: let's say I create a Hive partition from the HDFS directory "myDir", when "myDir" contains 10 files. The next day "myDir" contains 20 files (10 files were added). Should I update the Hive partition?
There are basically two types of tables in Hive.
One is the managed table, managed by the Hive warehouse: whenever you create such a table the data is copied into the internal warehouse, so you cannot have the latest data in the query output.
The other is the external table, for which Hive does not copy the data into its internal warehouse. Whenever you fire a query on the table, it retrieves the data from the file, so you can even have the latest data in the query output.
That is one of the goals of external tables.
You can even drop the table and the data is not lost.
If you add a LOCATION '/path/to/myFile.csv' clause to your table create statement, you shouldn't have to update anything in Hive. It will always use the latest version of the file in queries.
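A sketch, assuming myFile.csv sits in a hypothetical HDFS directory /data/myDir (the columns and delimiter are placeholders):
-- LOCATION points at the directory that contains myFile.csv.
CREATE EXTERNAL TABLE myTable (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/myDir';
-- Hive only stores a pointer to this location, so queries always read the
-- current contents of the directory, including files added later.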

Hadoop & Hive as warehouse: daily data deliveries

I am evaluating the combination of Hadoop & Hive (& Impala) as a replacement for a large data warehouse. I have already set up a version, and performance is great for read access.
Can somebody give me a hint about what concept should be used for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS. But now I have new transactional data coming in on a daily basis.
How do I add it to the table in Hive? Inserts are not possible. HDFS cannot append. So what's the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do to append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date-based searches, and it gives you the ability to use options 1, 2, & 3 at either the table or partition level.
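A rough sketch of options 1-3 against a hypothetical date-partitioned sales table:
-- 1. INSERT: append rows to an existing partition.
INSERT INTO TABLE sales PARTITION (ds='2016-01-02')
SELECT * FROM staging_sales;
-- 2. INSERT OVERWRITE: re-write a partition after reprocessing.
INSERT OVERWRITE TABLE sales PARTITION (ds='2016-01-02')
SELECT * FROM staging_sales;
-- 3. LOAD DATA: bulk-load files; OVERWRITE wipes any existing data in the partition first.
LOAD DATA INPATH '/landing/sales/2016-01-02'
OVERWRITE INTO TABLE sales PARTITION (ds='2016-01-02');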
Inserts are not possible
Inserts are possible; for example, you can create a new table and insert the data from the new table into the old table.
But the simple solution is that you can load the file's data into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite, the existing data is replaced with the new data; otherwise the data is only appended.
You can even schedule this by creating a shell script.
