I have a production Hive table partitioned by date. New data is generated hourly, and I need to merge the new data into the Hive table.
In case there are duplicate insertion requests or data overlap among the hourly requests, I want to dedup each partition whenever I update it.
I reviewed the answer to How to Append new data to already existing hive table, but I still have some confusion:
How should I merge the new data pieces into the existing partition?
I mean, should I create a tmp table for the new data, pull the existing partition data into the tmp table, dedup there, and then INSERT OVERWRITE the result back into the partition of the production table?
Is it possible "dirty read" could occur during the overwriting of the partition of the production hive table? Is there any solution to this?
I'm wondering if there's anything like atomic RENAME.
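For concreteness, here is roughly the flow I have in mind (the table and column names below are just placeholders, not my real schema):
-- Hypothetical sketch: collect new + existing rows for one date partition,
-- keep the latest version of each id, then overwrite that partition.
CREATE TABLE tmp_staging AS
SELECT id, payload, ts FROM new_hourly_data WHERE dt = '2017-02-06'
UNION ALL
SELECT id, payload, ts FROM prod_table WHERE dt = '2017-02-06';

INSERT OVERWRITE TABLE prod_table PARTITION (dt = '2017-02-06')
SELECT id, payload, ts
FROM (
  SELECT id, payload, ts,
         row_number() OVER (PARTITION BY id ORDER BY ts DESC) AS rn
  FROM tmp_staging
) t
WHERE rn = 1;

DROP TABLE tmp_staging;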
We are stuck with a problem wherein we are trying to do a near-real-time sync between an RDBMS (source) and Hive (target). Basically, the source pushes its changes (inserts, updates and deletes) into HDFS as Avro files. These are loaded into external tables (with the Avro schema) in Hive. There is also a base table in ORC, which holds all the records that arrived before the source pushed in the new set of records.
Once the data is received, we have to do a de-duplication (since there could be updates on existing rows) and remove all deleted records (since there could be deletes from the Source).
We are now performing the de-dupe using rank() over the partitioning keys on the union of the external table and the base table. The result is then pushed into a new table and the table names are swapped. This is taking a lot of time.
We tried using merges and ACID transactions, but rank over partition and then filtering out the unwanted rows has given us the best possible time at this moment.
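For reference, the de-dup query is roughly of this shape (the table and column names here are placeholders for our real ones):
-- Keep the latest version of each key from the union of new and base data,
-- and drop rows whose latest operation is a delete.
INSERT OVERWRITE TABLE base_orc_new
SELECT id, payload, op_ts
FROM (
  SELECT id, payload, op_ts, op_type,
         rank() OVER (PARTITION BY id ORDER BY op_ts DESC) AS rnk
  FROM (
    SELECT id, payload, op_ts, op_type FROM ext_avro_table
    UNION ALL
    SELECT id, payload, op_ts, 'BASE' AS op_type FROM base_orc_table
  ) u
) r
WHERE rnk = 1 AND op_type <> 'D';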
Is there a better way of doing this? Any suggestions for improving the process altogether? We have quite a few tables, so we do not have any partitions or buckets at this moment.
You can try storing all the transactional data in an HBase table.
Storing data in an HBase table using the primary key of the RDBMS table as the row key:
Once you pull all the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output of those processors will be in Avro format.
You can use the ConvertAvroToJSON processor and then the SplitJson processor to split each record out of the array of JSON records.
Store all the records in the HBase table, with the row key being the primary key from the RDBMS table.
When we get an incremental load based on the last-modified-date field, we will receive both updated records and newly added records from the RDBMS table.
If we get an update for an existing row key, HBase will overwrite the existing data for that record; newly added records will be added as new rows in the table.
Then, by using Hive-HBase integration, you can expose the HBase table data through Hive.
https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
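A minimal sketch of such a mapping (the HBase table name, column family and columns here are assumptions, not taken from your setup):
-- Hive table backed by the HBase table holding the upserted records
CREATE EXTERNAL TABLE hbase_txn_hive (
  rowkey STRING,           -- primary key from the RDBMS table
  payload STRING,
  last_modified STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:payload,cf:last_modified")
TBLPROPERTIES ("hbase.table.name" = "txn_table");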
By using this method we will have an HBase table that takes care of all the upsert operations. However, you cannot expect the same performance from a Hive-on-HBase table as from a native Hive table; the native table will be faster, because HBase tables are not meant for SQL-style queries. An HBase table is most efficient when you access data by row key.
If we are going to have millions of records, then we need to do some tuning of the Hive queries:
Tuning Hive Queries That Uses Underlying HBase Table
I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have to OVERWRITE a YEAR, MONTH, DAY partition every time I ingest data for that date; I can just do a simple INSERT INTO and create a new INGESTION_KEY partition.
I need a year's worth of data in my new table to start, so I want to copy a year of partitions from my existing table to a new table. Rather than doing a Hive INSERT for each partition, I thought it would be quicker to use distcp to copy files into the new table's partition directories in the Hive warehouse directory in HDFS, then ADD PARTITION to the new table.
So, this is all I'm doing:
hadoop distcp /apps/hive/warehouse/src_db.db/src_tbl/year=2017/month=02/day=06 /apps/hive/warehouse/dest_db.db/dest_tbl/year=2017/month=02/day=06/ingestion_key=123
hive -e "ALTER TABLE dest_tbl ADD PARTITION (year=2017,month=02,day=06,ingestion_key='123')"
Both are managed tables; the new table dest_tbl is clustered by the same column into the same number of buckets as src_tbl, and the only difference in schema is the addition of INGESTION_KEY.
So far my SELECT * FROM dest_tbl shows everything in the new table looking normal. So my question is: is there anything wrong with this approach? Is it bad to INSERT to a managed, bucketed table this way, or is this an acceptable alternative to INSERT if no transformations are being done on the copied data?
Thanks!!
Although I prefer copying via a Hive query just to keep it all in Hive, it's OK to copy the data files using other tools, but:
There is a dedicated command that adds the new partitions' metadata; you can use it in place of ALTER TABLE ... ADD PARTITION, and it can add many partitions at once:
MSCK REPAIR TABLE dest_tbl;
Keep using Hive's default partitioning format: partitionKey=partitionValue
I have a partitioned table in Greenplum (modeled after psql), which has been partitioned with a specific range of values.
Now I have to insert data into the same table again. The new partition values might overlap with the existing ones. I have created an ALTER command with new start and end dates, but if there are overlaps, the command fails. So I need to create a partition for each date in order to avoid the whole command failing.
Just wondering if there is a way in Greenplum to create partitions automatically based on the inserted data, just like Hive does.
Thanks for your help.
Greenplum does not (currently) create additional partitions for data which does not fit into an existing partition.
If you have a default partition on the table it will receive all the records which do not fit into one of the specified partitions. You can then use ALTER TABLE ... SPLIT DEFAULT PARTITION (see the documentation if required) to create the new partitions for any new dates at the end of the load batch.
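A minimal sketch of that split, assuming a date-range-partitioned table named sales and a new date of 2017-02-06 (both the names and the values are placeholders):
-- Move the rows for the new date out of the default partition
-- into a partition of their own.
ALTER TABLE sales SPLIT DEFAULT PARTITION
  START ('2017-02-06') INCLUSIVE
  END ('2017-02-07') EXCLUSIVE
  INTO (PARTITION p20170206, DEFAULT PARTITION);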
I am evaluating the combination of Hadoop and Hive (and Impala) as a replacement for a large data warehouse. I have already set up a version, and read performance is great.
Can somebody give me any hint what concept should be used for daily data deliveries to a table?
I have a table in Hive based on a file I put into HDFS. But now new transactional data is coming in on a daily basis.
How do I add it to the table in Hive?
Inserts are not possible and HDFS cannot append, so what's the general concept I need to follow?
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do to append data:
1. INSERT - You can just append rows to an existing table.
2. INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
3. LOAD DATA - You can use this to bulk-insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
4. Partition your data.
5. Load data into a new table and swap the partition in.
Partitioning is great if you know you're going to be performing date-based searches, and it gives you the ability to use options 1, 2, and 3 at either the table or partition level.
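A minimal HiveQL sketch of options 1-3, assuming a table named events partitioned by dt (all names here are illustrative only):
-- 1. INSERT: append rows to an existing partition
INSERT INTO TABLE events PARTITION (dt = '2017-02-06')
SELECT id, payload FROM staging_events;

-- 2. INSERT OVERWRITE: re-write a partition after processing
INSERT OVERWRITE TABLE events PARTITION (dt = '2017-02-06')
SELECT DISTINCT id, payload FROM staging_events;

-- 3. LOAD DATA: bulk-load files; OVERWRITE replaces the existing data
LOAD DATA INPATH '/data/incoming/2017-02-06'
OVERWRITE INTO TABLE events PARTITION (dt = '2017-02-06');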
Inserts are not possible
Inserts are possible; for example, you can create a new table and insert the data from the new table into the old table.
But the simpler solution is to load the data file into the Hive table with the command below.
load data inpath '/filepath' [overwrite] into table tablename;
If you use OVERWRITE, then the existing data is replaced with the new data; otherwise the data is appended.
You can even schedule the load by creating a shell script.
So, I have a .NET program doing batch loading of records into partitioned tables using array bound stored procedure calls via Oracle ODP.NET, but that's neither here nor there.
What I would like to know is: because I have a partitioned index on said tables, the speed of the batch load is pretty slow. I fully understand that I cannot drop an index partition, but I would obviously prefer not to have to drop and rebuild the entire index since that will take considerably more time to execute. Is this my only recourse?
Is there a fairly simple way to drop the partition itself and then rebuild the partition and index partition that would save time and go about accomplishing my goal?
Are you loading an entire partition at once? Or are you merely adding new rows to an existing partition? Are all the indexes equipartitioned with the table?
Normally, if you are loading data into a partitioned table, your partitioning scheme is chosen so that each load will put data into a fresh partition. If that is the case, you can use partition exchange to load the data. In a nutshell, you load data into an (unindexed) staging table whose structure matches the real table, you create the indexes to match the indexes on the real table, and then do
ALTER TABLE partitioned_table
EXCHANGE PARTITION new_partition_name
WITH TABLE staging_table_name
WITHOUT VALIDATION;
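For context, the steps around that statement would look roughly like this (the staging and index names are placeholders; adjust them to your schema):
-- Staging table with the same structure as the partitioned table, no rows
CREATE TABLE staging_table_name AS
SELECT * FROM partitioned_table WHERE 1 = 0;

-- ... bulk-load into staging_table_name (e.g. via the ODP.NET array binds) ...

-- Build indexes on the staging table that match the local index partitions
CREATE INDEX staging_ix ON staging_table_name (indexed_col);

-- Then run the EXCHANGE PARTITION shown above; you can add INCLUDING INDEXES
-- before WITHOUT VALIDATION if you want the staging indexes swapped in too.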