Can in insert data multiple times into a bucketed hive table - hadoop

I have a bucketed hive table. It has 4 buckets.
CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;
Initially i have inserted some records into this table using the following query.
set hive.enforce.bucketing = true;
insert into user
select * from second_user;
After this operation In HDFS I see that 4 files are created under this table dir.
Again i needed to insert another set of data into user table. So i ran the below query.
set hive.enforce.bucketing = true;
insert into user
select * from third_user;
Now another 4 files are crated under user folder dir. Now it has total 8 files.
Is this fine to do this kind of multiple inserts into a bucketed table?
Does it affect the bucketing of the table?

I figured it out!!
Actually if you do multiple inserts on a bucketed hive table. Hive wont complain as such.
All hive queries will work fine.
Having said that, Such operation spoils the bucketing concept of the table. I mean after multiple inserts into a bucketed table the sampling fails.
The TABLASAMPLE doesnt work properly after multiple inserts.
Even sort merge bucket map join also doesnt work after such operation.

I dont think that should be a issue because you have declared that you want bucketing on user_id. so every time you would insert it will create 4 more files.
Bucketing is used for faster query processing so if it is making 4 more files everytime it will be making your query processing even faster.

Related

Hive scanning entire data for bucketed table

I was trying to optimize a hive SQL by bucketing the data on a single column. I created the table with following statement
CREATE TABLE `source_bckt`(
`uk` string,
`data` string)
CLUSTERED BY(uk) SORTED BY(uk) INTO 10 BUCKETS
Then inserted the data after executing "set hive.enforce.bucketing = true;"
When I run the following select "select * from source_bckt where uk='1179724';"
Even though the data is supposed to be in a single file which can be identified by the following equation HASH('1179724')%10 the mapreduce spawned scans through the entire set of files.
Any idea?
This optimization is not supported yet.
Current JIRA ticket status is PATCH AVAILABLE
https://issues.apache.org/jira/browse/HIVE-5831

Copying Hive managed table by copying partition directories into warehouse

I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have to OVERWRITE a YEAR, MONTH, DAY partition every time I ingest data for that date; I can just do a simple INSERT INTO and create a new INGESTION_KEY partition.
I need a year's worth of data in my new table to start, so I want to copy a year of partitions from my existing table to a new table. Rather than doing a Hive INSERT for each partition, I thought it would be quicker to use distcp to copy files into the new table's partition directories in the Hive warehouse directory in HDFS, then ADD PARTITION to the new table.
So, this is all I'm doing:
hadoop distcp /apps/hive/warehouse/src_db.db/src_tbl/year=2017/month=02/day=06 /apps/hive/warehouse/dest_db.db/dest_tbl/year=2017/month=02/day=06/ingestion_key=123
hive -e "ALTER TABLE dest_tbl ADD PARTITION (year=2017,month=02,day=06,ingestion_key='123')"
Both are managed tables, the new table dest_tbl is clustered by the same column into the same number of buckets as the src_tbl, and the only difference in schema is the addition of INGESTION_KEY.
So far my SELECT * FROM dest_tbl shows everything in the new table looking normal. So my question is: is there anything wrong with this approach? Is it bad to INSERT to a managed, bucketed table this way, or is this an acceptable alternative to INSERT if no transformations are being done on the copied data?
Thanks!!
Although i prefer copying by Hive query just to make it all in Hive, but it's ok to copy data files using other tools, but ..
There is a dedicated command that add the new partitions metadata, you can use it in place of alter table add partition.., and it can add many partitions at once :
MSCK REPAIR TABLE dest_tbl;
Keep using Hive default partitioning format : partionKey=partitionValue

How Hive Partition works

I wanna know how hive partitioning works I know the concept but I am trying to understand how its working and store the in exact partition.
Let say I have a table and I have created partition on year its dynamic, ingested data from 2013 so how hive create partition and store the exact data in exact partition.
If the table is not partitioned, all the data is stored in one directory without order. If the table is partitioned(eg. by year) data are stored separately in different directories. Each directory is corresponding to one year.
For a non-partitioned table, when you want to fetch the data of year=2010, hive have to scan the whole table to find out the 2010-records. If the table is partitioned, hive just go to the year=2010 directory. More faster and IO efficient
Hive organizes tables into partitions. It is a way of dividing a table into related parts based on the values of partitioned columns such as date.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria.
Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of hash function of some column of a table.
Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time.

Refresh one hive table from another hive table

I have a few Hive tables that i am bringing in from RDBMS using Sqoop incremental imports every hour and staging them. I am joining these tables and creating new dimension tables. Whenever i bring in new rows from RDBMS into Hive staging tables, I have to refresh the dimension tables. If there are no new rows, the refresh of dim tables should not be done. The hive version I'm using does not have ACID features.
Need some advice on how this could be achieved in hive.
You can INSERT new data in existing Hive tables, like any other database. And Hive also supports the WHERE NOT EXISTS clause.
INSERT INTO TABLE MyDim
SELECT Id, Blah1, Blah2
FROM MySource s
WHERE NOT EXISTS
(SELECT 1 FROM MyDim z WHERE z.Id =s.Id)
But there is a catch: each INSERT will create a new HDFS file, even when there are zero records involved. Too much fragmentation will reduce performance over time.
A weekly "compaction" job would be helpful (e.g. rename the fragmented table, re-CREATE the table, INSERT OVERWRITE from renamed table, drop renamed)

Hadoop & Hive as warehouse: daily data deliveries

I am evaluating the combination of hadoop & hive (& impala) as a repolacement for a large data warehouse. I already set up a version and performance is great in read access.
Can somebody give me any hint what concept should be used for daily data deliveries to a table?
I have a table in hive based on a file I put into hdfs. But now I have on a daily basis new transactional data coming in.
How do I add them ti the table in hive.
Inserts are not possible. HDFS cannot append. So whats the gernal concept I need to follow.
Any advice or direction to documentation is appreciated.
Best regards!
Hive allows for data to be appended to a table - the underlying implementation of how this happens in HDFS doesn't matter. There are a number of things you can do append data:
INSERT - You can just append rows to an existing table.
INSERT OVERWRITE - If you have to process data, you can perform an INSERT OVERWRITE to re-write a table or partition.
LOAD DATA - You can use this to bulk insert data into a table and, optionally, use the OVERWRITE keyword to wipe out any existing data.
Partition your data.
Load data into a new table and swap the partition in
Partitioning is great if you know you're going to be performing date based searches and gives you the ability to use options 1, 2, & 3 at either the table or partition level.
Inserts are not possible
Inserts are possible ,like you can create a new table and insert the data from new table to old table.
But simple solution is You can load data of the file into Hive table with the below command.
load data inpath '/filepath' [overwrite] into table tablename;
If you use overwrite then only existing data replced with new data otherwise It is appending only.
You can even schedule the script by creating a shell script.

Resources