Is there a way to prevent a Hive table from being overwritten if the SELECT query of the INSERT OVERWRITE does not return any results - hadoop

I am developing a batch job that loads data into Hive tables from HDFS files. The flow of data is as follows:
1. Read the file received in HDFS using an external Hive table.
2. INSERT OVERWRITE the final Hive table from the external Hive table, applying certain transformations.
3. Move the received file to Archive.
This flow works fine if there is a file in the input directory for the external table to read during step 1.
If there is no file, the external table will be empty and as a result executing step 2 will empty the final table. If the external table is empty, I would like to keep the existing data in the final table (the data loaded during the previous execution).
Is there a Hive property that I can set so that the final table is overwritten only if we are overwriting it with some data?
I know that I can check whether the input file exists using an HDFS command and conditionally launch the Hive requests. But I am wondering if I can achieve the same behavior directly in Hive, which would let me avoid this extra verification.

Try adding a dummy partition column to your table, say LOAD_TAG, and use a dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE your_table PARTITION(LOAD_TAG)
select
col1,
...
colN,
'dummy_value' as LOAD_TAG
from source_table;
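The partition value should always be the same in your case. With a dynamic partition insert, Hive overwrites only the partitions that appear in the query result, so an empty result leaves the existing partition untouched.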
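To illustrate why the dummy partition helps, here is a toy model in Python. It is not Hive code: the function names and the dict-based "table" are assumptions made up for this sketch. It only mimics the difference between overwriting a whole table and overwriting just the partitions that receive rows.

```python
# Toy model only: a partitioned table is a dict of partition value -> list of rows.
# Function names are hypothetical and exist purely for illustration.

def insert_overwrite_unpartitioned(existing_rows, new_rows):
    """Plain INSERT OVERWRITE: old contents are replaced, even by an empty result."""
    return list(new_rows)


def insert_overwrite_dynamic(partitions, new_rows, partition_key):
    """Dynamic-partition INSERT OVERWRITE: only partitions that receive rows are replaced."""
    incoming = {}
    for row in new_rows:
        incoming.setdefault(row[partition_key], []).append(row)
    result = dict(partitions)   # partitions without incoming rows are kept as-is
    result.update(incoming)     # partitions with incoming rows are overwritten
    return result


previous_load = [{"col1": 1, "load_tag": "dummy_value"}]

# Empty source: the unpartitioned table is emptied, the partitioned one is preserved.
print(insert_overwrite_unpartitioned(previous_load, []))
print(insert_overwrite_dynamic({"dummy_value": previous_load}, [], "load_tag"))
```

When the source does return rows, they all carry the same LOAD_TAG value, so the single partition is replaced as a whole, which matches the original overwrite behavior.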

Related

data deleted from hdfs after using hive load command [duplicate]

When loading data from HDFS into Hive using the
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it moves hdfs_file to the hive/warehouse directory.
Is it possible (and if so, how) to copy the file instead of moving it, so that it can still be used by another process?
From your question I assume that you already have your data in HDFS.
So you don't need to LOAD DATA, which moves the files to the default Hive location /user/hive/warehouse. You can simply define the table using the external keyword, which leaves the files in place but creates the table definition in the Hive metastore. See here:
Create Table DDL
e.g.:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table and initially no data is present (assuming your data location is different from the Hive 'LOCATION').
When you use the 'LOAD DATA INPATH' command, the data gets MOVED (not copied) from the data location to the location that you specified when creating the Hive table.
If no location is given when you create the Hive table, it uses the internal Hive warehouse location, and the data is moved from your source data location to the internal Hive warehouse location (i.e. /user/hive/warehouse/).
There is an alternative to 'LOAD DATA' in which the data is not moved from your existing source location to the Hive warehouse location.
You can use the ALTER TABLE command with the 'LOCATION' option. The required command is:
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition is that the location must be a directory, not a file.
Hope this will solve the problem.

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. I am trying to partition by date partly for performance but, more importantly, because the "LOAD DATA ... " statements move the files between directories in HDFS.
P.S. I have verified that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so you cannot add or drop partitions using Hive; you need to create the folder using HDFS, MapReduce, or Spark. An external table's data can be read by Hive but is not managed by it. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then query the table. It will return 0 rows at first, but you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE:
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the elephantbird library works, but you'll want to double check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
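The directory naming convention mentioned above can be sketched in Python as follows. The helper names are hypothetical and only illustrate the column=value layout that partition discovery relies on; they are not part of Hive or any Hadoop client library.

```python
# Hypothetical helpers illustrating the column=value partition directory layout.

def partition_dir(table_location, column, value):
    """Build the directory a partition is expected to live in under the table location."""
    return f"{table_location.rstrip('/')}/{column}={value}"


def parse_partition_dir(path):
    """Return (column, value) if the last path component looks like column=value, else None."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    column, sep, value = name.partition("=")
    if not sep:
        return None
    return column, value


print(partition_dir("/test", "dt", "20171117"))   # the layout described above
print(parse_partition_dir("/test/dt=20171117"))   # recognised as a partition directory
print(parse_partition_dir("/test/20171117"))      # the question's path: no column name
```

Under this convention, the directory /test/20171117 from the question carries no column name, so it would have to be renamed to /test/dt=20171117 or registered explicitly with ALTER TABLE ... ADD PARTITION ... LOCATION.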

How to force CTAS to generate a single file?

I'm using HDP 2.5 with the Hive service. I create a Hive table using the query below:
create table Sample_table
row format delimited
fields terminated by '|'
stored as textfile
AS
select *
from sample_table_unique
where state='AL';
I can also create an external table with a specific location.
My question is: when I create the table (managed or external), the stored data is split into several files, like this:
/apps/hive/warehouse/sampledb/sample_table:
00000_0,
00001_0,
00002_0,
00003_0,
I don't want these split files; I want one merged file, like 00000_0. I don't know why this happens. Please tell me how to resolve this issue.
The SELECT statement runs a map-only or MapReduce job (depending on the query) to write data into the target table sample_table from the source table sample_table_unique.
The number of files generated varies with the number of tasks.
To merge them into one, you can set these properties either for the session or permanently in hive-site.xml:
hive> SET hive.merge.mapfiles=true;
hive> SET hive.merge.mapredfiles=true;
hive> SET hive.merge.smallfiles.avgsize=16000000;
hive> SET hive.merge.size.per.task=256000000;
If you are using the Tez execution engine, use
hive> SET hive.merge.tezfiles=true;
instead of hive.merge.mapfiles and hive.merge.mapredfiles.
When the average output file size of a job is less than the hive.merge.smallfiles.avgsize value, Hive starts an additional map-reduce job to merge the output files into bigger files.
The values shown for hive.merge.smallfiles.avgsize and hive.merge.size.per.task are the defaults; adjust them according to your input size.
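The merge trigger described above can be modelled with a few lines of Python. This is a simplified sketch of the rule as stated in the answer (average output file size compared with hive.merge.smallfiles.avgsize), not Hive's actual implementation; the function name and the example file sizes are made up.

```python
# Simplified model of the merge trigger; not Hive's real code.
DEFAULT_AVG_SIZE = 16_000_000  # same value as hive.merge.smallfiles.avgsize above


def needs_merge(file_sizes, avg_threshold=DEFAULT_AVG_SIZE):
    """Return True when the average output file size is below the threshold."""
    if not file_sizes:
        return False  # nothing was written, so there is nothing to merge
    average = sum(file_sizes) / len(file_sizes)
    return average < avg_threshold


# Four small output files (hypothetical sizes in bytes) fall below the threshold.
print(needs_merge([1_000_000, 2_000_000, 1_500_000, 500_000]))
# Two large output files do not.
print(needs_merge([300_000_000, 200_000_000]))
```

If the files are still not merged, raising the threshold above the actual average file size is the lever this model suggests.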

HQL - How to Copy/Move data in few partitions from one table to another

I have a table (main_table) which is partitioned and stores the history of records, with a flag to indicate whether each record is deleted. I have another table (del_table), which has the same schema as main_table but stores only the records deleted on a given day (delete_flag='Y').
As part of a daily process I need to move the records in del_table to main_table. I am trying to write a LOAD DATA INPATH command that moves the data in the partitions of del_table to the corresponding partitions of main_table, but none of my attempts seems to work. Please let me know if it is possible to achieve this with the LOAD DATA INPATH command, without specifying individual partitions.
I am trying the steps below, but it fails at the second step:
Set the nonstrict Hive property:
LOAD DATA INPATH '...../del_table/' into table main_table partition(partition_col_name)
Do you have to use LOAD DATA INPATH?
If not, you could use a dynamic partition insert: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-DynamicpartitionInsert
This would regenerate the whole table daily, though.
Try setting this property first:
set hive.exec.dynamic.partition.mode=nonstrict;
then run your command:
LOAD DATA INPATH '...../del_table/' into table main_table partition(partition_col_name)
For more info you can refer to this link: Partition

Hive loading in partitioned table

I have a log file in HDFS, values are delimited by comma. For example:
2012-10-11 12:00,opened_browser,userid111,deviceid222
Now I want to load this file into a Hive table which has the columns "timestamp" and "action" and is partitioned by "userid" and "deviceid". How can I ask Hive to take the last two columns in the log file as the partition for the table? All examples, e.g. "hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');", require the partitions to be defined in the script, but I want the partitions to be set up automatically from the HDFS file.
One solution is to create an intermediate non-partitioned table with all four columns, populate it from the file, and then run an INSERT into first_table PARTITION (userid,deviceid) select from intermediate_table timestamp,action,userid,deviceid; but that is an additional task, and we would have two very similar tables. Or should we create an external table as the intermediate one?
Ning Zhang has a great response on the topic at http://grokbase.com/t/hive/user/114frbfg0y/can-i-use-hive-dynamic-partition-while-loading-data-into-tables.
The quick context is that:
LOAD DATA simply copies or moves the files; it does not read their contents, so it cannot work out which partition each row belongs to.
I would suggest that you load the data into an intermediate table first (or use an external table pointing to all the files) and then let a dynamic partition insert load it into the partitioned table.
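As a rough illustration of what the dynamic partition insert does with the staged rows, the following Python sketch reads the comma-delimited log line from the question and groups it by its last two fields. The function name and the dict it returns are assumptions for this example; Hive itself performs this routing when the INSERT runs.

```python
import csv


def dynamic_partition_insert(lines, data_columns, partition_columns):
    """Group staged rows by their partition column values, as a dynamic insert would."""
    columns = data_columns + partition_columns
    partitions = {}
    for fields in csv.reader(lines):
        row = dict(zip(columns, fields))
        # Partition spec in the column=value/column=value directory style.
        spec = "/".join(f"{col}={row[col]}" for col in partition_columns)
        # Only the non-partition columns are stored inside the partition's data.
        partitions.setdefault(spec, []).append([row[col] for col in data_columns])
    return partitions


log_lines = ["2012-10-11 12:00,opened_browser,userid111,deviceid222"]
result = dynamic_partition_insert(
    log_lines, ["timestamp", "action"], ["userid", "deviceid"]
)
print(result)
```

The partition values are taken from the data itself, which is exactly what LOAD DATA cannot do because it never parses the file.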
As mentioned in @Denny Lee's answer, we need to involve a staging table (invites_stg), managed or external, and then INSERT from the staging table into the partitioned table (invites in this case).
Make sure these two properties are set:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
And finally, insert into the partitioned table:
INSERT OVERWRITE TABLE India PARTITION (STATE) SELECT COL's FROM invites_stg;
Refer this link for help: http://www.edupristine.com/blog/hive-partitions-example
I worked this very same scenario, but instead, what we did is create separate HDFS data files for each partition you need to load.
Since our data is coming from a MapReduce job, we used MultipleOutputs in our Reducer class to multiplex the data into their corresponding partition file. Afterwards, it is just a matter of building the script using the Partition from the HDFS file name.
How about
LOAD DATA INPATH '/path/to/HDFS/dir/file.csv' OVERWRITE INTO TABLE DB.EXAMPLE_TABLE PARTITION (PARTITION_COL_NAME='PARTITION_VALUE');
CREATE TABLE India (
OFFICE_NAME STRING,
OFFICE_STATUS STRING,
PINCODE INT,
TELEPHONE BIGINT,
TALUK STRING,
DISTRICT STRING,
POSTAL_DIVISION STRING,
POSTAL_REGION STRING,
POSTAL_CIRCLE STRING
)
PARTITIONED BY (STATE STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Instruct Hive to load partitions dynamically:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

Resources