Impala partitioned table with HDFS - hadoop

I have data stored in HDFS in the format below, and I registered it in an Impala partitioned table using the "alter table add partition" command.
/user/impala/subscriber_data/year=2013/month=10/day=01
/user/impala/subscriber_data/year=2013/month=10/day=02
Everything works fine.
Now I have new data with month and day as 10 and 01. I need to process this data and append it to the existing HDFS directory (year=2013/month=10/day=01).
When I try to process it and write into that HDFS directory, I get an error saying the output directory already exists.
Is there any way to append the new data to the existing HDFS directory without deleting it?
Also, how do I insert the new data into an existing partition using Impala? (I have only one table, partitioned on year, month, day.)

To insert into an existing partition, you have to drop the existing partition and add it back with all the files that make up that partition, including your new data.
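As an alternative angle on the "output directory already exists" error: a job can write its output somewhere else and then move a uniquely named file into the existing partition directory, leaving the old files in place. The following is only a minimal sketch of that naming idea. It uses the local filesystem as a stand-in for HDFS, and the function name append_file_to_partition is hypothetical. In Impala, a REFRESH on the table is the usual follow-up so that the new file is picked up; that step is not shown here.

```python
import shutil
import uuid
from pathlib import Path


def append_file_to_partition(new_file: Path, partition_dir: Path) -> Path:
    """Move new_file into partition_dir under a collision-free name.

    Hypothetical helper: the local filesystem stands in for HDFS here.
    Existing files in partition_dir are left untouched.
    """
    partition_dir.mkdir(parents=True, exist_ok=True)
    # A random suffix keeps the new file from clashing with earlier output.
    target = partition_dir / f"part-{uuid.uuid4().hex}{new_file.suffix}"
    shutil.move(str(new_file), str(target))
    return target
```

The point of the unique name is that the partition directory itself is never recreated, so nothing that is already there has to be deleted.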

Related

How to safely append data into a partitioned Hive table?

I have a production Hive table partitioned by date. New data is generated hourly, and I need to merge it into the Hive table.
In case there are duplicate insertion requests or overlapping data among the hourly requests, I want to deduplicate each partition whenever I update it.
I reviewed the answer to "How to Append new data to already existing hive table", but I am still confused about a few things:
How should I merge the new pieces of data into the existing partition?
That is, should I create a tmp table for the new data, pull the existing data into the tmp table, deduplicate, and OVERWRITE the partition of the production table?
Could a "dirty read" occur while the partition of the production Hive table is being overwritten? Is there any solution to this?
I'm wondering if there's anything like an atomic RENAME.
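The merge-then-overwrite idea described in the question can be illustrated outside Hive. This is a minimal sketch, assuming each row has some deduplication key; the function name merge_dedup and the tuple rows are hypothetical and only stand in for the partition's existing data and the new hourly batch.

```python
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

Row = TypeVar("Row")


def merge_dedup(
    existing: Iterable[Row],
    incoming: Iterable[Row],
    key: Callable[[Row], Hashable],
) -> List[Row]:
    """Merge two batches, keeping one row per key.

    When a key appears more than once, the last row seen wins, so rows
    from `incoming` replace rows from `existing` with the same key.
    """
    merged: Dict[Hashable, Row] = {}
    for row in existing:
        merged[key(row)] = row
    for row in incoming:
        merged[key(row)] = row
    return list(merged.values())
```

The merged result is what would be written back over the partition in one step. Whether readers can observe a half-written partition during that step depends on how the overwrite is carried out, which this sketch does not address.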

Does an external Hive table refresh itself when a file is added to the directory it points to

I have a directory in HDFS; every day one processed file is placed in that directory with a DateTimeStamp in the file name. If I create an external table on top of that directory location, does the external table refresh itself when the daily file arrives in that directory?
If you add files into a table directory or partition directory, it does not matter whether the table is external or managed in Hive: the data will be accessible to queries. You do not need any additional steps to make the data available; no refresh is necessary.
A Hive table/partition is metadata (DDL, location, statistics, access permissions, etc.) plus data files in the location. So the data is stored in the table/partition location in HDFS.
Only if you create a new directory for a partition that has not been created yet do you need to execute ALTER TABLE table_name ADD PARTITION (partition_spec) LOCATION '<new location>' or the MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is: ALTER TABLE table_name RECOVER PARTITIONS.
If you add files into already created table/partition locations, no refresh is necessary.
The CBO can use statistics to compute query results without reading data files, for example count(*). This works for simple queries only, like count(*) or max().
If you are using the CBO with statistics for query calculation, you may need to refresh them using ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS. See this answer for more details: https://stackoverflow.com/a/39914232/2700344
If you do not need statistics and want your table location to be scanned every time you query it, switch it off: set hive.compute.query.using.stats=false;
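The rule in this answer (new files in a known location need nothing, a new partition directory needs a metadata step) can be written down as a small check. This is only a sketch under assumptions: the function name is hypothetical, and the set of already registered partition directories is supplied by the caller rather than read from a metastore.

```python
from pathlib import PurePosixPath
from typing import Set


def needs_partition_registration(
    file_path: str, table_location: str, registered: Set[str]
) -> bool:
    """Return True when file_path lands in a partition directory that is
    not yet registered, i.e. when an ADD PARTITION or MSCK REPAIR TABLE
    step would be needed before queries can see the file.

    `registered` holds partition directories relative to the table
    location, e.g. {"dt=2020-01-01"} (an assumed representation).
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_location))
    partition_dir = relative.parent.as_posix()
    if partition_dir == ".":
        # File sits directly in the table location: no partition involved.
        return False
    return partition_dir not in registered
```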

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. I am trying to partition by date both for performance and, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so you cannot add or drop partitions using Hive; you need to create the folder using HDFS, MR, or Spark. An EXTERNAL table can only be read by Hive, not managed by it. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. It will return 0 rows, but you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE:
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your partition directories need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
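To make the /test/dt=value requirement concrete, here is a minimal sketch that checks whether a directory under the table location follows the key=value layout and extracts the partition values from it. The function name partition_spec is hypothetical; it only mirrors the naming convention and does not talk to Hive.

```python
from pathlib import PurePosixPath
from typing import Dict


def partition_spec(table_location: str, partition_path: str) -> Dict[str, str]:
    """Parse key=value directory names below table_location.

    Raises ValueError for a directory such as /test/20171117, which does
    not follow the key=value layout.
    """
    relative = PurePosixPath(partition_path).relative_to(PurePosixPath(table_location))
    spec: Dict[str, str] = {}
    for part in relative.parts:
        name, sep, value = part.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"not a key=value partition directory: {part!r}")
        spec[name] = value
    return spec
```

With the paths from the question, '/test/dt=20171117' parses to a dt value of 20171117, while '/test/20171117' is rejected.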

Can Hive table automatically update when underlying directory is changed

If I build a Hive table on top of some S3 (or HDFS) directory like so:
create external table newtable (name string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';
When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?
On HDFS, each file is scanned each time the table is queried, as @Dudu Markovitz pointed out. Files in HDFS are immediately consistent.
Update: S3 is also strongly consistent now, so removed part about eventual consistency.
Also, there may be a problem with statistics when querying the table after adding files; see here: https://stackoverflow.com/a/39914232/2700344
Everything @leftjoin says is correct, with one extra detail: S3 doesn't offer immediate consistency on listings. A new blob can be uploaded and HEAD/GET will return it, but a list operation on the parent path may not see it. This means that Hive code which lists the directory may not see the data. Using unique names doesn't fix this; only a consistent DB like Dynamo, updated as files are added/removed, does. Even then, you have added a new thing to keep in sync...

Copying Hive managed table by copying partition directories into warehouse

I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have to OVERWRITE a YEAR, MONTH, DAY partition every time I ingest data for that date; I can just do a simple INSERT INTO and create a new INGESTION_KEY partition.
I need a year's worth of data in my new table to start, so I want to copy a year of partitions from my existing table to a new table. Rather than doing a Hive INSERT for each partition, I thought it would be quicker to use distcp to copy files into the new table's partition directories in the Hive warehouse directory in HDFS, then ADD PARTITION to the new table.
So, this is all I'm doing:
hadoop distcp /apps/hive/warehouse/src_db.db/src_tbl/year=2017/month=02/day=06 /apps/hive/warehouse/dest_db.db/dest_tbl/year=2017/month=02/day=06/ingestion_key=123
hive -e "ALTER TABLE dest_tbl ADD PARTITION (year=2017,month=02,day=06,ingestion_key='123')"
Both are managed tables, the new table dest_tbl is clustered by the same column into the same number of buckets as the src_tbl, and the only difference in schema is the addition of INGESTION_KEY.
So far my SELECT * FROM dest_tbl shows everything in the new table looking normal. So my question is: is there anything wrong with this approach? Is it bad to INSERT to a managed, bucketed table this way, or is this an acceptable alternative to INSERT if no transformations are being done on the copied data?
Thanks!!
Although I prefer copying with a Hive query just to keep it all in Hive, it's OK to copy data files using other tools, but...
There is a dedicated command that adds the new partitions' metadata. You can use it in place of alter table add partition, and it can add many partitions at once:
MSCK REPAIR TABLE dest_tbl;
Keep using the Hive default partitioning format: partitionKey=partitionValue
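When copying a year of partitions, the destination directory and the matching ADD PARTITION statement can be generated from one partition spec so the two cannot drift apart. This is a hedged sketch: both function names are hypothetical, and quoting every value as a string is an assumption about the partition column types.

```python
from typing import Dict


def partition_dir(table_root: str, spec: Dict[str, str]) -> str:
    """Build the partitionKey=partitionValue directory for a spec.

    The dict's insertion order is taken as the partition column order.
    """
    suffix = "/".join(f"{key}={value}" for key, value in spec.items())
    return f"{table_root.rstrip('/')}/{suffix}"


def add_partition_statement(table: str, spec: Dict[str, str]) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for the same spec.

    Assumption: every value is quoted as a string literal.
    """
    columns = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({columns})"
```

For the example in the question, the spec would be year=2017, month=02, day=06, ingestion_key=123, with the first function giving the distcp destination and the second the statement to pass to hive -e.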
