tblproperties("skip.header.line.count"="1") added while creating table in hive is making some issue in Imapla - hadoop

I have created a table in Hive and need to load the data from a CSV file, so while creating the table I specified the table property tblproperties("skip.header.line.count"="1").
I have then loaded data into my table. This is my input file content:
After loading the data I am able to see the output in the Hive console. However, when I fetch data from the same table in Impala, it gives a problem: it does not skip the header as I specified at table-creation time.
The Impala result is as below.
Now my question is:
Why is Impala not able to pick up the table property and skip the header?
Please give me some information about it.
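For reference, a minimal sketch of the kind of table described above (column names and the CSV path are hypothetical); whether Impala honours this property depends on the Impala version in use:
CREATE TABLE csv_with_header (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

-- Load the CSV file; Hive skips the first line when reading, per the table property.
LOAD DATA LOCAL INPATH '/path/to/input.csv' INTO TABLE csv_with_header;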

Related

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly for performance but, more so, because "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE. With an external table, Hive only reads the data; it does not manage it, so adding the partition does not create or populate the directory for you. You need to create the folder yourself with HDFS commands, MapReduce, or Spark. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. It will give 0 rows at first, but you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
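Pulling the pieces of this answer together, a hedged sketch of the full flow (reusing the names and paths from the question) might look like:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'com.twitter.elephantbird.hive.serde.ProtobufDeserializer'
WITH SERDEPROPERTIES ("serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE
LOCATION '/test';

-- Each day's data lives in a directory named for the partition, e.g. /test/dt=20171117
ALTER TABLE test_messages ADD IF NOT EXISTS PARTITION (dt='20171117') LOCATION '/test/dt=20171117';

-- Or register every dt=... directory already present under /test in one pass
MSCK REPAIR TABLE test_messages;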

How to delete existing records that are already loaded using Hive

I load data daily into an external Hive table from the local file system, and the table now holds around one year of data. Today the client informed me that yesterday's data was incorrect. How do I delete yesterday's data from a table that already contains such a large amount of data?
You can only delete data from a Hive table by using Hive transaction management, but there are certain limitations:
1) The file format must be ORC.
2) The table must be bucketed.
3) Transactions cannot be enabled on an external table because it is outside the metastore's control.
By default the transaction management feature is off. You can turn it on by updating hive-site.xml.
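As a sketch of what this looks like once the transaction settings are enabled in hive-site.xml (table, column, and date values below are hypothetical), the table has to be ORC, bucketed, and marked transactional before DELETE works:
CREATE TABLE daily_load_acid (
  id INT,
  payload STRING,
  load_date STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Delete yesterday's incorrect data (the date value is just an example)
DELETE FROM daily_load_acid WHERE load_date = '2018-01-16';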

Incremental load in Greenplum

I have external and internal tables in Greenplum. The external table points to a CSV file in HDFS. That CSV file in HDFS is reloaded with the full data of a table every hour.
What is the best way to load the data incrementally into the internal table in Greenplum?
Create a dimension (control) table in Greenplum that stores how far the previous load got, for example a timestamp or some other watermark.
Using that dimension table, you can write a function so that every hour, whenever a new file arrives, it is loaded into the stage/external table, and then, using the last-loaded watermark from the dimension table, only the relevant/new records are picked up for further processing.
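A minimal SQL sketch of that pattern (all table and column names below are hypothetical):
-- Control table holding the high-water mark of the previous load
CREATE TABLE load_control (
  target_table text,
  last_loaded_ts timestamp
);

-- Pull only records newer than the watermark from the external (staging) table
INSERT INTO internal_table
SELECT s.*
FROM external_stage_table s
WHERE s.event_ts > (SELECT last_loaded_ts FROM load_control WHERE target_table = 'internal_table');

-- Advance the watermark after a successful load
UPDATE load_control
SET last_loaded_ts = (SELECT max(event_ts) FROM internal_table)
WHERE target_table = 'internal_table';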
Thanks,
shobha

Can Hive table automatically update when underlying directory is changed

If I build a Hive table on top of some S3 (or HDFS) directory like so:
create external table newtable (name string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';
When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?
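For the partition-based variant hinted at in the question, a hedged sketch (directory names are hypothetical) might look like this; note that new partition directories still have to be registered before Hive sees them:
create external table newtable_partitioned (name string)
partitioned by (dt string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';

-- register a newly added subdirectory such as s3a://location/subdir/dt=2018-01-17/
alter table newtable_partitioned add if not exists partition (dt='2018-01-17');

-- or discover all dt=... subdirectories in one pass
msck repair table newtable_partitioned;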
On HDFS, each file is scanned each time the table is queried, as @Dudu Markovitz pointed out. And files in HDFS are immediately consistent.
Update: S3 is also strongly consistent now, so I removed the part about eventual consistency.
Also there may be a problem with using statistics when querying table after adding files, see here: https://stackoverflow.com/a/39914232/2700344
Everything @leftjoin says is correct, with one extra detail: S3 doesn't offer immediate consistency on listings. A new blob can be uploaded, and HEAD/GET will return it, but a list operation on the parent path may not see it. This means that Hive code which lists the directory may not see the data. Using unique names doesn't fix this; only using a consistent database like DynamoDB, updated as files are added/removed, does. Even then, you have added a new thing to keep in sync...

Not able to create HIVE table with JSON format using SERDE

We are very new to Hadoop and Hive. We created a normal Hive table and loaded data as well. But we are facing an issue when we create a table in Hive with JSON format. I have added the SerDe jar as well. We get the following error:
create table airline_tables(Airline string, Airlineid string, Sourceairport string, Sourceairportid string, Destinationairport string, Destinationairportid string, Codeshare string, Stop string, Equipment string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/home/hduser/part-000';
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Location is HDFS.
I am using Hive-0.9.0 and hadoop 1.0.1.
As I can see, you are using a native (managed) Hive table, so in that case you need to load the data into the table. If you don't want to load the data, you can just point the table at that particular location in the table creation script. So I think you missed the keyword "EXTERNAL". Create the table again like this: create external table airline_tables(blah blah.....)
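Putting that suggestion together with the column list from the question, the corrected statement might look like the sketch below (keeping the asker's SerDe class and path, which are assumed to be correct for their environment):
create external table airline_tables(
  Airline string,
  Airlineid string,
  Sourceairport string,
  Sourceairportid string,
  Destinationairport string,
  Destinationairportid string,
  Codeshare string,
  Stop string,
  Equipment string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/home/hduser/part-000';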
