Not able to create HIVE table with JSON format using SERDE - hadoop

We are very new to Hadoop and Hive. We created a normal Hive table and loaded data into it without problems, but we are facing an issue when creating a table in Hive with JSON format. I have added the SerDe JAR as well. Here is the statement and the error we get:
create table airline_tables(
  Airline string, Airlineid string, Sourceairport string, Sourceairportid string,
  Destinationairport string, Destinationairportid string, Codeshare string,
  Stop string, Equipment string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/home/hduser/part-000';
FAILED: Error in metadata: java.lang.NullPointerException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
The location is on HDFS.
I am using Hive 0.9.0 and Hadoop 1.0.1.

As far as I can see, you are creating a native (managed) Hive table. In that case you need to load the data into the table. If you don't want to load the data and instead just want to point the table at files already sitting at that location, then you missed the keyword "EXTERNAL". Recreate it as an external table, e.g. as sketched below.
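A sketch of the corrected statement, reusing the column list, SerDe class, and path from the question (note that LOCATION is expected to be an HDFS directory, so adjust the path if it currently points at a single file):

CREATE EXTERNAL TABLE airline_tables(
  Airline string, Airlineid string, Sourceairport string, Sourceairportid string,
  Destinationairport string, Destinationairportid string, Codeshare string,
  Stop string, Equipment string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/home/hduser/part-000';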

Related

Hive Table Creation based on file structure

I have one doubt: is there any way in Hive to create the table during load, either into the Hive warehouse or as an external table?
As I understand it, Hive is based on schema-on-read, so the table structure must match the file structure. But what if the file is huge and we don't know its structure, i.e. the columns and their datatypes?
How can such files be loaded into a Hive table?
In short: how do you load a file from HDFS into a Hive table without knowing its schema?
New to Hive, so pardon me if my understanding is wrong.
Thanks
By using Sqoop you can create the Hive table while importing the data.
Please refer to this link on creating a Hive table while importing data.
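For example, a hypothetical Sqoop invocation (the JDBC URL, database, and table names are placeholders) that imports a table and creates the matching Hive table in one step:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/flightdb \
  --username hive_user -P \
  --table routes \
  --hive-import \
  --create-hive-table \
  --hive-table routes \
  -m 1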
(or)
if you have imported the data in Avro format, you can generate the Avro schema using the avro-tools JAR (e.g. /usr/bin/Avro/avro-tools-*.jar) and then use the generated schema while creating the table in Hive; Hive then uses that schema to read the data from HDFS.
Please refer to this link on extracting the schema from an Avro data file.
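As a sketch, the avro-tools getschema command dumps the schema from one of the imported files, and the resulting .avsc can be pushed to HDFS and referenced from avro.schema.url (file names and paths here are hypothetical):

java -jar /usr/bin/Avro/avro-tools-*.jar getschema part-m-00000.avro > routes.avsc
hadoop fs -put routes.avsc /user/hduser/schemas/routes.avsc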
(or)
While importing data using sqoop --as-avrodatafile, Sqoop also creates an .avsc file with the schema in it, so we can use this .avsc file when creating the table:
CREATE EXTERNAL TABLE avro_tbl
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '<hdfs-location>'
TBLPROPERTIES ('avro.schema.url'='<schema-file>');
(or)
When importing data with NiFi, NiFi can pull the data in Avro format; by using the ExtractAvroMetadata processor we can extract the Avro schema, store it in HDFS, and create the table from that schema as above.
If you want to create the table in ORC format instead, the ConvertAvroToORC processor adds a hive.ddl attribute to the flowfile, and we can execute that DDL statement to create the ORC table in Hive.

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded with protobuf 3.0.0. Using Elephant-Bird/Hive I am able to put that data into Hive tables and query it. The problem I am having is partitioning the data.
This is the table create statement that I am using:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts ?
I see that you have created an EXTERNAL TABLE, so Hive only reads the data; it does not create or manage the directories on HDFS for you. You need to create the folder yourself using HDFS commands (or MR or Spark). If you check the HDFS location '/test/dt=20171117' you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then query the table. It will return 0 rows at first, but you can then add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your partition locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
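Putting that together, a sketch of the flow, assuming the table root is /test and the partition directories follow the dt=<value> convention (the paths come from the question; the exact layout may differ):

# shell: lay out the partition-style directory and move the existing files into it
hadoop fs -mkdir -p /test/dt=20171117
hadoop fs -mv /test/20171117/* /test/dt=20171117/

-- Hive: register the new partition(s) with the metastore
MSCK REPAIR TABLE test_messages;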

How to create Hive table on top of ORC format data?

I have source data in ORC format on HDFS.
I created an external Hive table on top of the HDFS data with the command below. I am using Hive 1.2.1.
CREATE EXTERNAL TABLE IF NOT EXISTS test( table_columns ... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
STORED AS ORC
LOCATION 'path'
TBLPROPERTIES ("orc.compress"="SNAPPY");
But when selecting data from the table I am getting this exception:
"protobuf.InvalidProtocolBufferException: Protocol message was too large"
Please help me to resolve this issue.
Thanks.

Hive error - Select * from table ;

I created an external table in Hive, which was created successfully.
create external table load_tweets(id BIGINT,text STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/data/tweets_raw';
But, when I did:
hive> select * from load_tweets;
I got the below error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.ByteArrayInputStream#5dfb0646; line: 1, column: 2]
Please suggest how to fix this. Is the Twitter output file that was created using Flume corrupted, or is it something else?
You'll need to do two additional things.
1) Put data into the file (perhaps using INSERT). Or maybe it's already there. In either case, you'll then need to
2) from Hive, msck repair table load_tweets;
For Hive tables, the schema and other meta-information about the data is stored in what's called the Hive Metastore -- it's actually a relational database under the covers. When you perform operations on Hive tables created without the LOCATION keyword (that is, internal, not external tables), Hive will automatically update the metastore.
But most Hive use cases involve data being appended to files that are updated by other processes, so external tables are common. If new partitions are created externally, you need to force the metastore to sync with the current state of the data using msck repair table <tablename>; before you can query them with Hive.
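As a sketch of the workflow described above (the partitioned table, partition column, and partition path are hypothetical additions for illustration):

-- hypothetical partitioned variant of the table from the question
CREATE EXTERNAL TABLE load_tweets_part(id BIGINT, text STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/data/tweets_raw';

-- after another process writes /user/cloudera/data/tweets_raw/dt=20180101/,
-- sync the metastore so the new partition becomes queryable
MSCK REPAIR TABLE load_tweets_part;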

tblproperties("skip.header.line.count"="1") added while creating table in hive is making some issue in Imapla

I have created a table in Hive and need to load the data from a CSV file, so while creating the table I set the table property tblproperties("skip.header.line.count"="1"), along the lines of the sketch below.
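A minimal sketch of that kind of DDL and load (the table name, columns, delimiter, and file path are hypothetical):

CREATE TABLE csv_demo (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA LOCAL INPATH '/tmp/input.csv' INTO TABLE csv_demo;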
I have loaded data into my table; this is my input file content:
After loading the data I can see the expected output in the Hive console, whereas when I fetch data from the same table in Impala, it does not skip the header as I specified at table-creation time.
The Impala result looks like the below.
Now my question is:
Why is Impala not able to pick up the table property and skip the header?
Please give me some information about it.
