Creating HIVE External Table from read-only folder - hadoop

I have access to a Hadoop cluster where I have read-only access to the HDFS folder containing the data (in this case /data/table1/data_PART0xxx). I would like to build a Hive EXTERNAL table that would give me an easier way to query the data.
So, I created a table as follows:
CREATE EXTERNAL TABLE myDB.Table1 (column1 STRING, column2 STRING, column3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{10})(.{16})(.{10})"
)
LOCATION '/data/table1';
However, it gives me an error:
Error: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [my_user] does not have [ALL] privilege on [hdfs://hadoopcluster/data/table1] (state=42000,code=40000)
which I understand, since I don't have the right to write anything. But how can I do this so that the table is explicitly defined as read-only?
Edit: I know I can set it up as CREATE TEMPORARY EXTERNAL TABLE ... but I would like a somewhat more permanent solution. Further, I also have no privileges to set the folder /data/table1 to mode 777. Isn't there a way to tell Hive that this table is meant only to be queried and that no further data will be added (at least not through Hive)?
Edit: Further, I have seen a JIRA ticket for this, open since 2009 and marked as important, but it is still not resolved.
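For reference, a minimal sketch of the temporary-table workaround mentioned in the edit above, reusing the definition from the question (the database prefix is dropped here; a temporary table only survives the current session):
-- session-scoped variant of the question's table; it disappears when the session ends
CREATE TEMPORARY EXTERNAL TABLE Table1 (column1 STRING, column2 STRING, column3 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{10})(.{16})(.{10})"
)
LOCATION '/data/table1';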

Related

Duplicate directory in HDFS

I've created an external Hive table stored in "/cntt_sondn/hive/tables/test_orc" in HDFS using the command: create external table test_lab.test_orc (col1 string, col2 string) stored as orc location '/cntt_sondn/hive/tables/test_orc';. It seems OK and the directory was created successfully by user hive. Then I use NiFi to put an ORC file into the directory; there's no problem, and no error or warning is thrown. However, when I browse to the directory in the NameNode UI, there are two directories with the same name "test_orc", one created by hive, the other created by my user. See image below:
In addition, it seems NiFi puts my ORC file in the directory owned by user ra_vtg, so the directory created by user hive is empty. Therefore, no data is inserted into the Hive table.
Please explain why these strange things can happen.
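One way to narrow this down is to compare the directory the metastore actually registered for the table with the path NiFi is writing to. A minimal diagnostic sketch from the Hive CLI, using the table and paths from the question:
-- shows the Location: the metastore recorded for the table
DESCRIBE FORMATTED test_lab.test_orc;
-- list the parent directory seen in the NameNode UI to compare the two test_orc entries
dfs -ls /cntt_sondn/hive/tables;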

Unable to partition hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so you cannot add or drop partitions using Hive; you need to create the folder using HDFS, MR, or Spark. An EXTERNAL table's data can be read by Hive but is not managed by Hive. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'" and then try to query the table. Although it will return 0 rows at first, you can add data to that folder and read it from Hive, as sketched below.
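A rough sketch of that suggestion, run from the Hive CLI (the local file name here is a placeholder, not something from the question):
-- create the suggested folder, copy a data file into it, then query
dfs -mkdir -p /test/20171117;
dfs -put /tmp/messages.seq /test/20171117/;
SELECT * FROM test_messages LIMIT 1;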
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
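Putting those points together, a hedged sketch that reuses the question's DDL (the exact file layout under /test is an assumption):
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE "com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH SERDEPROPERTIES ("serialization.class" = "path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE
LOCATION '/test';
-- expects partition directories named like /test/dt=20171117/
MSCK REPAIR TABLE test_messages;
SHOW PARTITIONS test_messages;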

Hive error - Select * from table;

I created the following external table in Hive, and it was created successfully:
create external table load_tweets(id BIGINT,text STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/data/tweets_raw';
But, when I did:
hive> select * from load_tweets;
I got the below error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.ByteArrayInputStream#5dfb0646; line: 1, column: 2]
Please suggest how to fix this. Is the Twitter output file that was created using Flume corrupted, or is it something else?
You'll need to do two additional things.
1) Put data into the file (perhaps using INSERT). Or maybe it's already there. In either case, you'll then need to
2) From Hive, run msck repair table load_tweets;
For Hive tables, the schema and other meta-information about the data is stored in what's called the Hive metastore -- it's actually a relational database under the covers. When you perform operations on Hive tables created without the LOCATION keyword (that is, internal rather than external tables), Hive will automatically update the metastore.
But in most Hive use cases, data is appended to files by other processes, and thus external tables are common. If new partitions are created externally, then before you can query them with Hive you need to force the metastore to sync with the current state of the data using msck repair table <tablename>;.
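As an illustration of that sync step, assuming a partitioned external table whose partition directories are written by an outside process (the table name, columns, and paths below are hypothetical):
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_by_day (id BIGINT, text STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/cloudera/data/tweets_by_day';
-- after another process writes files under .../dt=2018-01-16/,
-- register the new partition before querying it:
MSCK REPAIR TABLE tweets_by_day;
SELECT * FROM tweets_by_day WHERE dt = '2018-01-16' LIMIT 1;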

Partitioning a table in Hadoop

I am going through an example in the O'Reilly Hadoop book about partitioning a table. Here is the code I am running.
This code creates a table; it seems to execute without errors.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
When I run the command below, it returns nothing, which is suspicious.
SHOW PARTITIONS logs;
When I run the next part of the example code, I get an Invalid path error.
LOAD DATA LOCAL INPATH '/user/paul/files/dt=2010-01-01/country=GB/test.out'
INTO TABLE logs
PARTITION (dt='2001-01-01', country='GB');
I have definitely created the file, and I can browse it through Hue at the following location.
/user/paul/files/dt=2010-01-01/country=GB
This is the specific error.
FAILED: SemanticException Line 1:23 Invalid path ''/user/paul/files/dt=2010-01-01/country=GB/test.out'': No files matching path file:/user/paul/files/dt=2010-01-01/country=GB/test.out
Am I missing something blatantly obvious here?
It just means the file was not found on the local file system at '/user/paul/files/dt=2010-01-01/country=GB/test.out'.
Is the file that you created, '/user/paul/files/dt=2010-01-01/country=GB/test.out', stored in HDFS or on the local file system? If it is in HDFS, then you can't use LOCAL INPATH.
Remove LOCAL from the statement. I don't remember exactly, but you may also need to alter the table beforehand: ALTER TABLE table_name ADD PARTITION (partCol = 'value1') LOCATION 'loc1'; (see the sketch below).
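A sketch of those two suggestions against the question's table (whether the file actually lives in HDFS is the open question here):
-- load from HDFS instead of the local file system;
-- note: this moves the file into the table's warehouse directory
LOAD DATA INPATH '/user/paul/files/dt=2010-01-01/country=GB/test.out'
INTO TABLE logs
PARTITION (dt = '2010-01-01', country = 'GB');
-- or register the existing HDFS directory as a partition without moving its files
ALTER TABLE logs ADD PARTITION (dt = '2010-01-01', country = 'GB')
LOCATION '/user/paul/files/dt=2010-01-01/country=GB';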

Error creating a Hive table in HDInsight from a different blob container: Path is not legal

CREATE TABLE test1 (Column1 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername' OVERWRITE INTO TABLE test1;
Loading the data generates the following error:
FAILED: Error in semantic analysis: Line 1:18 Path is not legal
''asv://hivetest@mystorageaccount.blob.core.windows.net/foldername'':
Move from:
asv://hivetest@mystorageaccount.blob.core.windows.net/foldername to:
asv://hdi1@hdinsightstorageaccount.blob.core.windows.net/hive/warehouse/test1
is not valid. Please check that values for params "default.fs.name"
and "hive.metastore.warehouse.dir" do not conflict.
The container hivetest is not my default HDInsight container. It is even located on a different storage account. However, the problem is probably not with the account credentials, as I have edited core-site.xml to include mystorageaccount.
How can I load data from a non-default container?
Apparently it's impossible by design to load data into a Hive table from a non-default container. The workaround suggested by the answer in the link is to use an external table.
I was trying to use a non-external table so I could take advantage of partitioning, but apparently it's possible to partition even an external table, as explained here.
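A hedged sketch of that external-table workaround, pointing the table directly at the non-default container (the container and account names are the question's; the table name and partition column are assumptions):
CREATE EXTERNAL TABLE test1_ext (Column1 STRING)
PARTITIONED BY (LoadDate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'asv://hivetest@mystorageaccount.blob.core.windows.net/foldername';
-- partition directories under that folder can then be registered with
-- ALTER TABLE test1_ext ADD PARTITION ... LOCATION ... or MSCK REPAIR TABLE test1_ext;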
