Issue with loading data into Hive - Hadoop

We have launched two EMR clusters in AWS and installed Hadoop on both, with hive-0.11.0 on one cluster and hive-0.13.1 on the other.
Everything seems to be working fine, but when we try to load data into a table it gives the error below, and this happens on both Hive servers.
ERROR MESSAGE:
An error occurred when executing the SQL command: load data inpath
's3://buckername/export/employee_1/' into table employee_2 Query
returned non-zero code: 10028, cause: FAILED: SemanticException [Error
10028]: Line 1:17 Path is not legal
''s3://buckername/export/employee_1/'': Move from:
s3://buckername/export/employee_1 to:
hdfs://XXX.XX.XXX.XX:X000/mnt/hive_0110/warehouse/employee_2 is not
valid. Please check that values for params "default.fs.name" and
"hive.metastore.warehouse.dir" do not conflict. [SQL State=42000, DB
Errorcode=10028]
I searched for the meaning of this message and found this link, but when I tried to execute the command suggested there, it also gave the error below.
Command:
--service metatool -updateLocation hdfs://XXX.XX.XXX.XX:X000 hdfs://XXX.XX.XXX.XX:X000
Initializing HiveMetaTool.. HiveMetaTool:Parsing failed. Reason:
Unrecognized option: -hiveconf
Any help with this will be really appreciated.

LOAD does not support S3. It is best practice to leave the data in S3 and use it as a Hive external table instead of copying it to HDFS. Some references: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html and the question "When you create an external table in Hive with an S3 location, is the data transferred?"
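For example, a minimal sketch of that approach, keeping the data where it sits in S3 (the table name and column definitions are invented for illustration; only the bucket path comes from the question):
CREATE EXTERNAL TABLE employee_ext (emp_id INT, emp_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://buckername/export/employee_1/';
Queries against employee_ext then read directly from S3, so no LOAD or copy into the HDFS warehouse is needed.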

If you have installed Hive on your Hadoop cluster, the default storage for Hive data is HDFS (hive.metastore.warehouse.dir=/user/hive/warehouse).
As a workaround, you can copy the file from the S3 file system to HDFS and then load it into Hive from HDFS.
You will most probably also need to modify the parameter "hive.exim.uri.scheme.whitelist=hdfs,pfile" to load data from the S3 file system.
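A rough sketch of that workaround, reusing the S3 path and table name from the question (the staging directory on HDFS is just an example; hadoop distcp or s3-dist-cp could be used instead of hadoop fs -cp):
$ hadoop fs -mkdir -p /tmp/employee_1
$ hadoop fs -cp s3://buckername/export/employee_1/* /tmp/employee_1/
hive> LOAD DATA INPATH '/tmp/employee_1/' INTO TABLE employee_2;
Once the files are on HDFS, the LOAD becomes a plain HDFS-to-warehouse move and the s3:// scheme is no longer involved.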

Related

How do I create a Hive external table on top of ECS S3 object storage using the "s3a://" protocol

I am trying to create a Hive external table using Beeline on top of S3 object storage using the "s3a://" scheme. I have followed the official Cloudera documentation and configured the properties below.
fs.s3a.access.key
fs.s3a.secret.key
fs.s3a.endpoint
I am able to run hadoop fs -Dfs.s3a.access.key=<access_key> -Dfs.s3a.secret.key=<secret_key> -Dfs.s3a.endpoint=<host_port> -ls s3a://<bucket_name>/dir/ successfully and can see the directories, so I know my credentials, bucket access, and overall Hadoop setup are valid.
However, when I attempt to access the same S3 resources from Hive (Beeline), e.g. run CREATE EXTERNAL TABLE statements using LOCATION 's3a://[bucket-name]/dir/', it fails.
Configurations
set fs.s3a.access.key=<access_key>;
set fs.s3a.secret.key=<secret_key>;
set fs.s3a.endpoint=<host:port>;
Query
CREATE EXTERNAL TABLE NAME_TEST_S3(name string, age int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TextFile LOCATION 's3a://<bucket_name>/dir/'
I am getting the error below.
ERROR : FAILED: Execution error, return code 40000 from
org.apache.hadoop.hive.ql.ddl.DDLTask. MetaException(message:Got
exception: java.nio.file.AccessDeniedException <bucket_name>:
org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS
Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS Credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)) (state=08S01, code=40000)
Note: I am using CDH 7.1.6, Hive 3.1.3, and S3 object storage. I am able to access the same S3 resources using hadoop fs as well as the Spark Scala read API.
Anyone have any idea what's missing from this equation?

Error while creating Hive table

Before creating the twitter table I added this:
ADD JAR hdfs:///user/hive/warehouse/hive-serdes-1.0-SNAPSHOT.jar;
I got the following error when creating the twitter table in Hive:
Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde:
com.cloudera.hive.serde.JSONSerDe
Move the JAR from HDFS to the local file system.
Then try to add the JAR in the Hive terminal.
Then try the query on the Twitter table again.
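For instance, roughly (the jar path comes from the question; the local target directory is just an example):
$ hadoop fs -get /user/hive/warehouse/hive-serdes-1.0-SNAPSHOT.jar /tmp/
hive> ADD JAR /tmp/hive-serdes-1.0-SNAPSHOT.jar;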
Ideally speaking, you can add jars from either the local file system or HDFS; it looks like the problem could be something else here.
I would recommend following the sequence of steps below; a combined example follows the list.
1. List the file on HDFS to make sure it exists:
hadoop fs -ls hdfs://namenode_hostname:8020/user/hive/warehouse/hive-serdes-1.0-SNAPSHOT.jar
2. Add the jar in Hive using the full path as above, and verify the addition using the list jars command in the Hive CLI:
hive> list jars;
3. Use the serde in the create table statement with proper syntax, as shown here for example:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
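Putting those steps together, a hedged sketch (namenode_hostname is the placeholder from step 1, and the tweets table name and columns are invented for illustration; only the serde class comes from the error message):
hive> ADD JAR hdfs://namenode_hostname:8020/user/hive/warehouse/hive-serdes-1.0-SNAPSHOT.jar;
hive> list jars;
hive> CREATE EXTERNAL TABLE tweets (id BIGINT, created_at STRING, text STRING)
    > ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    > LOCATION '/user/hive/warehouse/tweets/';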

Error creating hive table: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException

I have a multi-node Hadoop cluster and I have now installed Hive on the namenode.
I'm trying to create some Hive tables from files stored in HDFS, but I'm getting this strange error:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:hdfs://namenode-VirtualBox:9000/data/posts
/posts.tbl is not a directory or unable to create one)
hive>
But then I tried to create a table from a file stored in HDFS that is only 2 KB, and the table was created successfully.
When I try to create a table from a larger file stored in HDFS, around 200 MB (and maybe less), I get that error.
Do you know why this error might be happening?
I believe that somewhere in the code the URL hdfs://namenode-VirtualBox:9000/data/posts/posts.tbl is parsed, and the URL should not have the file suffix (.tbl); it should just be ".../posts".
I refer you to: Unable to Create Table in HIVE reading a CSV from HDFS
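As a hedged illustration of that fix (the column list and field delimiter are made up; only the HDFS path comes from the question), point the table at the directory that contains posts.tbl rather than at the file itself:
$ hadoop fs -ls hdfs://namenode-VirtualBox:9000/data/posts
CREATE EXTERNAL TABLE posts (post_id INT, body STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'hdfs://namenode-VirtualBox:9000/data/posts';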

Hive not fully honoring fs.default.name/fs.defaultFS value in core-site.xml

I have the NameNode service installed on a machine called hadoop.
The core-site.xml file has the fs.defaultFS (equivalent to fs.default.name) set to the following:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop:8020</value>
</property>
I have a very simple table called test_table that currently exists in the Hive server on the HDFS. That is, it is stored under /user/hive/warehouse/test_table. It was created using a very simple command in Hive:
CREATE TABLE test_table (record_id INT);
If I attempt to load data into the table locally (that is, using LOAD DATA LOCAL), everything proceeds as expected. However, if the data is stored on the HDFS and I want to load from there, an issue occurs.
I run a very simple query to attempt this load:
hive> LOAD DATA INPATH '/user/haduser/test_table.csv' INTO TABLE test_table;
Doing so leads to the following error:
FAILED: SemanticException [Error 10028]: Line 1:17 Path is not legal ''/user/haduser/test_table.csv'':
Move from: hdfs://hadoop:8020/user/haduser/test_table.csv to: hdfs://localhost:8020/user/hive/warehouse/test_table is not valid.
Please check that values for params "default.fs.name" and "hive.metastore.warehouse.dir" do not conflict.
As the error states, it is attempting to move from hdfs://hadoop:8020/user/haduser/test_table.csv to hdfs://localhost:8020/user/hive/warehouse/test_table. The first path is correct because it references hadoop:8020; the second path is incorrect, because it references localhost:8020.
The core-site.xml file clearly states to use hdfs://hadoop:8020. The hive.metastore.warehouse.dir value in hive-site.xml correctly points to /user/hive/warehouse. Thus, I doubt this error message has any true value.
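For reference, the corresponding hive-site.xml entry looks like this (a sketch matching what the question describes, not copied from the actual file):
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>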
How can I get the Hive server to use the correct NameNode address when creating tables?
I found that the Hive metastore tracks the location of each table. You can see that location by running the following in the Hive console:
hive> DESCRIBE EXTENDED test_table;
Thus, this issue occurs if the NameNode address in core-site.xml was changed while the metastore service was still running. To resolve it, the service should be restarted on that machine:
$ sudo service hive-metastore restart
The metastore will then use the new fs.defaultFS for newly created tables.
Already Existing Tables
The location of tables that already exist can be corrected by running the following commands. These were obtained from the Cloudera documentation on configuring the Hive metastore for high availability.
$ /usr/lib/hive/bin/metatool -listFSRoot
...
Listing FS Roots..
hdfs://localhost:8020/user/hive/warehouse
hdfs://localhost:8020/user/hive/warehouse/test.db
Correcting the NameNode location:
$ /usr/lib/hive/bin/metatool -updateLocation hdfs://hadoop:8020 hdfs://localhost:8020
Now the listed NameNode is correct.
$ /usr/lib/hive/bin/metatool -listFSRoot
...
Listing FS Roots..
hdfs://hadoop:8020/user/hive/warehouse
hdfs://hadoop:8020/user/hive/warehouse/test.db

Loading files into Hive through JDBC

I'm getting this error when trying to load a file into Hive through its JDBC driver. The Hive instance is running on a VM. The file loads perfectly fine when I load it through the Hive command line. The file is located on the same instance as Hive. I hope JDBC supports the LOAD command.
java.sql.SQLException: Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Line 1:23 Invalid path ''/home/cloudera/Desktop/test.csv'': No files matching path file:/home/cloudera/Desktop/test.csv
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
at Main.main(Main.java:55)
Since Hive in turn runs in a MapReduce environment, you need to provide an HDFS path for the CSV file (not a local path) when using Hive JDBC. When running through the Hive CLI, a local path works because the CLI takes care of uploading the file to HDFS before loading it into the table.
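So, as a hedged sketch, first push the file onto HDFS and then issue the non-LOCAL form of the statement through the same JDBC connection (the staging directory and the table name my_table are just examples):
$ hadoop fs -mkdir -p /user/cloudera/staging
$ hadoop fs -put /home/cloudera/Desktop/test.csv /user/cloudera/staging/
-- run through the JDBC Statement, using the HDFS path instead of the local one
LOAD DATA INPATH '/user/cloudera/staging/test.csv' INTO TABLE my_table;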
