Hive with EMRFS - hadoop

I am importing tables from Amazon RDS to Hive using Sqoop. The process works and the data is stored in the Hive default HDFS directory: /user/hive/warehouse.
I need to change the storage location from HDFS to EMRFS (S3).
My understanding is that I need to change the value of the property hive.metastore.warehouse.dir (in hive-site.xml on the master node) to s3://bucket/warehouse-location. However, it appears that I don't have permission to modify the hive-site.xml file.
I am looking for some advice on how best to do this.
Sudi

You will need sudo privileges to modify the hive-site.xml file on the master node (usually located at /etc/hive/conf/hive-site.xml).
If this is not an option, try setting this property before the cluster is started. An example with CloudFormation:
"Configurations" : [
{
"Classification" : "hive-site",
"ConfigurationProperties" : {
"hive.metastore.warehouse.dir" : "s3://your_s3_bucket/hive_warehouse/",
}
}
],
Or through the EMR console, under the "Edit software settings" section when creating the cluster.
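If you launch the cluster from the AWS CLI instead of CloudFormation, the same classification can be passed with --configurations (note the CLI/API format uses "Properties" where CloudFormation uses "ConfigurationProperties"). A minimal sketch; the release label, instance settings, and key pair are placeholders:
# create an EMR cluster whose hive-site classification points the Hive warehouse at S3
aws emr create-cluster \
  --name "hive-emrfs-warehouse" \
  --release-label emr-6.9.0 \
  --applications Name=Hive Name=Sqoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=<your-key-pair> \
  --configurations '[{"Classification":"hive-site","Properties":{"hive.metastore.warehouse.dir":"s3://your_s3_bucket/hive_warehouse/"}}]'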

sudo vi /etc/hive/conf/hive-site.xml
or
sudo -su root
vi /etc/hive/conf/hive-site.xml
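Once you can edit the file, the relevant property goes inside the <configuration> element; a minimal sketch, with the bucket name as a placeholder:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>s3://your_s3_bucket/hive_warehouse/</value>
</property>
You will typically need to restart the Hive services (HiveServer2 and the metastore) for the new value to take effect.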

If you are using Hive on EMR, it is recommended to keep the Hive metastore in an external DB or to use the Glue Data Catalog as the metastore.
For your specific concern:
Create the tables you want to import as external tables in Hive. While creating an external table you have to provide the LOCATION parameter, pointing at the S3 location of your table.
Example: suppose I have an S3 bucket named bucket-xyz and I want my data to be stored at s3://bucket-xyz/my_table, where my table name is my_table. Then I would create my_table as an external table using Hive:
CREATE EXTERNAL TABLE my_table (A VARCHAR(30), B DOUBLE)
ROW FORMAT DELIMITED ...
LOCATION 's3://bucket-xyz/my_table';
After this, when you insert data into this table using Hive, Hive will store the data in the S3 location you specified.
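To tie this back to the original Sqoop import, one option is to have Sqoop write its output directly under the S3 prefix that backs the external table, so no separate load step is needed. A minimal sketch, assuming EMRFS is available to Sqoop on the cluster; the RDS endpoint, credentials, and bucket are placeholders:
# import the RDS table as comma-delimited text under the external table's LOCATION
sqoop import \
  --connect jdbc:mysql://<your-rds-endpoint>:3306/your_db \
  --username your_user \
  --password-file /user/hadoop/rds.password \
  --table my_table \
  --target-dir s3://bucket-xyz/my_table \
  --fields-terminated-by ',' \
  -m 1
Because the table is external, Hive simply reads whatever files appear under that LOCATION.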

Related

How do I create a Hive external table on top of ECS S3 object storage using the "s3a://" scheme

I am trying to create a Hive external table using Beeline on top of S3 object storage using the "s3a://" scheme. I have followed the official Cloudera documentation and configured the properties below.
fs.s3a.access.key
fs.s3a.secret.key
fs.s3a.endpoint
I am able to run hadoop fs -Dfs.s3a.access.key=<access_key> -Dfs.s3a.secret.key=<secret_key> -Dfs.s3a.endpoint=<host_port> -ls s3a://<bucket_name>/dir/ successfully and can see the directories, so I know my credentials, bucket access, and overall Hadoop setup are valid.
However, when I attempt to access the same S3 resources from Hive (Beeline), e.g. run CREATE EXTERNAL TABLE statements using LOCATION 's3a://<bucket_name>/dir/', it fails.
Configurations
set fs.s3a.access.key=<access_key>;
set fs.s3a.secret.key=<secret_key>;
set fs.s3a.endpoint=<host:port>;
Query
CREATE EXTERNAL TABLE NAME_TEST_S3(name string, age int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TextFile LOCATION 's3a://<bucket_name>/dir/'
I am getting below error.
ERROR : FAILED: Execution error, return code 40000 from org.apache.hadoop.hive.ql.ddl.DDLTask. MetaException(message:Got exception: java.nio.file.AccessDeniedException <bucket_name>: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS Credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)) (state=08S01, code=40000)
Note: I am using CDH-7.1.6, Hive 3.1.3, and S3 object storage. I am able to access the same S3 resources using hadoop fs as well as the Spark Scala read API.
Anyone have any idea what's missing from this equation?
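For reference, the same three properties can also be supplied when launching Beeline rather than with in-session set commands; a sketch with placeholder values, shown only as an alternative way to pass the configuration, not as a confirmed fix for the error above:
beeline -u 'jdbc:hive2://<hiveserver2_host>:10000/default' \
  --hiveconf fs.s3a.access.key=<access_key> \
  --hiveconf fs.s3a.secret.key=<secret_key> \
  --hiveconf fs.s3a.endpoint=<host:port>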

Where does Hive store tables locally?

I have created a Hive table and am trying to locate where Hive has created the HDFS file for this table locally. The Hive version is 2.3.0.
I tried this command to retrieve the location of my table:
hive> describe formatted table_name;
I got this as output (only showing the relevant part; tb2 is the table_name in this case):
Location: hdfs://localhost:54310/user/hive/warehouse/tb2
I have no clue how to get to hdfs://localhost:54310 locally (from the terminal). Also, the table is not present in the Hadoop default directory.
Try running the command below to view the Hive table. In the output you will find a folder named after your table:
hdfs dfs -ls /user/hive/warehouse/tb2
A table in Hive is basically a folder in HDFS.
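If you want to inspect or pull down those files from the terminal, a minimal sketch using the full NameNode URI from the DESCRIBE output above (the local destination path is a placeholder):
# list the table directory, addressing the NameNode explicitly
hdfs dfs -ls hdfs://localhost:54310/user/hive/warehouse/tb2
# copy the table's files from HDFS to the local filesystem
hdfs dfs -get hdfs://localhost:54310/user/hive/warehouse/tb2 /tmp/tb2_local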

Export data into Hive from a node without Hadoop (HDFS) installed

Is it possible to export data to a Hive server from a node that does not have Hadoop (HDFS) or Sqoop installed?
I would read the data from a source, which could be MySQL or just files in some directory, and then use the Hadoop core classes or something like Sqoop to export the data into my Hadoop cluster.
I am programming in Java.
Since your final destination is a Hive table, I would suggest the following:
Create the final Hive table.
Use the following command to load data from the other node:
LOAD DATA LOCAL INPATH '<full local path>/kv1.txt' OVERWRITE INTO TABLE table_name;
Refer to the Hive documentation on LOAD DATA for the details.
Using Java, you could use the JSch library to invoke these shell commands remotely, as sketched below.
Hope this helps.
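A minimal sketch of the manual steps that JSch would automate, assuming an SSH-reachable host (here called hive-node) that has the Hive client installed; hostnames and paths are placeholders:
# copy the local file to a node that has the Hive client and cluster access
scp /data/kv1.txt hadoop@hive-node:/tmp/kv1.txt
# run the load remotely; JSch would issue this same command programmatically from Java
ssh hadoop@hive-node "hive -e \"LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE table_name;\""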

Hive not fully honoring fs.default.name/fs.defaultFS value in core-site.xml

I have the NameNode service installed on a machine called hadoop.
The core-site.xml file has the fs.defaultFS (equivalent to fs.default.name) set to the following:
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop:8020</value>
</property>
I have a very simple table called test_table that currently exists in the Hive server on HDFS. That is, it is stored under /user/hive/warehouse/test_table. It was created using a very simple command in Hive:
CREATE TABLE test_table (record_id INT);
If I attempt to load data into the table locally (that is, using LOAD DATA LOCAL), everything proceeds as expected. However, if the data is stored on HDFS and I want to load from there, an issue occurs.
I run a very simple query to attempt this load:
hive> LOAD DATA INPATH '/user/haduser/test_table.csv' INTO TABLE test_table;
Doing so leads to the following error:
FAILED: SemanticException [Error 10028]: Line 1:17 Path is not legal ''/user/haduser/test_table.csv'':
Move from: hdfs://hadoop:8020/user/haduser/test_table.csv to: hdfs://localhost:8020/user/hive/warehouse/test_table is not valid.
Please check that values for params "default.fs.name" and "hive.metastore.warehouse.dir" do not conflict.
As the error states, it is attempting to move from hdfs://hadoop:8020/user/haduser/test_table.csv to hdfs://localhost:8020/user/hive/warehouse/test_table. The first path is correct because it references hadoop:8020; the second path is incorrect, because it references localhost:8020.
The core-site.xml file clearly states to use hdfs://hadoop:8020. The hive.metastore.warehouse.dir value in hive-site.xml correctly points to /user/hive/warehouse. Thus, I doubt this error message is pointing at the real problem.
How can I get the Hive server to use the correct NameNode address when creating tables?
I found that the Hive metastore tracks the location of each table. You can see that location by running the following in the Hive console:
hive> DESCRIBE EXTENDED test_table;
This issue occurs if the NameNode in core-site.xml was changed while the metastore service was still running. To resolve it, restart the metastore service on that machine:
$ sudo service hive-metastore restart
The metastore will then use the new fs.defaultFS for newly created tables.
Already Existing Tables
The location for tables that already exist can be corrected by running the following commands. These were obtained from the Cloudera documentation on configuring the Hive metastore for high availability.
$ /usr/lib/hive/bin/metatool -listFSRoot
...
Listing FS Roots..
hdfs://localhost:8020/user/hive/warehouse
hdfs://localhost:8020/user/hive/warehouse/test.db
Correcting the NameNode location:
$ /usr/lib/hive/bin/metatool -updateLocation hdfs://hadoop:8020 hdfs://localhost:8020
Now the listed NameNode is correct.
$ /usr/lib/hive/bin/metatool -listFSRoot
...
Listing FS Roots..
hdfs://hadoop:8020/user/hive/warehouse
hdfs://hadoop:8020/user/hive/warehouse/test.db
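As a final check, you can confirm that an individual table now resolves to the new NameNode; a small sketch using the table from the question:
hive -e "DESCRIBE FORMATTED test_table;" | grep Location
# the Location line should now reference hadoop:8020 rather than localhost:8020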

Hive doesn't show tables when started from another directory

I installed Hive CDH4 on RHEL. Whenever I start Hive from a directory, it creates a metastore_db dir and a derby.log file in it. Is this normal behaviour? Moreover, when I create a table after starting Hive from a particular directory, I'm unable to see that table when I start Hive from a different directory.
For example,
Let's say I started Hive from my home dir, i.e. $HOME or ~, and I created a table in Hive. But when I start Hive from /path/to/my/Hive/directory and do a show tables, the table I just created doesn't show up. However, if I start Hive from my home directory again and look for tables, I'm able to see the table.
Also, if I make changes to hive-site.xml, they are simply ignored by Hive.
Please help me figure out where I am going wrong.
You can change this and use a single metastore_db by updating the "javax.jdo.option.ConnectionURL" property in the "$HIVE_HOME/conf/hive-default.xml" file as below:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/path/to/my/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
Where /path/to/my/metastore_db is the location where you want to keep your metastore DB.
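To confirm Hive is actually picking up this setting rather than the per-directory Derby default, a quick check from the Hive CLI (the path is the placeholder used above):
hive -e "SET javax.jdo.option.ConnectionURL;"
# expected to print javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/path/to/my/metastore_db;create=true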
