How do I create a Hive external table on top of ECS S3 object storage using the "s3a://" protocol - hadoop

I am trying to create a Hive external table using Beeline on top of S3 object storage using the "s3a://" scheme. I have followed the official Cloudera documentation and configured the properties below.
fs.s3a.access.key
fs.s3a.secret.key
fs.s3a.endpoint
I am able to run hadoop fs -Dfs.s3a.access.key=<access_key> -Dfs.s3a.secret.key=<secret_key> -Dfs.s3a.endpoint=<host_port> -ls s3a://<bucket_name>/dir/ successfully and can see the directories, so I know my credentials, bucket access, and overall Hadoop setup are valid.
However, when I attempt to access the same S3 resources from Hive (Beeline), e.g. run CREATE EXTERNAL TABLE statements using LOCATION 's3a://[bucket-name]/dir/', it fails.
Configurations
set fs.s3a.access.key=<access_key>;
set fs.s3a.secret.key=<secret_key>;
set fs.s3a.endpoint=<host:port>;
Query
CREATE EXTERNAL TABLE NAME_TEST_S3(name string, age int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TextFile LOCATION 's3a://<bucket_name>/dir/'
I am getting the below error.
ERROR : FAILED: Execution error, return code 40000 from
org.apache.hadoop.hive.ql.ddl.DDLTask. MetaException(message:Got
exception: java.nio.file.AccessDeniedException <bucket_name>:
org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS
Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY)) (state=08S01, code=40000)
Note: I am using CDH 7.1.6, Hive 3.1.3, and S3 object storage. I am able to access the same S3 resources using hadoop fs as well as the Spark Scala read API.
Anyone have any idea what's missing from this equation?
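One thing worth checking, as an assumption rather than something confirmed above: the MetaException is raised on the HiveServer2/metastore side, and those services may not see per-session set overrides for the fs.s3a.* keys (which are also often on hive.conf.restricted.list in CDH/CDP). A common alternative is to store the keys in a Hadoop credential provider and reference it in the cluster configuration, roughly like this (the JCEKS path is just an example):

# Store the S3A keys in a JCEKS credential store on HDFS (path is hypothetical)
hadoop credential create fs.s3a.access.key -value <access_key> -provider jceks://hdfs/user/hive/s3a.jceks
hadoop credential create fs.s3a.secret.key -value <secret_key> -provider jceks://hdfs/user/hive/s3a.jceks

Then point the services at the store and the ECS endpoint via core-site.xml (e.g. through the Cloudera Manager safety valves), so HiveServer2 and the Hive Metastore resolve the credentials themselves:

hadoop.security.credential.provider.path=jceks://hdfs/user/hive/s3a.jceks
fs.s3a.endpoint=<host:port>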

Related

Hive with emrfs

I am importing tables from Amazon RDS to Hive using Sqoop. The process is working and the data is being stored in the Hive default HDFS directory: /user/hive/warehouse.
I need to change the storage location from HDFS to EMRFS (S3).
It is my understanding that I need to change (in hive-site.xml on the master node) the value of the property hive.metastore.warehouse.dir to the s3://bucket/warehouse-location. It appears that I don't have permission to modify the file hive-site.xml.
I am looking for some advice on how best to do it.
Sudi
You will need sudo privileges to modify the hive-site.xml file on the master node (usually located at /etc/hive/conf/hive-site.xml).
If this is not an option, try setting this property before the cluster is started. An example with CloudFormation:
"Configurations" : [
  {
    "Classification" : "hive-site",
    "ConfigurationProperties" : {
      "hive.metastore.warehouse.dir" : "s3://your_s3_bucket/hive_warehouse/"
    }
  }
],
Or through the EMR dialog, in the "Edit Software Settings" section.
sudo vi /etc/hive/conf/hive-site.xml
or
sudo -su root
vi /etc/hive/conf/hive-site.xml
If you are using Hive in EMR, it is recommended to keep the Hive metastore in an external DB or to use the Glue Data Catalog as the Hive metastore.
As for your concern:
Create the tables you want to import as external tables in Hive. While creating the external table, you will have to provide the LOCATION parameter as the S3 location of your table.
Example: Suppose I have an S3 bucket named bucket-xyz and I want my data to be stored in the s3://bucket-xyz/my-table location, where my table name is my-table. Then I will create my-table as an external table using Hive.
CREATE EXTERNAL TABLE `my-table` (A VARCHAR(30), B DOUBLE)
ROW FORMAT DELIMITED ...
LOCATION 's3://bucket-xyz/my-table';
After this, when you insert data into this table using Hive, Hive will store the data in the S3 location you specified.
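As a small follow-up sketch (the table and bucket names are the hypothetical ones from the example above, and INSERT ... VALUES assumes a reasonably recent Hive):

DESCRIBE FORMATTED `my-table`;                              -- the Location: field should show s3://bucket-xyz/my-table
INSERT INTO TABLE `my-table` VALUES ('example-row', 1.0);   -- the resulting files land under that S3 prefix, not the HDFS warehouse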

Issue with load data into HIVE

We have launched two EMR clusters in AWS and installed Hadoop with hive-0.11.0 on one and hive-0.13.1 on the other.
Everything seems to be working fine, but while trying to load data into a table it gives the below error, and it happens on both Hive servers.
ERROR MESSAGE:
An error occurred when executing the SQL command: load data inpath
's3://buckername/export/employee_1/' into table employee_2 Query
returned non-zero code: 10028, cause: FAILED: SemanticException [Error
10028]: Line 1:17 Path is not legal
''s3://buckername/export/employee_1/'': Move from:
s3://buckername/export/employee_1 to:
hdfs://XXX.XX.XXX.XX:X000/mnt/hive_0110/warehouse/employee_2 is not
valid. Please check that values for params "default.fs.name" and
"hive.metastore.warehouse.dir" do not conflict. [SQL State=42000, DB
Errorcode=10028]
I searched for the reason and meaning of this message and found this link, but when I tried to execute the command suggested in the given link, it also gives the below error.
Command:
--service metatool -updateLocation hdfs://XXX.XX.XXX.XX:X000 hdfs://XXX.XX.XXX.XX:X000
Initializing HiveMetaTool.. HiveMetaTool:Parsing failed. Reason:
Unrecognized option: -hiveconf
Any help in this will be really appreciated.
LOAD does not support S3. It is best practice to leave the data in S3 and just use it as a Hive external table instead of copying the data to HDFS. Some references: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html and the question "When you create an external table in Hive with an S3 location is the data transfered?"
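For illustration, a rough sketch of that recommendation against the S3 path from the question; the table name and column list are hypothetical, since the real schema of employee_1 is not shown:

CREATE EXTERNAL TABLE employee_2_ext (      -- hypothetical name and columns
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://buckername/export/employee_1/';
-- Queries read directly from S3; no LOAD or copy to HDFS is required.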
If you have installed Hive on your Hadoop cluster, the default storage for Hive data is HDFS (hive.metastore.warehouse.dir=/user/hive/warehouse).
As a workaround you can copy the file from the S3 file system to HDFS, and then load the file into Hive from HDFS.
Most probably you will also need to modify the parameter "hive.exim.uri.scheme.whitelist=hdfs,pfile" to load the data from the S3 file system.
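A rough sketch of that workaround, using the paths from the question (the HDFS staging directory is a placeholder):

hadoop distcp s3://buckername/export/employee_1/ /tmp/employee_1/    # copy from S3 into HDFS first

Then, from Hive:

LOAD DATA INPATH '/tmp/employee_1/' INTO TABLE employee_2;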

Transferring scripts from s3 to emr master

I've managed to get data files distributed on EMR clusters, but can't get the simple Python scripts copied over to the master instance to run the Hadoop job.
Using the AWS CLI (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, checking the defaults in the IAM roles section.
I've set up the two IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all available S3 access permissions.
And I've made sure to run aws configure for both ec2-user and hadoop (confirming the right creds were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.
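One quick check that might narrow it down (the bucket and object names are the placeholders from the question): see which identity each OS user is actually presenting to S3.

# Run once as hadoop and once as ec2-user
aws sts get-caller-identity                 # shows the account/role or access key actually in use
aws s3 ls s3://the_bucket/the_script.py     # confirms whether that identity can read the object

If the two users resolve to different credentials (instance profile vs. ~/.aws/config), that would explain why distcp works while aws s3 cp is forbidden.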

Loading files into hive through JDBC

I'm getting this error when trying to load a file into Hive through its JDBC driver. The Hive instance is running on a VM. The file loads perfectly fine when I load it through the Hive command line. The file is located on the same instance as Hive. I hope JDBC supports the LOAD command.
java.sql.SQLException: Query returned non-zero code: 10, cause: FAILED: Error in semantic analysis: Line 1:23 Invalid path ''/home/cloudera/Desktop/test.csv'': No files matching path file:/home/cloudera/Desktop/test.csv
at org.apache.hadoop.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:189)
at Main.main(Main.java:55)
Since Hive in turn runs in a MapReduce environment, you need to provide an HDFS path for the CSV file (not a local path) when using Hive JDBC. When running through the Hive CLI, a local path works because the CLI takes care of uploading the file to HDFS before loading it into the table.
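A minimal sketch of that suggestion, using the path from the question (the staging directory and table name are hypothetical):

hdfs dfs -mkdir -p /user/cloudera/staging
hdfs dfs -put /home/cloudera/Desktop/test.csv /user/cloudera/staging/

Then issue the load over JDBC with the HDFS path instead of the local one:

LOAD DATA INPATH '/user/cloudera/staging/test.csv' INTO TABLE my_table;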

how to access hadoop hdfs with greenplum external table

Our data warehouse is based on Hive. Now we need to transfer data from Hive to Greenplum, and we want to use an external table with gphdfs, but it looks like something is going wrong.
The table creation script is:
CREATE EXTERNAL TABLE flow.http_flow_data(like flow.zb_d_gsdwal21001)
LOCATION ('gphdfs://mdw:8081/user/hive/warehouse/flow.db/d_gsdwal21001/prov_id=018/day_id=22/month_id=201202/data.txt')
FORMAT 'TEXT' (DELIMITER ' ');
When we run:
bitest=# select * from flow.http_flow_data limit 1;
ERROR: external table http_flow_data command ended with error. sh: java: command not found (seg12 slice1 sdw3:40000 pid=17778)
DETAIL: Command: gphdfs://mdw:8081/user/hive/warehouse/flow.db/d_gsdwal21001/prov_id=018/day_id=22/month_id=201202/data.txt
Our Hadoop is 1.0 and Greenplum is 4.1.2.1.
I want to know if we need to configure anything to make Greenplum able to access Hadoop.
Have you opened the port (8081) to listen for the month_id=201202 directory?
I would double-check the admin guide; I think you can use gphdfs, but not until Greenplum 4.2.
Have you checked to ensure that Java is installed on your Greenplum system? This is required in order for gphdfs to work.
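A quick way to verify that across the cluster, since the "java: command not found" in the error comes from a segment host (sdw3). gpssh ships with Greenplum; the host file path here is a placeholder:

gpssh -f /home/gpadmin/seg_hosts 'which java || echo "java missing on $(hostname)"'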
