Hive Create table over S3 in RIAK CS - hadoop

I have the Hive service running on a Hadoop cluster. I'm trying to create a Hive table over Eucalyptus (Riak CS) S3 data. I have configured the AccessKeyID and SecretAccessKey in core-site.xml and hive-site.xml. When I execute the CREATE TABLE command and specify the S3 location using the s3n scheme, I get the error below:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:org.apache.http.conn.ConnectTimeoutException: Connect to my-bucket.s3.amazonaws.com:443 timed out)
If I try using the s3a schema, I get the below error:
FAILED: AmazonClientException Unable to load AWS credentials from any provider in the chain
I was able to change the endpoint URL for the distcp command using jets3t configuration, but the same approach didn't work for Hive. Any suggestions for pointing Hive at the Eucalyptus S3 endpoint are welcome.
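For reference, the credential properties I set in core-site.xml look like this (placeholder values; fs.s3n.* are the standard Hadoop property names for the s3n scheme):
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>MY_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>MY_SECRET_KEY</value>
</property>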

I'm not familiar with Hive, but as far as I know it uses MapReduce as its backend processing system. MapReduce uses jets3t as its S3 connector, and changing the jets3t configuration worked for me in both MapReduce and Spark. Hope this helps: http://qiita.com/kuenishi/items/71b3cda9bbd1a0bc4f9e
Would settings like the following in jets3t.properties work for you?
s3service.https-only=false
s3service.s3-endpoint=yourdomain.com
s3service.s3-endpoint-http-port=8080
s3service.s3-endpoint-https-port=8080
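Note that jets3t reads these settings from a jets3t.properties file on the classpath, so one way to apply them for Hive and MapReduce (a sketch; exact conf paths depend on your install) is:
# assumption: these conf directories are on the classpath of the Hive and MapReduce JVMs
cp jets3t.properties /etc/hadoop/conf/
cp jets3t.properties /etc/hive/conf/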

I have upgraded to HDP 2.3 (Hadoop 2.7), and I'm now able to configure the s3a scheme for Hive-to-S3 access.
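For anyone hitting the same issue, the s3a settings in core-site.xml for a non-AWS endpoint look roughly like this (a sketch with placeholder values; property names as of Hadoop 2.7, so verify against your version):
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.yourdomain.com:8080</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>MY_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>MY_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>false</value>
</property>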

Related

How to send beeline output to sqoop

I am struggling to send beeline output to the Apache Sqoop tool. I understand that Apache Sqoop can read data from where it sits on the Hadoop cluster, but beeline queries data and writes the output to the machine where the Hadoop client is running.
Is it possible to send beeline output directly to the Hadoop cluster, or to instruct Apache Sqoop to read data from a machine where the Hadoop client is not installed?
Could you please elaborate on the requirement?
I assume that you are running a query on Hive via beeline. Do you want to move the result into the same cluster's HDFS storage, or into a different cluster/filesystem?
If it's the same cluster, then you can create an external table and run an insert-select query, as sketched below. If not, please elaborate on the use case.
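A minimal sketch of that insert-select approach, assuming a hypothetical source table events and an HDFS export path:
CREATE EXTERNAL TABLE events_export (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/hive/export/events_export';

INSERT OVERWRITE TABLE events_export
SELECT id, payload FROM events;
Sqoop can then export straight from that HDFS directory, so the data never lands on the client machine (connection string, table, and user are placeholders):
sqoop export --connect jdbc:mysql://dbhost/mydb --username dbuser --table events --export-dir /user/hive/export/events_export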

Connecting Apache Superset with Hive

I have my Hadoop cluster running in an AWS environment, where the schema is mapped in Hive, and I can see the complete data in Hive.
Now, here is the problem: I am trying to connect my Hive to Superset, but I can't get the connection to work.
This is how I have provided my URI:
jdbc+hive://MYIP:PORT
Also tried:
hive://username:password@MYIP:PORT
Make sure HiveServer2 is up and running.
You can also try this one:
hive://hostname:10000/default?auth=NOSASL
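If it still fails, it's worth first checking that HiveServer2 is actually reachable on that port (assuming the default port 10000):
beeline -u jdbc:hive2://hostname:10000/default
Also note that Superset's hive:// URIs are handled by the PyHive SQLAlchemy dialect, so PyHive must be installed in the environment where Superset runs (assuming a pip-based install):
pip install pyhive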

Access buckets across projects in gcp using hive

I have two projects in my GCP account, and both of them have buckets.
In one of the projects, I have a Dataproc cluster on which I am running Hive.
From this Hive, I want to access the buckets of the other project.
I have tried granting ACL permissions on my bucket, but I still get an error when I execute a CREATE TABLE command from Hive, saying:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException Error accessing: bucket: bucketname, object: folder/filename.
How can I access my bucket using Hive?
As suggested, I used the Google Cloud Storage connector, which comes pre-installed on the Dataproc cluster.
https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector
The steps there are precise, but in addition to that, I had to grant the appropriate roles on the bucket to my service account.
https://cloud.google.com/storage/docs/access-control/iam-roles
It then worked.
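Granting the role looked roughly like this (a sketch; the service account email is a placeholder, and roles/storage.objectViewer covers reads, while writes need something like roles/storage.objectAdmin):
gsutil iam ch serviceAccount:my-cluster-sa@my-project.iam.gserviceaccount.com:roles/storage.objectViewer gs://bucketname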

Creating s3 external table in amazon EMR with remote metastore

We started using Amazon EMR recently for a new project (version emr-5.11.0). We made some architecture changes in the EMR cluster:
1) We moved the metastore to a separate Postgres instance instead of the default MySQL/Derby.
2) We run the metastore service on a different instance (which is not part of the Amazon EMR cluster) and made the necessary changes in hive-site.xml.
In EMR:
stop hive-hcatalog-server
On the new instance:
hive --service metastore
Everything is working as expected except S3 external tables. When I try to create an external S3 table, it gives an error like the one below:
message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
We tried s3/s3n/s3a, with credentials as well, when creating the external table. If we run the metastore service inside the EMR master node and run the same query, it works without issues.
Do we need to do any configuration, or add additional libraries, on the metastore instance for this to work?
Note: The metastore instance has the latest Apache Hadoop and Hive binaries. We are using the HDFS filesystem and are able to perform all operations other than creating external S3 tables. We tried everything from both beeline and the Hive CLI.
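To make the question concrete: is something like the following needed on the metastore instance? This is a sketch assuming a stock Apache Hadoop layout (jar locations and versions vary by release):
# hadoop-aws provides org.apache.hadoop.fs.s3a.S3AFileSystem; the aws-java-sdk jar is its dependency
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-*.jar $HIVE_HOME/lib/
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-*.jar $HIVE_HOME/lib/
hive --service metastore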

How to start spark (with thrift server) in non-blocking mode so that hive can update and reload data into spark (table locking)

We are having problems with table locking. We need simultaneous access to tables from Hive and Spark (with Thrift Server). However, running Spark with Thrift Server results in a table lock.
We're running on an Amazon AWS EMR cluster with Hive, Spark, and Thrift Server 2.
We'd like to use Hive to update an S3 store and periodically load this aggregated data into Spark in the background. Spark, meanwhile, is always on with Thrift Server loaded and has the same data loaded from S3, in order to do real-time aggregations on it. Spark does not need write access to this data.
The problem is that running the periodic data-loading tasks in Hive results in the job freezing.
We think the metastore may be locked by Spark / Thrift Server, blocking Hive from updating and reloading data into Spark. (But we're not sure about this.)
Is it possible to start Spark and Thrift Server in a read-only, non-blocking mode?
What might cause the problem? Has anyone experienced similar problems?
How is your metastore configured? Does it use Derby?
With the default configuration it uses Derby, which does not support multiple concurrent users.
If so, you should change it to use something like MySQL, which does support multiple users.
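A sketch of what that change typically looks like in hive-site.xml (host and credentials are placeholders; the javax.jdo.* names are the standard metastore properties):
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepass</value>
</property>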
