Access buckets across projects in GCP using Hive/Hadoop

I have two projects in my GCP account and both of them have buckets.
In one of the projects, I have a Dataproc cluster on which I am running Hive.
From this Hive instance, I want to access the buckets of the other project.
I have tried giving ACL permissions on my bucket, but I still get an error when I execute a CREATE TABLE command from Hive, saying:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: java.io.IOException Error accessing: bucket: bucketname, object: folder/filename.
How can I access my bucket using Hive?

As suggested, I used the Cloud Storage connector, which comes pre-installed on the Dataproc cluster.
https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector
The steps there are precise, but in addition to that I had to grant the appropriate roles on the bucket to my service account.
https://cloud.google.com/storage/docs/access-control/iam-roles
It then worked.
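A rough sketch of those two steps, with placeholder names (the service account e-mail, role, bucket, table and columns below are all examples; by default a Dataproc cluster runs as its project's Compute Engine default service account, so that is the account that needs roles on the other project's bucket):

# Grant the cluster's service account access to the bucket in the other project
# (service account, role and bucket name are placeholders).
gsutil iam ch serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com:roles/storage.objectAdmin gs://bucketname

With that in place, Hive can address the bucket through the connector's gs:// scheme:

CREATE EXTERNAL TABLE example_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://bucketname/folder/';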

Related

AZURE_QUOTA_EXCEEDED_EXCEPTION, how to solve this?

When trying to start a warehouse in Azure Databricks SQL Analytics, I get a "Quota Exceeded" error while importing data to a dashboard.
You need to follow the documentation on increasing the corresponding quota for Azure.
But really, you need to deploy a new workspace that won't require public IPs for cluster nodes - so-called secure cluster connectivity, or No Public IP (doc). Unfortunately, right now you can't change this setting for an existing workspace.
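As a sketch of that second option: with the Azure CLI databricks extension a replacement workspace can be created with secure cluster connectivity enabled from the start (resource group, name, region and SKU are placeholders; double-check the flag names against your CLI version):

# Create a new Azure Databricks workspace with No Public IP enabled
az databricks workspace create --resource-group my-rg --name my-databricks-ws --location westeurope --sku premium --enable-no-public-ip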

Creating s3 external table in amazon EMR with remote metastore

We started using Amazon EMR recently for a new project (version emr-5.11.0). We made some architecture changes in the EMR cluster:
1) We moved the metastore to another Postgres instance instead of the default MySQL/Derby.
2) We run the metastore service on a different instance (which is not part of the Amazon EMR cluster) and made the necessary changes in hive-site.xml.
In EMR:
stop hive-hcatalog-server
On the new instance:
hive --service metastore
Everything is working as expected except S3 external tables. When I try to create an external S3 table, it gives an error like the one below:
message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
We also tried s3/s3n/s3a with credentials when creating the external table. If we run the metastore service inside the EMR master node and run the same query, it works without issues.
Do we need to do any configuration or add additional libraries on the metastore instance for this to work?
Note: the metastore instance has the latest Apache Hadoop and Hive binaries. We are going with the HDFS filesystem and are able to perform all operations other than external S3 tables. We tried everything from both Beeline and the Hive CLI.
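The error points at org.apache.hadoop.fs.s3a.S3AFileSystem, which is packaged in the hadoop-aws connector, so one thing worth checking is whether that jar and the AWS SDK are on the standalone metastore's classpath at all. A minimal sketch, assuming a stock Apache Hadoop/Hive layout on that instance (adjust HADOOP_HOME and HIVE_HOME to your paths):

# Copy the S3A connector and the AWS SDK shipped with Hadoop onto Hive's classpath,
# then restart the standalone metastore.
cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-*.jar $HIVE_HOME/lib/
cp $HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk*.jar $HIVE_HOME/lib/
hive --service metastore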

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have a Hadoop cluster running on EC2, with the EC2 instances attached to a role that has access to an S3 bucket, for example "stackoverflow-example".
Several users are submitting Spark jobs to the cluster. We used keys in the past but do not want to continue with that and want to migrate to a role, so that any job placed on the Hadoop cluster will use the role associated with the EC2 instances. I did a lot of searching and found 10+ tickets; some of them are still open, some are fixed, and some have no comments at all.
We want to know whether it's still possible to use an IAM role for jobs (Spark, Hive, HDFS, Oozie, etc.) placed on the Hadoop cluster. Most of the tutorials discuss passing keys (fs.s3a.access.key, fs.s3a.secret.key), which is not good enough and not secure either. We also faced issues with the credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363
The first one you link to, HADOOP-13277, asks "can we have IAM?", and the JIRA was closed with "you already have this in s3a". The second, HADOOP-9384, was "add IAM to s3n", closed as "switch to s3a". And SPARK-16363? An incomplete bug report.
If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.
That is it: it should just work.
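A quick sanity check, with no fs.s3a.access.key or fs.s3a.secret.key configured anywhere, is to list the bucket from one of the cluster nodes (bucket name taken from the question; any bucket the role can reach works):

# No keys supplied: the s3a client should fall back to the IAM role
# exposed through the EC2 instance metadata service.
hadoop fs -ls s3a://stackoverflow-example/

On newer Hadoop versions you can also pin fs.s3a.aws.credentials.provider to com.amazonaws.auth.InstanceProfileCredentialsProvider in core-site.xml to rule out stray keys elsewhere, but that is optional - the fallback is the default behaviour.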

Hive Create table over S3 in RIAK CS

I have the Hive service running on a Hadoop cluster. I'm trying to create a Hive table over Eucalyptus (RIAK CS) S3 data. I have configured the AccessKeyID and SecretAccessKey in core-site.xml and hive-site.xml. When I execute the CREATE TABLE command and specify the S3 location using the s3n scheme, I get the error below:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:org.apache.http.conn.ConnectTimeoutException: Connect to my-bucket.s3.amazonaws.com:443 timed out)
If I try using the s3a scheme, I get the error below:
FAILED: AmazonClientException Unable to load AWS credentials from any provider in the chain
I could change the endpoint URL for the distcp command using jets3t, but the same didn't work for Hive. Any suggestions on pointing Hive to the Eucalyptus S3 endpoint are welcome.
I'm not familiar with Hive, but as far as I've heard it uses MapReduce as its backend processing system. MapReduce uses jets3t as its S3 connector - changing its configuration worked for me in both MapReduce and Spark. Hope this helps: http://qiita.com/kuenishi/items/71b3cda9bbd1a0bc4f9e
Would configurations like the following work for you?
s3service.https-only=false
s3service.s3-endpoint=yourdomain.com
s3service.s3-endpoint-http-port=8080
s3service.s3-endpoint-https-port=8080
I have upgraded to HDP 2.3 (Hadoop 2.7) and now I'm able to configure the s3a scheme for Hive-to-S3 access.
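For anyone trying the same setup, the s3a counterparts of the jets3t settings above go into core-site.xml (or hive-site.xml); the endpoint and credentials here are placeholders for your RIAK CS / Eucalyptus installation:

fs.s3a.endpoint=http://yourdomain.com:8080
fs.s3a.connection.ssl.enabled=false
fs.s3a.access.key=YOUR_ACCESS_KEY
fs.s3a.secret.key=YOUR_SECRET_KEY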

How to run hive on google cloud dataproc from within the machine?

I've just created a Google Cloud Dataproc cluster. A few basic things are not working for me:
I'm trying to run the Hive console from the master node, but it fails to load with any user other than root (it looks like there's a lock; the console is just stuck).
But even when using root, I see some odd behaviour:
"show tables;" shows a table named "input"
querying the table raises an exception saying the table is not found.
It is not clear which user creates the tables through the web UI. I create a job and execute it, but then don't see the results through the console.
Couldn't find any good documentation on that - does anybody have an idea on this?
Running the hive command at present is somewhat broken due to the default metastore configuration.
I recommend you use the beeline client instead, which talks to the same HiveServer2 as Dataproc Hive jobs. You can use it via SSH by running beeline -u jdbc:hive2://localhost:10000 on the master.
YARN applications are submitted by HiveServer2 as the user "nobody"; you can specify a different user by passing the -n flag to beeline, but it shouldn't matter with the default permissions.
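For example, to run a single statement non-interactively as a specific user (the user name here is just an illustration):

# -n passes the user name for the connection; -e runs one statement and exits
beeline -u jdbc:hive2://localhost:10000 -n myuser -e "SHOW TABLES;"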
This thread is a bit old, but when someone searches for Google Cloud Platform and Hive this result comes up, so I'm adding some info which may be useful.
Currently, in order to submit a job to Google Dataproc there are - I think, as with most other products - three options:
from the UI
from the console, using a command line like the one below (a concrete example follows after this list):
gcloud dataproc jobs submit hive --cluster=CLUSTER (--execute=QUERY, -e QUERY | --file=FILE, -f FILE) [--async] [--bucket=BUCKET] [--continue-on-failure] [--jars=[JAR,…]] [--labels=[KEY=VALUE,…]] [--params=[PARAM=VALUE,…]] [--properties=[PROPERTY=VALUE,…]] [GLOBAL-FLAG …]
via a REST API call like: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit
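A concrete example of the command-line form (cluster name and region are placeholders; depending on your gcloud version, --region may be required or may default to global):

gcloud dataproc jobs submit hive --cluster=my-cluster --region=us-central1 -e "SHOW TABLES;"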
Hope this will be useful to someone.
