Connect Hadoop cluster to multiple Google Cloud Storage buckets in multiple Google Projects - hadoop

Is it possible to connect my Hadoop cluster to multiple Google Cloud Projects at once?
I can easily use any Google Storage bucket in a single Google Project via the Google Cloud Storage Connector, as explained in this thread: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage. But I can't find any documentation or example of how to connect to two or more Google Cloud Projects from a single map-reduce job. Do you have any suggestion/trick?
Thanks a lot.

Indeed, it is possible to connect your cluster to buckets from multiple different projects at once. Ultimately, if you're following the instructions for using a service-account keyfile, the GCS requests are performed on behalf of that service account, which can be treated more or less like any other user. You can either add the service account email your-service-account-email@developer.gserviceaccount.com to all the different cloud projects owning the buckets you want to process, using the permissions section of cloud.google.com/console and simply adding that email address like any other member, or you can grant access at the GCS level, adding that service account like any other user.
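To illustrate, once the cluster authenticates with a single service-account keyfile, the same credentials can read buckets owned by different projects. A minimal sketch using the Hadoop FileSystem API; the exact property names depend on the GCS connector version, and the keyfile path and bucket names (bucket-in-project-a, bucket-in-project-b) are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrossProjectGcsList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Authenticate as one service account (property names vary by connector version).
        conf.set("google.cloud.auth.service.account.enable", "true");
        conf.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile.json");

        // Hypothetical buckets owned by two different projects; both are readable
        // once the service account has been granted access on each project or bucket.
        for (String bucket : new String[] {"gs://bucket-in-project-a/", "gs://bucket-in-project-b/"}) {
            FileSystem fs = FileSystem.get(new URI(bucket), conf);
            for (FileStatus status : fs.listStatus(new Path(bucket))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```

The same keyfile works for both buckets because GCS authorizes the service account itself, not the project the cluster happens to run in.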

Related

How can I securely transfer my data from on-prem HDFS to Google Cloud Storage?

I have a bunch of data in an on-prem HDFS installation. I want to move some of it to Google Cloud (Cloud Storage) but I have a few concerns:
How do I actually move the data?
I am worried about moving it over the public internet
What is the best way to move data securely from my HDFS store to Cloud Storage?
To move data from an on-premises Hadoop cluster to Google Cloud Storage, you should probably use the Google Cloud Storage connector for Hadoop. You can install the connector in any cluster by following the install directions. As a note, Google Cloud Dataproc clusters have the connector installed by default.
Once the connector is installed, you can use DistCp to move the data from your HDFS to Cloud Storage. This will transfer data over the (public) internet unless you have a dedicated interconnect set up with Google Cloud. If that is a concern, you can route the transfer through a Squid proxy and configure the Cloud Storage connector to use it, as in the sketch below.
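As a rough illustration (the proxy address, namenode address, bucket, and paths are all hypothetical): the proxy is just a connector property, and a one-off copy can be driven through the Hadoop FileSystem API, while bulk transfers would go through DistCp.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToGcsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route the GCS connector's traffic through a Squid proxy (hypothetical host:port).
        conf.set("fs.gs.proxy.address", "squid-proxy.internal:3128");

        FileSystem hdfs = FileSystem.get(new URI("hdfs://namenode:8020/"), conf);
        FileSystem gcs = FileSystem.get(new URI("gs://example-target-bucket/"), conf);

        // Small, single-process copy; for large transfers run DistCp instead
        // (hadoop distcp hdfs:///data/export gs://example-target-bucket/export),
        // which distributes the copy as a MapReduce job and picks up the same proxy setting.
        FileUtil.copy(hdfs, new Path("/data/export"),
                      gcs, new Path("/export"),
                      false /* do not delete the source */, conf);
    }
}
```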

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have a Hadoop cluster running on EC2, and the EC2 instances are attached to an IAM role that has access to an S3 bucket, for example "stackoverflow-example".
Several users submit Spark jobs to the cluster. We used access keys in the past but do not want to continue with them; we want to migrate to the role, so any job placed on the Hadoop cluster will use the role associated with the EC2 instances. I did a lot of searching and found 10+ tickets; some are still open, some are fixed, and some have no comments at all.
I want to know whether it's still possible to use the IAM role for jobs (Spark, Hive, HDFS, Oozie, etc.) placed on the Hadoop cluster. Most of the tutorials discuss passing keys (fs.s3a.access.key, fs.s3a.secret.key), which is not good enough and not secure either. We also faced issues with the credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363
The first one you link to, HADOOP-13277, essentially asks "can we have IAM?", and the JIRA was closed with "you already have this in s3a". The second, HADOOP-9384, asked to "add IAM to s3n" and was closed as "switch to s3a". And SPARK-16363? An incomplete bug report.
If you use s3a and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server and try to get the credentials from there.
That's it: it should just work.
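For instance, a job touching the example bucket from the question simply leaves the key properties unset. A sketch, assuming hadoop-aws and the AWS SDK jars are on the classpath and the instance's IAM role grants access to that bucket:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aWithInstanceProfile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Deliberately no fs.s3a.access.key / fs.s3a.secret.key here: the default
        // s3a credential chain falls back to the EC2 instance metadata service,
        // so the job uses whatever the instance's IAM role permits.
        FileSystem fs = FileSystem.get(new URI("s3a://stackoverflow-example/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}
```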

Load balancing settings via Spring AWS libraries for multiple RDS read-only replicas

If there are multiple read replicas, where can load-balancing-related settings be specified when using the Spring AWS libraries?
Read replicas each have their own endpoint address, just like the original RDS instance. Your application will need to take care of using all the replicas and switching between them. You'd need to introduce this routing logic into your application so it automatically detects which RDS instance it should connect to in turn (see the sketch after the link below). The following link can help:
http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.Replication.html#Overview.ReadReplica
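One way to do that switching in plain Spring (independent of the AWS libraries) is a routing DataSource that round-robins across replica endpoints. This is only a sketch, not a drop-in solution: the class name, the replica keys, and the underlying DataSource beans pointing at each replica endpoint are assumed to be defined elsewhere in your application.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

/** Routes each connection checkout to the next read replica in turn. */
public class RoundRobinReplicaDataSource extends AbstractRoutingDataSource {

    private final AtomicInteger counter = new AtomicInteger();
    private final List<Object> replicaKeys;

    public RoundRobinReplicaDataSource(Map<Object, Object> replicaDataSources) {
        // Keys are arbitrary names (e.g. "replica-1"); values are DataSources
        // configured with each replica's own endpoint address.
        this.replicaKeys = List.copyOf(replicaDataSources.keySet());
        setTargetDataSources(replicaDataSources);
        afterPropertiesSet();
    }

    @Override
    protected Object determineCurrentLookupKey() {
        int index = Math.floorMod(counter.getAndIncrement(), replicaKeys.size());
        return replicaKeys.get(index);
    }
}
```

Only read-only sessions should be routed through something like this; write traffic still has to go to the primary instance's endpoint.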

Downloading files from Google Cloud Storage straight into HDFS and Hive tables

I'm working on the Windows command line, as problems with Unix and firewalls prevent gsutil from working. I can read my Google Cloud Storage files and copy them over to other buckets (which I don't need to do). What I'm wondering is how to download them directly into HDFS (which I'm SSHing into). Has anyone done this? Ideally this is part one; part two is creating Hive tables for the Google Cloud Storage data so we can use HiveQL and Pig.
You can use the Google Cloud Storage connector, which provides an HDFS-API-compatible interface to data already in Google Cloud Storage, so you don't even need to copy it anywhere; just read from and write directly to your Google Cloud Storage buckets/objects.
Once you set up the connector, you can also copy data between HDFS and Google Cloud Storage with the hdfs tool, if necessary.
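For instance, once gs:// paths resolve on the cluster, that copy is just an ordinary filesystem command. A sketch driving the equivalent of `hadoop fs -cp` from Java (the bucket and paths are placeholders); a Hive external table can then point its LOCATION at either the gs:// path or the HDFS path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class GcsToHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to: hadoop fs -cp gs://example-bucket/raw/events.csv hdfs:///user/hive/warehouse/raw/
        int exitCode = ToolRunner.run(conf, new FsShell(conf), new String[] {
                "-cp", "gs://example-bucket/raw/events.csv", "hdfs:///user/hive/warehouse/raw/"});
        System.exit(exitCode);
    }
}
```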

How can I change the default bucket of a Google-Cloud-based Hadoop-enabled cluster after I created it?

After I create a Google-Cloud-based Hadoop-enabled cluster, I want to change the default bucket to a different one. How can I do that? I can't find the answer in the Google Cloud documentation. Thanks!
Did you create a cluster by hand, using bdutil, using Cloud Dataproc or through some other means?
bdutil
If you used bdutil, see the choose a default file system section in the setup documentation.
Cloud Dataproc
If you used Cloud Dataproc, you can access any bucket to which your project has permission by using the gs:// URI. If you want to connect your cluster to a new bucket for logs, you will have to create a new cluster, unfortunately.
Other method
If you used a different method, like the "click to deploy" launcher, I recommend you give Dataproc or bdutil a try.
