Configure Hadoop to use S3 requester-pays-enabled - hadoop

I'm using Hadoop (via Spark), and need to access S3N content which is requester-pays. Normally, this is done by enabling httpclient.requester-pays-buckets-enabled = true in jets3t.properties. Yet, I've set this and Spark / Hadoop are ignoring it. Perhaps I'm putting the jets3t.properties in the wrong place (/usr/share/spark/conf/). How can I get Hadoop / Spark / JetS3t to access requestor-pays buckets?
UPDATE: This is needed if you are outside Amazon EC2. Within EC2, Amazon doesn't require requester-pays. So, a crude workaround is to run out of EC2.

The Spark system is made up of several JVMs (application, master, workers, executors), so setting properties can be tricky. You could use System.getProperty() before the file operation to check if the JVM where the code runs has loaded the right config. You could even use System.setProperty() to directly set it at that point instead of figuring out the config files.

Environment variables and config files didn't work, but some manual code did: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PUTTHEKEYHERE")

Related

Plain vanilla Hadoop installation vs Hadoop installation using Ambari

I recently downloaded hadoop distribution from Apache and got it up and running quite fast; download the hadoop tar ball, untar it at a location and some configuration setting. The thing here is that I am able to see the various configuration files like: yarn-site.xml, hdfs-site.xml etc; and I know the hadoop home location.
Next, I installed hadoop (HDP) Using Ambari.
Here comes the confusion part. It seems Ambarin installs the hdp in /usr/hdp; however the directory structure in plain vanilla hadoop vs Ambari is totally different. I am not able to locate the configuration files e.g. yarn-site.xml etc.
So can anyone help me demystify this?
All the configuration changes must be done via the Ambari UI. There is no use for the configuration files since Ambari persists the configurations in Ambari Database.
If you still need them, they are under /etc/hadoop/conf/.
It's true that configuration changes must be made via Ambari UI and that those configurations are stored in a database.
Why is it necessary to change these configuration properties in Ambari UI and not directly on disk?
Every time a service is restarted and it has a stale configuration the ambari-agent is responsible for writing the latest configuration to disk. They are written to /etc/<service-name>/conf. If you were to make changes directly to the configuration files on disk they would get overwritten by the aforementioned process.
However the configuration files found on disk DO still have a use...
The configuration files (on disk) are used by the various hadoop daemons when they're started/running.
Basically the benefit of using Ambari UI in Cluster Hadoop deployment. It will give you central management point.
For example:
10 pcs Hadoop cluster setup.
Plain vanilla Hadoop:
If you change any configuration you must be changed in 10 pcs
Ambari UI :
Due to configuration store in db. you just change in management portal all changes effect reflected on all node by single point change.

Make spark environment for cluster

I made a spark application that analyze file data. Since input file data size could be big, It's not enough to run my application as standalone. With one more physical machine, how should I make architecture for it?
I'm considering using mesos for cluster manager but pretty noobie at hdfs. Is there any way to make it without hdfs (for sharing file data)?
Spark maintain couple cluster modes. Yarn, Mesos and Standalone. You may start with the Standalone mode which means you work on your cluster file-system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark built-in scripts that loads Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-In order to validate that the worker was added to the cluster, you may refer to the following URL: http://localhost:8080 on your master machine and get Spark UI that shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Hadoop distcp command using a different S3 destination

I am using a Eucalyptus private cloud on which I have set up an CDH5 HDFS. I would like to backup my HDFS to the Eucalyptus S3. The classic way to use distcp as suggested here: http://wiki.apache.org/hadoop/AmazonS3 , ie hadoop distp hdfs://namenode:9000/user/foo/data/fil1 s3://$AWS_ACCESS_KEY:$AWS_SECRET_KEY#bucket/key doesn't work.
It seems that hadoop is pre-configured with an S3 location on Amazon and I cannot find where is this configuration in order to change this to the IP address of my S3 service running on Eucalyptus. I would expect to be able to just change the uri of S3 in the same way you can change your NameNode uri when using an hdfs:// prefix. But is seems this is not possible... Any insights?
I have already found workarounds for transferring my data. In particular the s3cmd tools here: https://github.com/eucalyptus/eucalyptus/wiki/HowTo-use-s3cmd-with-Eucalyptus and the s3curl scripts here: aws.amazon.com/developertools/Amazon-S3/2880343845151917 work just fine but I would prefer if I could transfer my data using map-reduce with the distcp command.
It looks like hadoop is using the jets3t library for S3 access. You might be able to use the configuration described in this blog to access eucalyptus, but note that for version 4 onwards the path is "/services/objectstorage" rather than "/services/Walrus".

How To Sync the contents to all nodes

I was following this tutorial trying to install and configure Spark on my cluster..
My cluster (5 nodes) is hosted on AWS and installed from Cloudera Manager.
It is mentioned in the tutorial that "Sync the contents of /etc/spark/conf to all nodes." after the modification of the configuration file.
I am really wondering what is the easies way to make that happen. I read a post that has a similar question like mine HERE. Based on my understanding, for the configuration files of hadoop, hdfs ...etc. which are monitored by zookeeper or cloudera manager. That might be the case to use CM deploy or zookeeper to make it happen.
However, Spark's configuration file is totally outside zookeeper's scope. How can I "sync" to other nodes..
Many thanks!
Why don't you mount the same EBS via NFS to /etc/spark/conf or one of it's parents so the files are automatically synced?

Access hdfs from outside hadoop

I want to run some executables outside of hadoop (but on the same cluster) using input files that are stored inside HDFS.
Do these files need to be copied locally to the node? or is there a way to access HDFS outside of hadoop?
Any other suggestions on how to do this are fine. Unfortunately my executables can not be run within hadoop though.
Thanks!
There are a couple typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java. You are probably looking for open. This will give you a stream that acts like a generic open file.
You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could hypothetically create a bridge with this command line command with something like popen.
Also check WebHDFS which made into the 1.0.0 release and will be in the 23.1 release also. Since it's based on rest API, any language can access it and also Hadoop need not be installed on the node on which the HDFS files are required. Also. it's equally fast as the other options mentioned by orangeoctopus.
The best way is install "hadoop-0.20-native" package on the box where you are running your code.
hadoop-0.20-native package can access hdfs filesystem. It can act as a hdfs proxy.
I had similar issue and asked appropriate question. I needed to access HDFS / MapReduce services outside of cluster. After I found solution I posted answer here for HDFS. Most painfull issue there happened to be user authentication which in my case was solved in most simple case (complete code is in my question).
If you need to minimize dependencies and don't want to install hadoop on clients here is nice Cloudera article how to configure Maven to build JAR for this. 100% success for my case.
Main difference in Remote MapReduce job posting comparing to HDFS access is only one configuration setting (check for mapred.job.tracker variable).

Resources