Using s3 as fs.default.name or HDFS? - amazon-ec2

I'm setting up a Hadoop cluster on EC2 and I'm wondering how to do the DFS. All my data is currently in S3 and all map/reduce applications use S3 file paths to access the data. Now I've been looking at how Amazon's EMR is set up and it appears that for each jobflow, a namenode and datanodes are set up. Now I'm wondering if I really need to do it that way or if I could just use s3(n) as the DFS? If doing so, are there any drawbacks?
Thanks!

In order to use S3 instead of HDFS, fs.default.name in core-site.xml needs to point to your bucket:
<property>
<name>fs.default.name</name>
<value>s3n://your-bucket-name</value>
</property>
It's recommended that you use S3N and NOT the plain S3 implementation, because files written via S3N are readable by any other application and by yourself :)
Also, in the same core-site.xml file you need to specify the following properties:
fs.s3n.awsAccessKeyId
fs.s3n.awsSecretAccessKey
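In core-site.xml these take the usual property form, e.g. (the values below are placeholders for your own AWS credentials):
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRET_ACCESS_KEY</value>
</property>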

Any intermediate data of your job goes to HDFS, so yes, you still need a namenode and datanodes.

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/core-default.xml
fs.default.name is deprecated; fs.defaultFS is the preferred name now.
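For reference, the equivalent entry with the newer key would look something like this (the bucket name is a placeholder):
<property>
<name>fs.defaultFS</name>
<value>s3n://your-bucket-name</value>
</property>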

I was able to get the s3 integration working using
<property>
<name>fs.default.name</name>
<value>s3n://your-bucket-name</value>
</property>
in core-site.xml and could list the files using the hdfs ls command. But should I also have a namenode and separate datanode configurations? I'm still not sure how the data gets partitioned across the data nodes.
Should we have local storage for the namenode and datanodes?

Related

Which processes need access to core-site.xml and hdfs-site.xml

The core-site.xml file informs the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes. Here, we can configure hdfs-site.xml to specify default block replication and permission checking on HDFS. The actual number of replications can also be specified when the file is created; the default is used if replication is not specified at create time.
I'm looking to understand which processes [Namenode, Datanode, HDFS client] need access to which of those configuration files?
Namenode: I presume it only needs hdfs-site.xml because it doesn't need to know its own location.
Datanode: I presume it needs access to both core-site.xml (to locate the namenode) and hdfs-site.xml (for various settings)?
HDFS client: I presume it needs access to both core-site.xml (to locate the namenode) and hdfs-site.xml (for various settings)?
Is that accurate?
The clients and server processes need access to both files.
If you use HDFS nameservices with highly available NameNodes, then the two NameNodes need to find each other.
Some comments:
core-site.xml and hdfs-site.xml are the two files used by external programs (such as NiFi) to access the cluster / WebHDFS API.
Edge nodes require both for cluster access.
Ambari will manage both of these along with all the others.
The three you listed all need access in order to run the cluster; at a bare minimum they set basic settings such as proxy settings and cluster access.
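As a rough illustration of why a client needs both files on its classpath, a minimal HDFS client might look like this (a sketch assuming the Hadoop client libraries are available; nothing here is specific to one distribution):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClientConfigDemo {
    public static void main(String[] args) throws Exception {
        // new Configuration() picks up core-site.xml from the classpath;
        // hdfs-site.xml is pulled in once the HDFS client classes load.
        Configuration conf = new Configuration();
        // The NameNode is resolved from fs.defaultFS (core-site.xml).
        FileSystem fs = FileSystem.get(conf);
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}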

Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.
I have explored gsutil and it seems that it is the recommended option to move big data sets to GCS, and that it can handle huge datasets. However, it appears that gsutil can only move data from the local machine to GCS (or between S3 and GCS); it cannot move data directly from a local Hadoop cluster.
What is the recommended way of moving data from a local Hadoop cluster to GCS?
In the case of gsutil, can it directly move data from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy the files onto the machine running gsutil and then transfer them to GCS?
What are the pros and cons of using Google Client Side (Java API) libraries vs GSUtil?
Thanks a lot,
Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.
Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:
cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/
You may need to also add the following to your hadoop/conf/hadoop-env.sh file if you're running 0.20.x:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar
Then, you'll likely want to use service-account "keyfile" authentication since you're on an on-premise Hadoop cluster. Visit your cloud.google.com/console, find "APIs & auth" on the left-hand side, and click "Credentials". If you don't already have one, click "Create new Client ID" and select "Service account" before clicking "Create Client ID". For now, the connector requires a ".p12" type of keypair, so click "Generate new P12 key" and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g.:
cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12
Add the following entries to your core-site.xml file in your Hadoop conf dir:
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>your-ascii-google-project-id</value>
</property>
<property>
<name>fs.gs.system.bucket</name>
<value>some-bucket-your-project-owns</value>
</property>
<property>
<name>fs.gs.working.dir</name>
<value>/</value>
</property>
<property>
<name>fs.gs.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>your-service-account-email@developer.gserviceaccount.com</value>
</property>
<property>
<name>fs.gs.auth.service.account.keyfile</name>
<value>/path/to/hadoop/conf/gcskey.p12</value>
</property>
The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files; you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
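For example, reusing the placeholder paths from above, a distcp run could look something like:
hadoop distcp hdfs://yourhost:yourport/allyourdata gs://your-bucket/allyourdata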
Question 2: While the GCS connector for Hadoop is able to copy direct from HDFS without ever needing an extra disk buffer, GSUtil cannot since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or as you said, GCS/S3 files.
Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc., but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open-source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java
It looks like a few property names have changed in recent versions:
// hadoopConfiguration is an org.apache.hadoop.conf.Configuration instance
String serviceAccount = "service-account@test.gserviceaccount.com";
String keyfile = "/path/to/local/keyfile.p12";
hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true);
hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);

Do configuration properties in hdfs-site.xml apply to the NameNode in Hadoop?

I recently set up a test environment Hadoop cluster - one master and two slaves.
The master is NOT a DataNode (although some use the master node as both master and slave).
So basically I have 2 datanodes. The default configuration for replication is 3.
Initially, I did not change any configuration in conf/hdfs-site.xml, and I was getting the error "could only be replicated to 0 nodes, instead of 1".
I then changed the configuration in conf/hdfs-site.xml on both my master and slaves as follows:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
and lo! everything worked fine.
My question is: does this configuration apply to the NameNode or the DataNode? I changed hdfs-site.xml on all my DataNodes and the NameNode.
If my understanding is correct, the NameNode allocates blocks to the DataNodes, so the replication configuration on the master/NameNode is what matters and is probably not needed on the DataNodes. Is this correct?
I am confused about the actual purpose of the different XML files in the Hadoop framework. From my limited understanding:
1) core-site.xml - configuration parameters for the entire framework, such as where the log files should go, what the default name of the filesystem is, etc.
2) hdfs-site.xml - applies to individual datanodes: how many replications, the data dir in the local filesystem of the datanode, the block size, etc.
3) mapred-site.xml - applies to the datanode and gives the configuration for the task tracker.
Please correct me if this is wrong. These configuration files are not well explained in the tutorials I followed, so this comes from my own reading of these files and their defaults.
This is my understanding and I may be wrong.
hdfs-site.xml - is for the properties of HDFS (Hadoop Distributed File System)
mapred-site.xml - is for the properties of MapReduce
core-site.xml - is for other properties which touch both HDFS and MapReduce
This is usually caused by insufficient space.
Please check the total capacity of your cluster and the used/remaining ratio using
hdfs dfsadmin -report
Also check dfs.datanode.du.reserved in hdfs-site.xml, and whether this value is larger than your remaining capacity.
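For reference, that entry looks like this (the value is bytes reserved per volume; 10 GB here is only an example):
<property>
<name>dfs.datanode.du.reserved</name>
<value>10737418240</value>
</property>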
Look for other possible causes explained here.

Hadoop HA Namenode remote access

I'm configuring the Hadoop 2.2.0 stable release with an HA NameNode but I don't know how to configure remote access to the cluster.
I have the HA NameNode configured with manual failover and I defined dfs.nameservices, and I can access HDFS via the nameservice from all the nodes included in the cluster, but not from outside.
I can perform operations on HDFS by contacting the active NameNode directly, but I don't want that; I want to contact the cluster and then be redirected to the active NameNode. I think this is the normal configuration for an HA cluster.
Does anyone know how to do that?
(thanks in advance...)
You have to add more values to hdfs-site.xml:
<property>
<name>dfs.ha.namenodes.myns</name>
<value>machine-98,machine-99</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-98</name>
<value>machine-98:8100</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-99</name>
<value>machine-145:8100</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-98</name>
<value>machine-98:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-99</name>
<value>machine-145:50070</value>
</property>
You need to contact one of the NameNodes (as you're currently doing) - there is no "cluster" node to contact.
The Hadoop client code knows the addresses of the two NameNodes (from hdfs-site.xml) and can identify which is the active and which is the standby. There might be a way by which you can interrogate a ZooKeeper node in the quorum to identify the active / standby (maybe, I'm not sure) but you might as well check one of the NameNodes - you have a 50/50 chance it's the active one.
I'd have to check, but you might be able to query either if you're just reading from HDFS.
For the active NameNode you can always ask ZooKeeper.
You can get the active NameNode from the ZooKeeper path below:
/hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock
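For example, assuming a ZooKeeper server at zkhost:2181 (a placeholder), you could inspect that znode with the ZooKeeper CLI, or simply ask HDFS with haadmin using one of the NameNode IDs from the answer above:
zkCli.sh -server zkhost:2181 get /hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock
hdfs haadmin -getServiceState machine-98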
There are two ways to resolve this situation (in Java code):
1) Use core-site.xml and hdfs-site.xml in your code: load the conf via addResource.
2) Use conf.set in your code: set the Hadoop conf via conf.set.
An example using conf.set:
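A minimal sketch of that approach, reusing the nameservice "myns" and the addresses from the answer above (all values are placeholders to adapt to your cluster):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HaHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Alternative (option 1): conf.addResource(new Path("/path/to/core-site.xml"));
        conf.set("fs.defaultFS", "hdfs://myns");
        conf.set("dfs.nameservices", "myns");
        conf.set("dfs.ha.namenodes.myns", "machine-98,machine-99");
        conf.set("dfs.namenode.rpc-address.myns.machine-98", "machine-98:8100");
        conf.set("dfs.namenode.rpc-address.myns.machine-99", "machine-145:8100");
        // The failover proxy provider is what redirects the client to the active NameNode.
        conf.set("dfs.client.failover.proxy.provider.myns",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        FileSystem fs = FileSystem.get(URI.create("hdfs://myns"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}
With the failover proxy provider configured, the client tries each NameNode and uses whichever reports itself as active, which gives the "contact the cluster and be redirected" behaviour asked about.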

Simulating Map-reduce using Cloudera

I want to use Cloudera to simulate a Hadoop job on a single machine (of course with many VMs). I have 2 questions:
1) Can I change the replication policy of HDFS in cloudera?
2) Can I see cpu usage of each VMs?
You can use hadoop fs -setrep to change the replication factor on any file. You can also change the default replication factor by adding the following to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
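Going back to hadoop fs -setrep, usage looks something like this (the paths are placeholders):
hadoop fs -setrep -w 2 /path/to/file
hadoop fs -setrep -R -w 2 /path/to/dir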
You'll have to log into each box and use top to see the cpu usage of each VM. There is nothing out of the box in Hadoop that lets you see this.
I found out that I can change the data replication policy by changing "ReplicationTargetChooser.java".
