Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS? - hadoop

I'm exploring the options for running a hadoop application on a local system.
As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.
The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.
The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.
So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.
Has anyone tried this before?
Can it work or is this idea much too far off the intended use?
Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?
Thanks for your insights.
EDIT 2:
This is the config I created for hadoop 0.18.3
conf/hadoop-site.xml using the answer provided by bajafresh4life.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:33301</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>localhost:33302</value>
<description>
The job tracker http server address and port the server will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>localhost:33303</value>
<description>
The task tracker http server address and port.
If the port is 0 then the server will start on a free port.
</description>
</property>
</configuration>

Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.
Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh . You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.

Related

Hadoop client and cluster separation

I am a newbie in hadoop, linux as well. My professor asked us to seperate Hadoop client and cluster using port mapping or VPN. I don't understand the meaning of such separation. Can anybody give me a hint?
Now I get the idea of cluster client separation. I think it is required that hadoop is also installed in the client machine. When the client submit a hadoop job, it is submit to the masters of the clusters.
And I have some naiive ideas:
1.Create a client machine and install hadoop .
2.set fs.default.name to be hdfs://master:9000
3.set dfs.namenode.name.dir to be file://master/home/hduser/hadoop_tmp/hdfs/namenode
Is it correct?
4.Then I don't know how to set the dfs.namenode.name.dir and other configurations.
5.I think the main idea is to set the configuration files to make the job run in hadoop clusters, but I don't know how to do it exactly.
First of all.. this link has detailed information on how client communcates with namenode
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
To my understanding, your professor wants to have a separate node as client from which you can run hadoop jobs but that node should not be part of the hadoop cluster.
Consider a scenario where you have to submit Hadoop job from client machine and client machine is not part of existing Hadoop cluster. It is expected that job to be get executed on Hadoop cluster.
Namenode and Datanode forms Hadoop Cluster, Client submits job to Namenode.
To achieve this, Client should have same copy of Hadoop Distribution and configuration which is present at Namenode.
Then Only Client will come to know on which node Job tracker is running, and IP of Namenode to access HDFS data.
Go through configuration on Namenode,
core-site.xml will have this property-
<property>
<name>fs.default.name</name>
<value>192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property-
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:8021</value>
</property>
These are two important properties must be copied to client machine’s Hadoop configuration.
And you need to set one addtinal property in mapred-site.xml file, to overcome from Privileged Action Exception.
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
Also you need to update /ets/hosts of client machine with IP addresses and hostnames of namenode and datanode.
Now you can submit job from client machine with hadoop jar command, and job will be executed on Hadoop Cluster. Note that, you shouldn’t start any hadoop service on client machine.
Users shouldn't be able to disrupt the functionality of the cluster. That's the meaning. Imagine there is a whole bunch of data scientists that launch their jobs from one of the cluster's masters. In case someone launches a memory-intensive operation, the master processes that are running on the same machine could end up with no memory and crash. That would leave the whole cluster in a failed state.
If you separate client node from master/slave nodes, users could still crash the client, but the cluster would stay up.

Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.
I have explored GSUtil and it seems that it is the recommended option to move big data sets to GCS. It seems that it can handle huge datasets. It seems though that GSUtil can only move data from Local machine to GCS or S3<->GCS, however cannot move data from local Hadoop cluster.
What is a recommended way of moving data from local Hadoop cluster to GCS ?
In case of GSUtil, can it directly move data from local Hadoop cluster(HDFS) to GCS or do first need to copy files on machine running GSUtil and then transfer to GCS?
What are the pros and cons of using Google Client Side (Java API) libraries vs GSUtil?
Thanks a lot,
Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.
Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:
cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/
You may need to also add the following to your hadoop/conf/hadoop-env.sh file if youre running 0.20.x:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar
Then, you'll likely want to use service-account "keyfile" authentication since you're on an on-premise Hadoop cluster. Visit your cloud.google.com/console, find APIs & auth on the left-hand-side, click Credentials, if you don't already have one click Create new Client ID, select Service account before clicking Create client id, and then for now, the connector requires a ".p12" type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g:
cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12
Add the following entries to your core-site.xml file in your Hadoop conf dir:
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
<name>fs.gs.project.id</name>
<value>your-ascii-google-project-id</value>
</property>
<property>
<name>fs.gs.system.bucket</name>
<value>some-bucket-your-project-owns</value>
</property>
<property>
<name>fs.gs.working.dir</name>
<value>/</value>
</property>
<property>
<name>fs.gs.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>your-service-account-email#developer.gserviceaccount.com</value>
</property>
<property>
<name>fs.gs.auth.service.account.keyfile</name>
<value>/path/to/hadoop/conf/gcskey.p12</value>
</property>
The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files, you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
Question 2: While the GCS connector for Hadoop is able to copy direct from HDFS without ever needing an extra disk buffer, GSUtil cannot since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or as you said, GCS/S3 files.
Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc, but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open-source, you can see what kinds of things it takes to make it work smoothly here in its source code on GitHub : https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java
Look like few property names are changed in recent versions.
`String serviceAccount = "service-account#test.gserviceaccount.com";
String keyfile = "/path/to/local/keyfile.p12";
hadoopConfiguration.set("google.cloud.auth.service.account.enable", true);
hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);`

Hadoop HA Namenode remote access

Im configuring Hadoop 2.2.0 stable release with HA namenode but i dont know how to configure remote access to the cluster.
I have HA namenode configured with manual failover and i defined dfs.nameservices and i can access hdfs with nameservice from all the nodes included in the cluster, but not from outside.
I can perform operations on hdfs by contact directly the active namenode, but i dont want that, i want to contact the cluster and then be redirected to the active namenode. I think this is the normal configuration for a HA cluster.
Does anyone now how to do that?
(thanks in advance...)
You have to add more values to the hdfs site:
<property>
<name>dfs.ha.namenodes.myns</name>
<value>machine-98,machine-99</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-98</name>
<value>machine-98:8100</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-99</name>
<value>machine-145:8100</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-98</name>
<value>machine-98:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-99</name>
<value>machine-145:50070</value>
</property>
You need to contact one of the Name nodes (as you're currently doing) - there is no cluster node to contact.
The hadoop client code knows the address of the two namenodes (in core-site.xml) and can identity which is the active and which is the standby. There might be a way by which you can interrogate a zookeeper node in the quorum to identify the active / standby (maybe, i'm not sure) but you might as well check one of the namenodes - you have a 50/50 chance it's the active one.
I'd have to check, but you might be able to query either if you're just reading from HDFS.
for Active Name node you can always ask Zookeeper.
you can get the active name node from the below Zk Path.
/hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock
There are two ways to resolve this situation(code with java)
use core-site.xml and hdfs-site.xml in your code
load conf via addResource
use conf.set in your code
set hadoop conf via conf.set
an example use conf.set

Coexistance of Hadoop MR1 and MR2

Is is possible to run both Hadoop MR1 and MR2 together in same cluster (at least, in theory)?
If yes, how can I do that?
In theory, you can do as:
run DataNode TaskTracker and NodeManager on one machine
run NameNode SecondaryNameNode and ResourceManager on other machines
all processes with different ports
but, not suggest to do this, see cloudera blog:
"Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade performance and may result in an unstable cluster deployment."
In theory, yes.
Unpack the tarball into 2 different locations, owned by different users.
In both of them, change all mapred/yarn related ports to mutually exclusive sets.
Run the datanodes from only one of the locations.
Start mapred/yarn related daemons in both locations
Do post here if it works.
Also dfs name dir and data dir should be different for MR1 and MR2.
<property>
<name>dfs.name.dir</name>
<value>/home/userx/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/userx/hdfs/data</value>
</property>
It seems for Mapr, this is not only a theory but practice, check this link.
You dont need to run both, just run the Hadoop 2.0, it provides full backward compatibility to MapReduce applications written for Hadoop 1.0.
There are few minor changes in API, please look at the link to check if any changes effect your applications.

In Hadoop , how to get the instance of currently running Jobtracker?

I am working on a Monitoring Tool for Hadoop. I need to get the currently running jobtracker. How can I get that?
Check out the <hadoopdir>/conf/mapred-site.xml configuration file.
In this file, you should find a <property> that has a <name> of mapred.job.tracker:
<property>
<name>mapred.job.tracker</name>
<value>node5:12345</value>
<property>
This tells you what node it is running on and what port it is running on.
If you are looking for any specific information, please elaborate in your original question.
I believe the closest you can get at this time is using the JobClient class: here
This will allow you to see running jobs, or walk all jobs.

Resources