Various job statistics using yarn and hadoop 2.2.0 - hadoop

I've recently installed a 2-node Hadoop 2.2.0 cluster using the new YARN framework.
The jobs run and everything looks fine, but I wanted to know if there is a way to actually verify that both nodes are running the job and not just one (I can't seem to find any relevant information about this in the output of the hadoop jar ... command, where the MapReduce completion statistics are displayed).
I also want to know how I can verify that both nodes are storing data for the DFS. I ran df and it seems only ONE node is actually storing data (I used hadoop dfs -put to upload some big text files).
So, in short:
How can I tell which nodes actually ran a specific job?
How can I tell which DataNodes actually hold which data? (After reading some tutorials, I use replication = 2 to make sure both nodes share the load of the data I've put on the DFS.)
It's really hard for me to Google this specifically, because Hadoop isn't as well covered as other topics I'm used to googling, and most threads I end up finding are unanswered or irrelevant.
Thanks

You'll need to check the job tracking web UI - under YARN this is the ResourceManager UI (port 8088 by default) rather than the old JobTracker UI on port 50030. From there you can list the active nodes, and by drilling into a job you can see which tasks ran on which node (completed, failed, etc.).
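If you prefer the command line, the yarn CLI exposes some of the same information; a quick sketch (the application ID below is only a placeholder):
# list the NodeManagers registered with the ResourceManager
yarn node -list
# list applications and check the status of a particular one
yarn application -list
yarn application -status application_1385051297072_0001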
You can use a command line tool to list the blocks and their locations:
hadoop fsck <path> -files -blocks -locations
See this link for more info on the fsck cmd: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/CommandsManual.html#fsck
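A concrete example of the above, plus dfsadmin -report, which summarises how much data each DataNode is actually storing (the file path is just an example):
# block locations for a single file
hadoop fsck /user/hadoop/bigfile.txt -files -blocks -locations
# per-DataNode capacity and usage
hdfs dfsadmin -report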

Related

How does Hadoop distribute the data/tasks for MapReduce jobs?

I've set up a Hadoop cluster with 4 nodes, one of which serves as the NameNode for HDFS as well as the YARN master. This node is also the most powerful.
Now, I've distributed 2 text files, one on node01 (the NameNode) and one on node03 (a DataNode). When running the basic WordCount MapReduce job, I can see in the logs that only node01 was doing any calculations.
My question is why Hadoop didn't decide to do MapReduce on node03 and transfer the result, instead of transferring the entire book to node01. I also checked: replication is disabled and the book is only available on node03.
So, how does Hadoop decide between transferring the data and setting up the job locally? And in this decision, does it check which machine has more compute power (e.g., did it decide to transfer to node01 because node01 is a 4-core, 4 GB RAM machine vs. 2 cores and 1 GB on node03)?
I couldn't find anything on this topic, so any guidance would be appreciated.
Thank you!
Some more clarifications:
node01 is running a NameNode as well as a DataNode and a ResourceManager as well as a NodeManager. Thus, it serves as "main node" as well as a "compute node".
I made sure to put one file on node01 and one file on node03 by running:
hdfs dfs -put sample1.txt samples on node01 and hdfs dfs -put sample02.txt samples on node03. As replication is disabled, this leads to the data - which was available locally on node01 and node03, respectively - only being stored there.
I verified this using the HDFS Webinterface. For sample1.txt, it says the blocks are only available on node01; for sample2.txt, it says the blocks are only available on node03.
Regarding @cricket_007's comment:
My concern is that sample2.txt is only available on node03. The YARN web interface tells me that for the application attempt, only one container was allocated, on node01. If the map task for sample2.txt had run on node03, there would have been a container on node03 as well.
Thus, node01 needs to have fetched the sample2.txt file from node03.
Yes, I know Hadoop is not running well on 1gig of RAM, but I am working with a Raspberry Pi cluster just to fiddle around and learn a little. This is not for production usage.
The YARN ApplicationMaster picks a node for the calculation somewhat arbitrarily, based on information available from the NameNode about where the file's blocks are stored. DataNodes and NodeManagers should run on the same machines.
If your file isn't larger than the HDFS block size, there is no reason to fetch the data from other nodes.
Note: Hadoop services don't run that well on only 1G of RAM, and you need to adjust the YARN settings differently for different sized nodes.
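As a rough illustration of that last point, these are the yarn-site.xml knobs that usually need lowering on a small node; the numbers below are only placeholder guesses for a 1 GB box, not recommended values:
# total memory the NodeManager may hand out to containers
yarn.nodemanager.resource.memory-mb = 768
# smallest and largest container the scheduler will allocate
yarn.scheduler.minimum-allocation-mb = 128
yarn.scheduler.maximum-allocation-mb = 768
# virtual-memory checking often has to be relaxed on small machines
yarn.nodemanager.vmem-check-enabled = false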
For anyone else wondering:
At least for me, the HistoryServer UI (which needs to be started manually) correctly shows that both node03 and node01 were running map tasks. Thus, my earlier statement was incorrect. I still wonder why the application attempt UI speaks of only one container, but I guess that doesn't matter.
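In case it helps anyone: on a plain Apache install the history server is usually started with something along these lines (the exact path may differ per distribution), and its web UI then listens on port 19888 by default:
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver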
Thank you guys!

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. I do this using SQL Profiler in SQL Server; however, I miss such a tool for Hadoop, so I am assuming I can get some of this information from the logs. I think all logs are stored at /var/logs/hadoop/, but I am not sure which file I need to look at and how to set that file to capture detailed-level information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of the NameNode and DataNode services. Each has its own log. The location of the logs is distribution-dependent. See "File Locations" for Hortonworks, or "Apache Hadoop Log Files: Where to find them in CDH, and what info they contain" for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application on YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MR application master (the M/R component), which is a YARN application with yet another log of its own.
Jobs consist of tasks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs of the jobs executed.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs resulting from the jobs they submit (which log just like any other job does).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide UIs to expose the most common logs for easy access. Usually they expose the logs collected by the JobHistory service from the UI that shows job status and job history.
I cannot point you to anything equivalent to SQL Profiler, because the problem space is orders of magnitude more complex, with many different products, versions, and vendor-specific distributions involved. I recommend starting by reading about and learning how the Job History server runs and how it can be accessed.
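To address the 'verbose' part concretely, a few standard knobs may help; the host, port, and application ID below are placeholders, and the yarn logs command assumes log aggregation is enabled:
# run a single client command with DEBUG logging on the console
HADOOP_ROOT_LOGGER=DEBUG,console hdfs dfs -put sample.txt /tmp/
# bump the log level of a running daemon (here the NameNode, via its HTTP port)
hadoop daemonlog -setlevel namenode-host:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG
# pull the aggregated logs of a finished YARN application
yarn logs -applicationId application_1385051297072_0001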

Hadoop ResourceManager not show any job's record

I installed a Hadoop multi-node cluster based on this link: http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
Then I tried to run the wordcount example in my environment, but when I access the ResourceManager at http://HadoopMaster:8088 to see the job's details, no records show up in the UI.
I also searched for this problem. One person gave a solution in Hadoop is not showing my job in the job tracker even though it is running, but in my case I'm just running Hadoop's wordcount example and didn't add any extra configuration for YARN.
Has anyone who has successfully installed a multi-node Hadoop 2 cluster with a correctly working web UI been able to solve this issue, or can someone point me to a guide for installing it correctly?
Did you get the output of the word-count job?
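A quick thing to check: if MapReduce is not configured to run on YARN, jobs fall back to the local job runner and will never appear in the ResourceManager UI. The property name is standard, but treat the commands below as a sketch for your setup:
# list the applications the ResourceManager actually knows about
yarn application -list
# mapred-site.xml must route jobs to YARN:
# mapreduce.framework.name = yarn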

Can I use hadoop to run multiple web servers?

I am not sure about what hadoop can and cannot do, and how easy things are.
I understand hadoop is good at doing mapreduce jobs and at providing hdfs, their distributed filesystem.
What else is Hadoop good at / easy to use for?
My problem: I would like to serve data that is the result of MapReduce. As I have a lot of traffic, I would need 3 front-end servers. Can Hadoop help me deploy a server on 3 of my n running nodes?
Basically, instead of running MapReduce on n machines, I would like to run a custom executable (my server) on 3 machines, and when 1 machine fails, have Hadoop take care of starting the job on another available machine.
Am I supposed to run that on the hadoop cluster ? or should the hadoop cluster be used only for the mapreduce and I should have a separate cloud to serve the data from the hadoop cluster ?
Thanks for sharing your experience.
P.S. I am just considering Hadoop as a solution right now; I'm not tied to it.
Your question isn't actually clear but here is my shot.
You want to display the result of your Hadoop job? Usually a Hadoop job writes its result to HDFS. What you can do is create your own OutputFormat class; you might define an XMLOutputFormat, for example.
But the nice thing is that you can also create your own Writable. Take a look at Database Access with Apache Hadoop. In that tutorial, the output of a Hadoop job is saved to a database system.
Your frontend then can query the database and show the result.

Hadoop removes MapReduce history when it is restarted

I am carrying out several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am basically testing with different numbers of DataNodes in order to assess the linearity of the processing capacity and DataNode scalability.
During the above-mentioned process, I have obviously had to restart the whole Hadoop environment several times. Every time I restarted Hadoop, all MapReduce jobs were removed and the job counter started again from "job_2013*_0001". For comparison reasons, it is very important for me to keep all the MapReduce jobs I have previously launched. So, my questions are:
How can I prevent Hadoop from removing all MapReduce job history after it is restarted?
Is there some property to control job removal after the Hadoop environment is restarted?
Thanks!
The MR job history logs are not deleted right away after you restart Hadoop; however, new jobs will be counted from *_0001, and only jobs started after the restart will be displayed on the ResourceManager web portal. In fact, there are 2 log-related settings in the YARN defaults:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
and the default ${yarn.log.dir} is defined in $HADOOP_HOME/etc/hadoop/yarn-env.sh:
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings can also be found in mapred-env.sh if you are using Hadoop 1.x.
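If the goal is to keep the job records themselves visible across restarts, the usual approach is to run the MapReduce JobHistory server and point it at persistent HDFS directories; the property names are standard, but the paths below are only illustrative:
# mapred-site.xml: where in-progress and finished job history files are kept (on HDFS)
mapreduce.jobhistory.intermediate-done-dir = /mr-history/tmp
mapreduce.jobhistory.done-dir = /mr-history/done
# yarn-site.xml: aggregate container logs to HDFS so they survive restarts
yarn.log-aggregation-enable = true
# then start the history server (web UI on port 19888 by default)
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver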
