Debugging Hadoop applications

I tried printing out values using System.out.println(), but they don't appear on the console. How do I print out values for debugging purposes in a map/reduce application using Hadoop?
Thanks,
Deepak.

The page @SquareCog points to is a very good source of information on debugging a MapReduce job once you are running it on a cluster.
Before you reach that point, though, you should consider writing unit tests for your mappers and reducers so you can verify that the basic logic works. If you are interested in unit tests to test-drive your map and reduce logic, check out MRUnit, which works in a similar fashion to JUnit.
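To make the unit-testing suggestion concrete: if you keep the per-record logic in a plain function, you can test it with nothing but the JDK, and MRUnit then only needs to wrap the same logic in its mapper/reducer drivers. A minimal sketch (class and method names are made up for illustration):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountLogicTest {
    // The tokenizing a word-count mapper performs per input line, pulled out
    // into a pure function so it can be tested without any Hadoop machinery.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "cat cat dog" should map to three (word, 1) pairs, as in the
        // classic word count example.
        System.out.println(map("cat cat dog")); // [cat=1, cat=1, dog=1]
    }
}
```

The same expectations can then be expressed against the real Mapper class with MRUnit's MapDriver once the pure logic is verified.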

Lines printed out using System.out.println and System.err.println go to the logs that Hadoop writes on a per-job basis. You can view them using the web UI by clicking through to individual tasks.
See this blog post from Cloudera for a lot of advice on testing and debugging Hadoop jobs: http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/

I'd like to add that you can use Counters (see the Hadoop JavaDoc) for debugging as well, e.g. if you want to check how often a certain line in your program is reached.
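In a real mapper or reducer you would call context.getCounter(SomeEnum.SOME_VALUE).increment(1) and Hadoop aggregates the counts across all tasks into the job's counter display. The sketch below only simulates that pattern locally, with an EnumMap standing in for Hadoop's counters; the enum names and record format are made up for illustration:

```java
import java.util.EnumMap;

public class CounterSketch {
    // Hypothetical counter group; in a real mapper you would instead call
    // context.getCounter(DebugCounter.BAD_RECORD).increment(1).
    enum DebugCounter { GOOD_RECORD, BAD_RECORD }

    // Count how often each branch of the per-record logic is taken,
    // using an EnumMap as a local stand-in for Hadoop's counters.
    static EnumMap<DebugCounter, Long> tally(String[] records) {
        EnumMap<DebugCounter, Long> counters = new EnumMap<>(DebugCounter.class);
        for (String record : records) {
            DebugCounter c = record.contains("\t")
                    ? DebugCounter.GOOD_RECORD
                    : DebugCounter.BAD_RECORD; // malformed line: no tab separator
            counters.merge(c, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        String[] lines = { "a\t1", "b\t2", "garbage" };
        System.out.println(tally(lines)); // {GOOD_RECORD=2, BAD_RECORD=1}
    }
}
```

Counter values show up in the job's web UI and in the client output when the job finishes, so they survive even when stdout is hard to find.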

System.out.println output is not directed to the terminal but rather written to log files. The log files are located in your Hadoop directory by default, so you can type in the terminal
cat /hadoop-homedirectory/logs/userlogs/jobID/attemptID/stdout
You can also access your logs via the web interface, at ip:50070/logs/userlogs.

If you use MRUnit and ApprovalTests, it is very easy to print out the process of a MapReduce job.
There's a video on it here: http://t.co/leExFVrf
For example the code
HadoopApprovals.verifyMapReduce(new WordCountMapper(),
new WordCountReducer(), 0, "cat cat dog");
will produce the output:
[cat cat dog]
-> maps via WordCountMapper to ->
(cat, 1)
(cat, 1)
(dog, 1)
-> reduces via WordCountReducer to ->
(cat, 2)
(dog, 1)

Related

MapReduce Code for executing .ds2 scoring file in hadoop cluster

I have a .ds2 file created from the SAS Scoring Accelerator.
The Scoring Accelerator has a macro, %INDHD_RUN_MODEL, which internally creates MapReduce code to execute the .ds2 file.
How can I see that code, or how can I create similar MapReduce code, so that I can schedule the scoring through Oozie?
The %indhd_run_model macro is pre-compiled and stored in SASHOME\SASFoundation\9.4\accelmvahadp\cmacros. It appears that the developers of this code want to keep the contents hidden.
libname macros 'C:\SAS\SASHome\SASFoundation\9.4\accelmvahadp\cmacros';
options sasmstore=macros mstored;
%COPY indhd_run_model / source;
NOTE: The macro %INDHD_RUN_MODEL was compiled with the /SECURE option. No output will be produced for this %COPY statement.
ERROR: The /SOURCE option was not specified when the macro
INDHD_RUN_MODEL was compiled.
Because the /secure option is enabled, it is not possible to view the Hadoop code template directly; however, you can get better insight by enabling the following options in SAS:
options mlogic mlogicnest sastrace=',,,d' sastraceloc=saslog nostsuffix;
This will turn on a few things:
The actual logic of different parts of the macro being run
The HiveQL queries being sent by SAS, and the response from Hadoop; output is in the SAS log
You can also play around with the arguments of the sastrace option to get different types of information.
Worst case, you'll need to contact SAS Tech Support to get more help on the issue. I would recommend calling, as you'll have a much faster response.
You can schedule %INDHD_RUN_MODEL, but you can't schedule the MapReduce jobs directly. The SAS EP (Embedded Process) translates the .ds2 code into MapReduce jobs and then executes them. The whole process is meant to be a black box for end users.

Hadoop Spark (MapR) - how does addFile work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS containing hundreds of files which I want to process with Spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a
problem for a local mode. When in a distributed mode, you will want to use Spark's
addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this. Will Spark create a copy of the file on each node?
What I want is for it to read the files present in a given directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Hadoop Load and Store

When I am trying to run a Pig script which has two "store" statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running MapReduce programs, the output file is opened once, data is written, and then it is closed. With this approach you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the MapReduce programs still hang. If they do, there is some other issue.
You can use the results and the MapReduce logs to analyze what went wrong.
[Edit:]
You can not write to the same file or append to an existing file. The HDFS Append feature is a work in progress.
To work on this you can do two things:
1) If you have the same schema content in both Alert_Message_Count and Warning_Message_Count, you could use union as suggested by Chris.
2) Do post processing when the schema is not the same. That is write a map reduce program to merge the two separate outputs into one.
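For option 2, the post-processing can be as simple as concatenating the part files from the two output folders. The sketch below does this locally in plain Java against fake output directories it creates itself (all paths are made up); at scale you would do the same thing with an identity-map MapReduce job reading both folders, or with `hadoop fs -getmerge`:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeOutputs {
    // Concatenate every part-* file under the given directories into one
    // list of lines, mimicking what an identity-map merge job (or
    // `hadoop fs -getmerge`) would produce.
    static List<String> merge(Path... outputDirs) throws IOException {
        List<String> merged = new ArrayList<>();
        for (Path dir : outputDirs) {
            try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*")) {
                for (Path part : parts) {
                    merged.addAll(Files.readAllLines(part));
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Build two fake job output folders to run the merge against.
        Path a = Files.createTempDirectory("alert_out");
        Path b = Files.createTempDirectory("warning_out");
        Files.write(a.resolve("part-00000"), Arrays.asList("alert\t5"));
        Files.write(b.resolve("part-00000"), Arrays.asList("warning\t2"));
        System.out.println(merge(a, b));
    }
}
```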
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps) - but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

Hadoop - sharing files between multiple jobs in a chain

I have written a map-reduce application that consists of two map-reduce phases.
binary input file -> m1 -> r1 -> m2 -> r2 -> text output
The input file to my application contains a small chunk of data (<1k) that is needed by the second reducer (r2). I have written a custom record reader that extracts this data, but then how do I pass this along to the next job? It seems like this is a job for DistributedCache, but it appears that DistributedCache cache files are scoped to a single job's scratch space. What is the best way to share small data between different jobs in the same chain?
Try running hadoop with the -files option.
I had a similar issue in the past, and the -files option worked for me.
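For context on what -files gives you: Hadoop copies each listed file into every task's working directory, so task code can open it by its bare name. The sketch below simulates that read side locally by creating the file itself (the filename lookup.dat and its contents are made up); in a real task, only the readAllLines part would be needed, typically in setup():

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SideFileDemo {
    static List<String> lines;

    public static void main(String[] args) throws IOException {
        // In a real task, `-files lookup.dat` makes "lookup.dat" appear in
        // the task's working directory; here we create it ourselves so the
        // read side can run locally.
        Path side = Paths.get("lookup.dat");
        Files.write(side, "ERROR=3\nWARN=2\n".getBytes());

        // This is all a mapper's setup() would need to do:
        lines = Files.readAllLines(side);
        for (String line : lines) {
            System.out.println(line);
        }
        Files.deleteIfExists(side);
    }
}
```

Note that -files is handled by GenericOptionsParser, so your driver must run through ToolRunner for the option to be picked up.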

Variable/looping sequence of jobs

I'm considering using hadoop/mapreduce to tackle a project and haven't quite figured out how to set up a job flow consisting of a variable number of levels that should be processed in sequence.
E.g.:
Job 1: Map source data into X levels.
Job 2: MapReduce Level1 -> appends to Level2
Job 3: MapReduce Level2 -> appends to Level3
Job N: MapReduce LevelN -> appends to LevelN+1
And so on until the final level. The key is that each level must include its own specific source data as well as the results of the previous level.
I've looked at pig, hive, hamake, and cascading, but have yet to see clear support for something like this.
Does anyone know an efficient way of accomplishing this? Right now I'm leaning towards writing a wrapper for hamake that will generate the hamake file based on parameters (the number of levels is known at runtime but could change with each run).
Thanks!
Oozie (http://yahoo.github.com/oozie/) is an open-source server that Yahoo released to manage Hadoop and Pig workflows like the one you are describing.
Cloudera includes it in their latest distro, with very good documentation: https://wiki.cloudera.com/display/DOC/Oozie+Installation
Here is a video from Yahoo: http://sg.video.yahoo.com/watch/5936767/15449686
You should be able to generate the pig code for this pretty easily using Piglet, the Ruby Pig DSL:
http://github.com/iconara/piglet
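To add a sketch of the plain-Java alternative the asker is leaning toward: the driver-side wiring for a variable number of levels can simply be a loop that computes each level's input and output paths, which is the only part that varies with the number of levels. The /source/levelN and /work/levelN layout below is hypothetical, and a real driver would configure and submit a Job where the comment indicates:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LevelChainPlanner {
    // For job i: read this level's own source data plus the previous level's
    // output, and write this level's output. Returns one {inputs, output}
    // pair per level.
    static List<String[]> plan(int levels) {
        List<String[]> jobs = new ArrayList<>();
        for (int i = 1; i <= levels; i++) {
            List<String> inputs = new ArrayList<>();
            inputs.add("/source/level" + i);          // level-specific source data
            if (i > 1) {
                inputs.add("/work/level" + (i - 1)); // previous level's results
            }
            String output = "/work/level" + i;
            // A real driver would configure a Job here: one
            // FileInputFormat.addInputPath per input, set the output path,
            // then job.waitForCompletion(true) before starting level i + 1.
            jobs.add(new String[] { String.join(",", inputs), output });
        }
        return jobs;
    }

    public static void main(String[] args) {
        for (String[] job : plan(3)) {
            System.out.println(Arrays.toString(job));
        }
    }
}
```

Since the number of levels is known at runtime, passing it into a loop like this avoids generating hamake files at all, at the cost of doing your own failure handling between jobs.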
