How can I debug a pig script - hadoop

While running a simple GROUP BY script in Pig over terabytes of data, the job gets stuck at, say, 70%. What can be done to diagnose the problem?

There are several ways to debug a Pig script. The simplest is to execute it relation by relation and verify the result at each step (a minimal sketch follows the list below). The following operators are useful for debugging:
DUMP - Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen.
ILLUSTRATE - Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.
EXPLAIN - Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relation.
DESCRIBE - Use the DESCRIBE operator to view the schema of a relation. You can view outer relations as well as relations defined in a nested FOREACH statement.
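A minimal sketch of that step-by-step workflow in the Grunt shell, using a hypothetical input file and made-up relation names:
grunt> logs = LOAD 'events.log' AS (level:chararray, msg:chararray);
grunt> DESCRIBE logs;        -- confirm the schema was parsed as expected
grunt> grouped = GROUP logs BY level;
grunt> counts = FOREACH grouped GENERATE group, COUNT(logs);
grunt> ILLUSTRATE counts;    -- trace a small sample through each statement
grunt> EXPLAIN counts;       -- inspect the logical, physical, and MR plans
grunt> DUMP counts;          -- execute and print the result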
More detail about these operators is available at this link.
Also refer to Developing and Testing a Pig Script for more detail.
If you want to debug the whole script during execution, add the following statements at the top of your script:
-- set the debug mode on
SET debug 'on'
-- set a job name of your job.
SET job.name 'my job'
This will run your script in debug mode. More detail about the SET command is available at this link.
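For context, here is a hedged sketch of how those SET statements sit at the top of a simple group-by script like the one in the question (paths, field names, and the job name are made up):
-- set the debug mode on
SET debug 'on'
-- set a job name so the job is easy to find in the JobTracker/ResourceManager UI
SET job.name 'group-by-debug-run'
raw = LOAD '/data/input' USING PigStorage('\t') AS (key:chararray, value:long);
grouped = GROUP raw BY key;
result = FOREACH grouped GENERATE group, SUM(raw.value) AS total;
STORE result INTO '/data/output';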

When you say the script is stuck at 70%, I assume you mean the MR job is 70% complete.
It's best to look at the MR and YARN logs (and if needed, HDFS logs) at that point for more information about what MR/YARN is doing. Logs can typically be found under /var/log/hadoop-mapreduce and /var/log/hadoop-hdfs in Cloudera Manager managed clusters. You may need to examine logs from multiple nodes in the cluster where YARN NodeManagers are running.
In case your script is stuck due to a Pig issue (i.e. an issue in the Pig code, not in MR/HDFS code), it is useful to increase the log4j logging level in Pig.
For example, pig -d DEBUG is the command-line option that sets the logging level to DEBUG.

Related

MapReduce Code for executing .ds2 scoring file in hadoop cluster

I have a .ds2 file created from the SAS Scoring Accelerator.
The SAS Scoring Accelerator has a macro, %INDHD_RUN_MODEL, which internally generates MapReduce code to execute the .ds2 file.
How can I see that code, or how can I create similar MapReduce code,
so that I can schedule the scoring through Oozie?
The %indhd_run_model macro is pre-compiled and stored in SASHOME\SASFoundation\9.4\accelmvahadp\cmacros. It appears that the developers of this code want to keep the contents hidden.
libname macros 'C:\SAS\SASHome\SASFoundation\9.4\accelmvahadp\cmacros';
options sasmstore=macros mstored;
%COPY indhd_run_model / source;
NOTE: The macro %INDHD_RUN_MODEL was compiled with the /SECURE option. No output will be produced for this %COPY statement.
ERROR: The /SOURCE option was not specified when the macro
INDHD_RUN_MODEL was compiled.
Because the /secure option is enabled, it is not possible to specifically view the Hadoop code template; however, you can get better insight by enabling the following options in SAS:
options mlogic mlogicnest sastrace=',,,d' sastraceloc=saslog nostsuffix;
This will turn on a few things:
The actual logic of different parts of the macro being run
The HiveQL queries being sent by SAS, and the response from Hadoop; output is in the SAS log
You can also play around with the arguments of the sastrace option to get different types of information.
Worst case, you'll need to contact SAS Tech Support to get more help on the issue. I would recommend calling, as you'll have a much faster response.
You can schedule %INDHD_RUN_MODEL, but you can't schedule the MapReduce jobs directly. The SAS EP (Embedded Process) translates the .ds2 code into MapReduce jobs and then executes them. The whole process is a black box for end users.

Java Vs Scripting for HDFS map/reduce

I am a DB person, so Java is new to me. I am looking for a scripting language for working with HDFS, possibly Python. But I saw in a previous question that the "heartbeat" between the NameNode and DataNodes will not happen if we use a scripting language. Why is that? I could not understand it. We write our application logic to process data in scripts or Java code, so how does that matter for the "heartbeat"?
Any ideas on this?
Python is a good choice for Hadoop if you already know how to code in it. I've used PHP and Perl with success. This part of the Hadoop framework is called Streaming.
For "Heart Beat" I believe you are thinking of Counters. They are user defined "variables" that can only be incremented. Hadoop will terminate task attempt if no counters are incremented for 10 minutes. However you shouldn't worry about this as there are system counters that are automatically incremented for you. If you do have a job that takes very long, you can still use counters with Python (Hadoop Streaming) by sending something like this to standard error output:
reporter:counter:MyGroup,MyCounter,1
For more info on counters with Hadoop Streaming, see this link.

Hadoop Load and Store

When I try to run a Pig script that has two STORE statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
it hangs; it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running MapReduce programs, the output file is opened once, data is written, and the file is closed. With this approach you cannot write data to the same file simultaneously.
Try writing to separate files and check whether the MapReduce jobs still hang. If they do, there is some other issue.
You can examine the results and the MapReduce logs to analyze what went wrong.
[Edit:]
You can not write to the same file or append to an existing file. The HDFS Append feature is a work in progress.
To work on this you can do two things:
1) If you have the same schema content in both Alert_Message_Count and Warning_Message_Count, you could use union as suggested by Chris.
2) Do post-processing when the schemas are not the same; that is, write a MapReduce program to merge the two separate outputs into one.
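As a minimal sketch of the separate-output approach (the directory names here are made up for illustration), give each relation its own path and do any merging afterwards:
store Alert_Message_Count into 'out/alerts';
store Warning_Message_Count into 'out/warnings';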
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps) - but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
Might also be worth checking with the pig mailing lists (if you haven't already)
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

How do you deal with empty or missing input files in Apache Pig?

Our workflow uses an AWS elastic map reduce cluster to run series of Pig jobs to manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, and can result in either no input files or 0 byte files being given to the pipeline or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it either doesn't find any input files or any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS elastic map reduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)
(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
but at least Pig doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
These same ideas could be used to handle the no-input-files condition by creating a dummy file, but that has the same downsides listed above.
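For the '\t'-padding variant, one hedged way to drop the placeholder rows again is a FILTER on a field that real records always populate (the field names and condition below are only assumptions; adjust them to whatever distinguishes your dummy line):
raw = LOAD 'input' AS (id:chararray, value:long);
-- drop rows whose first field is missing or empty (the dummy/padding lines)
clean = FILTER raw BY (id IS NOT NULL) AND (id != '');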
The approach I've been using is to run pig scripts from a shell. I have one job that gets data from six different input directories. So I've written a fragment for each input file.
The shell checks for the existence of the input file and assembles a final pig script from the fragments.
It then executes the final pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)

Variable/looping sequence of jobs

I'm considering using hadoop/mapreduce to tackle a project and haven't quite figured out how to set up a job flow consisting of a variable number of levels that should be processed in sequence.
E.g.:
Job 1: Map source data into X levels.
Job 2: MapReduce Level1 -> appends to Level2
Job 3: MapReduce Level2 -> appends to Level3
Job N: MapReduce LevelN -> appends to LevelN+1
And so on until the final level. The key is that each level must include its own specific source data as well as the results of the previous level.
I've looked at pig, hive, hamake, and cascading, but have yet to see clear support for something like this.
Does anyone know an efficient way of accomplishing this? Right now I'm leaning towards writing a wrapper for hamake that will generate the hamake file based on parameters (the number of levels is known at runtime but could change with each run).
Thanks!
Oozie http://yahoo.github.com/oozie/ is an open-source server that Yahoo released to manage Hadoop and Pig workflows like the one you are describing.
Cloudera includes it in their latest distro with very good documentation: https://wiki.cloudera.com/display/DOC/Oozie+Installation
Here is a video from Yahoo: http://sg.video.yahoo.com/watch/5936767/15449686
You should be able to generate the pig code for this pretty easily using Piglet, the Ruby Pig DSL:
http://github.com/iconara/piglet
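Whichever tool ends up generating the per-level scripts, each level's Pig job can follow roughly the same template. This is only a hedged sketch with made-up paths and field names: the script loads the previous level's output together with that level's own source data, combines them, and stores the result as the input for the next level.
prev_level = LOAD '/levels/level1/part*' AS (id:chararray, score:long);
own_source = LOAD '/source/level2' AS (id:chararray, score:long);
combined = UNION prev_level, own_source;
-- per-level processing (joins, aggregations, etc.) would go here
STORE combined INTO '/levels/level2';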
