Pig non-aggregated warnings output location?

Pig: 0.8.1-cdh3u2
Hadoop: 0.20.2-cdh3u0
I'm debugging FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings, but I can't get the individual warnings printed anywhere. Disabling aggregation via the -w or aggregate.warnings=false switch removes the summary messages, but it removes the actual warnings too, so I can't see which type conversion failed.
There's nothing written in Pig's log for this run, and I can't locate any logs containing the individual warnings. Did I miss something obvious, or does it simply not work?

Hadoop job logs are recorded locally on each compute node, so you first need to set up your Hadoop cluster manager to collect the log files onto the distributed file system so that you can analyse them. If you use Hadoop on Demand (http://hadoop.apache.org/docs/r0.17.0/hod.html), you should be able to do that by specifying something like:
log-destination-uri = hdfs://host123:45678/user/hod/logs
See the HOD documentation at http://hadoop.apache.org/docs/r0.17.0/hod_user_guide.html#Collecting+and+Viewing+Hadoop+Logs
After you have the logs on HDFS you can run a simple Pig query to find the offending conversion. Something like the following should do the trick:
a1 = LOAD '*.log' USING PigStorage(']');
-- MATCHES must match the whole field, hence the leading .*
a2 = FILTER a1 BY ($1 MATCHES '.*WARN.*Unable to interpret value.*');
DUMP a2;

It's difficult to find which data or value is causing the issue, but at least you can find which column is responsible. Once you find the column, you can use a dynamic invoker, which may help you with the type conversion.
How to use a dynamic invoker:
DEFINE ConvertToDouble InvokeForDouble('java.lang.Double.parseDouble', 'String');
converted = FOREACH data GENERATE ConvertToDouble(column_name);

Related

MapReduce Code for executing .ds2 scoring file in hadoop cluster

I have a .ds2 file created from the SAS Scoring Accelerator.
The SAS Scoring Accelerator has a macro, %INDHD_RUN_MODEL, which internally creates MapReduce code to execute the .ds2 file.
How can I see that code, or how can I create similar MapReduce code myself,
so that I can schedule the scoring through Oozie?
The %indhd_run_model macro is pre-compiled and stored in SASHOME\SASFoundation\9.4\accelmvahadp\cmacros. It appears that the developers of this code want to keep the contents hidden.
libname macros 'C:\SAS\SASHome\SASFoundation\9.4\accelmvahadp\cmacros';
options sasmstore=macros mstored;
%COPY indhd_run_model / source;
NOTE: The macro %INDHD_RUN_MODEL was compiled with the /SECURE option. No output will be produced for this %COPY statement.
ERROR: The /SOURCE option was not specified when the macro
INDHD_RUN_MODEL was compiled.
Because the /secure option is enabled, it is not possible to specifically view the Hadoop code template; however, you can get better insight by enabling the following options in SAS:
options mlogic mlogicnest sastrace=',,,d' sastraceloc=saslog nostsuffix;
This will turn on a few things:
The actual logic of different parts of the macro being run
The HiveQL queries being sent by SAS, and the response from Hadoop; output is in the SAS log
You can also play around with the arguments of the sastrace option to get different types of information.
Worst case, you'll need to contact SAS Tech Support to get more help on the issue. I would recommend calling, as you'll have a much faster response.
You can schedule %INDHD_RUN_MODEL, but you can't schedule the MapReduce jobs directly. The SAS EP (Embedded Process) translates the .ds2 code into MapReduce jobs and then executes them. The whole process is meant to be a black box for end users.

pyspark : how to check if a file exists in hdfs

I want to check whether several files exist in HDFS before loading them with SparkContext.
I use pyspark. I tried
os.system("hadoop fs -test -e %s" %path)
but since I have a lot of paths to check, the job crashed.
I also tried sc.wholeTextFiles(parent_path) and then filtering by key, but that crashed too because parent_path contains a lot of sub-paths and files.
Could you help me?
Tristan Reid is right when he says:
...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check whether a particular file exists.
I hope it can help somehow.
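For reference, here is a minimal sketch of that kind of check done through Spark's JVM gateway to the Hadoop FileSystem API, without spawning external processes; hdfs_exists and paths_to_check are hypothetical names:
# Rough sketch: reuse the driver's Hadoop configuration to query HDFS directly.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

def hdfs_exists(path):
    # True if the path (file or directory) exists on the default filesystem.
    return fs.exists(hadoop.fs.Path(path))

existing = [p for p in paths_to_check if hdfs_exists(p)]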
Have you tried using pydoop? The exists function should work.
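For example, a minimal sketch assuming pydoop is installed on the machine running the driver (the path list is a placeholder):
import pydoop.hdfs as hdfs

# pydoop connects to HDFS natively, so no external process is spawned per path.
paths_to_check = ['/user/me/data/part-00000']   # hypothetical paths
existing = [p for p in paths_to_check if hdfs.path.exists(p)]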
One possibility is that you can use hadoop fs -lsr your_path to get all the paths, and then check if the paths you're interested in are in that set.
Regarding your crash, it's possible it was a result of all the calls to os.system, rather than being specific to the hadoop command. Sometimes calling an external process can result in issues related to buffers that are never getting released, in particular I/O buffers (stdin/stdout).
One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths, write it out, and then execute it.
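A rough sketch of that idea, where the paths and the generated script are placeholders:
import subprocess
import tempfile

paths_to_check = ['/data/a', '/data/b']   # hypothetical paths

# String template for a script that loops over all the paths in one external call.
template = """#!/bin/bash
for p in %s; do
  hadoop fs -test -e "$p" && echo "$p EXISTS" || echo "$p MISSING"
done
"""
script_body = template % ' '.join(paths_to_check)

with tempfile.NamedTemporaryFile('w', suffix='.sh', delete=False) as f:
    f.write(script_body)
    script_path = f.name

print(subprocess.check_output(['bash', script_path]).decode())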
It may also be a good idea to switch to Python's subprocess module, which gives you more granular control over handling subprocesses. Here's a rough equivalent of the os.system call:
from subprocess import Popen, PIPE

process = Popen(
    args=your_script,
    stdout=PIPE,
    shell=True
)
output, _ = process.communicate()
Note that you can switch stdout to something like a file handle if that helps you with debugging or making the process more robust. Also you can switch that shell=True argument to False unless you're going to call an actual script or use shell-specific things like pipes or redirection.

How can I debug a pig script

If, while running a simple GROUP BY script in Pig on terabytes of data, the script gets stuck at, say, 70%, what can be done to diagnose the problem?
There are several ways to debug a Pig script. The simplest is to execute the relations step by step and verify the results. These commands are useful for debugging a Pig script.
DUMP - Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen.
ILLUSTRATE - Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.
EXPLAIN - Use the EXPLAIN operator to review the logical, physical, and map reduce execution plans that are used to compute the specified relationship.
DESCRIBE - Use the DESCRIBE operator to view the schema of a relation. You can view outer relations as well as relations defined in a nested FOREACH statement.
More detail about these commands is available at this link.
Also please refer to developing and testing a Pig script for more detail.
If you want to debug the whole script during execution, then you need to add the code below at the top of your script:
-- set the debug mode on
SET debug 'on'
-- set a name for your job.
SET job.name 'my job'
This will run your script in debug mode. More detail about the SET command is available at this link.
When you say the script is stuck at 70%, I assume you mean the MR job is 70% complete.
It's best to look at MR and YARN logs (and if needed, HDFS logs) at that point for more information about what MR/YARN is doing. Logs can be typically found under /var/log/hadoop-mapreduce and /var/log/hadoop-hdfs in Cloudera Manager managed clusters. You may need to examine logs from multiple nodes in the cluster where YARN NodeManagers are running.
In case your script is stuck because of a Pig issue (i.e. an issue in Pig code, not MR/HDFS code), it is useful to increase the log4j logging level in Pig:
For example, pig -d DEBUG is the command-line option that sets the logging level to DEBUG.

Hadoop Load and Store

When I try to run a Pig script which has two "store" statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running MapReduce programs, the output file is opened once, data is written, and then it is closed. Given this approach, you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the MapReduce jobs still hang. If they do, then there is some other issue.
You can look at the results and the MapReduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file. The HDFS append feature is a work in progress.
To work around this you can do two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you can use UNION as suggested by Chris.
2) Do post-processing when the schemas are not the same, that is, write a MapReduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps), but I would expect some form of error message rather than it just hanging.
If you open the cluster's job tracker and look at the logs for the task, does the log yield anything of note that can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

How do you deal with empty or missing input files in Apache Pig?

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs to manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, and can result in either no input files or 0-byte files being given to the pipeline, or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it either doesn't find any input files or any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS elastic map reduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)
(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
but at least Pig doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
These same ideas could be used to solve the no input files condition by creating a dummy file, but it has the same downsides as are listed above.
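A minimal sketch of that detect-and-substitute step, run outside Pig before the job starts; the helper names and paths are hypothetical, and it assumes the hadoop CLI is available to the driver:
import subprocess

def hdfs(args):
    # Return True if the hadoop CLI command exits with status 0.
    return subprocess.call(['hadoop', 'fs'] + args) == 0

def fix_input(path):                                  # hypothetical helper
    missing = not hdfs(['-test', '-e', path])
    empty = (not missing) and hdfs(['-test', '-z', path])
    if empty:
        hdfs(['-rm', path])                           # drop the 0-byte file
    if missing or empty:
        # Replace it with a file containing a single newline, as described above.
        p = subprocess.Popen(['hadoop', 'fs', '-put', '-', path],
                             stdin=subprocess.PIPE)
        p.communicate(b'\n')

for path in ['input/part-00000']:                     # hypothetical pipeline inputs
    fix_input(path)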
The approach I've been using is to run Pig scripts from a shell script. I have one job that gets data from six different input directories, so I've written a fragment for each input file.
The shell checks for the existence of the input file and assembles a final pig script from the fragments.
It then executes the final pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
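A rough sketch of that kind of driver, in Python rather than shell; the fragment names, input paths, and final script name are all hypothetical:
import subprocess

# Map each optional input directory to the Pig fragment that loads it.
fragments = {
    'input/dir1': 'load_dir1.pig',
    'input/dir2': 'load_dir2.pig',
}

def hdfs_exists(path):
    return subprocess.call(['hadoop', 'fs', '-test', '-e', path]) == 0

# Keep only the fragments whose input actually exists, then append the common tail.
parts = [open(frag).read() for path, frag in fragments.items() if hdfs_exists(path)]
parts.append(open('aggregate.pig').read())

with open('final_job.pig', 'w') as out:
    out.write('\n'.join(parts))

subprocess.call(['pig', 'final_job.pig'])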
