MapReduce code for executing a .ds2 scoring file in a Hadoop cluster

I have a .ds2 file created from the SAS Scoring Accelerator.
The SAS Scoring Accelerator has a macro, %INDHD_RUN_MODEL, which internally generates MapReduce code to execute the .ds2 file.
How can I see that code, or how can I create similar MapReduce code myself, so that I can schedule the scoring through Oozie?

The %indhd_run_model macro is pre-compiled and stored in SASHOME\SASFoundation\9.4\accelmvahadp\cmacros. It appears that the developers of this code want to keep the contents hidden.
libname macros 'C:\SAS\SASHome\SASFoundation\9.4\accelmvahadp\cmacros';
options sasmstore=macros mstored;
%COPY indhd_run_model / source;
NOTE: The macro %INDHD_RUN_MODEL was compiled with the /SECURE option. No output will be produced for this %COPY statement.
ERROR: The /SOURCE option was not specified when the macro
INDHD_RUN_MODEL was compiled.
Because the /SECURE option is enabled, it is not possible to view the Hadoop code template directly; however, you can get better insight by enabling the following options in SAS:
options mlogic mlogicnest sastrace=',,,d' sastraceloc=saslog nostsuffix;
This will turn on a few things:
The actual logic of the different parts of the macro as they execute
The HiveQL queries being sent by SAS, and the responses from Hadoop; the output appears in the SAS log
You can also play around with the arguments of the sastrace option to get different types of information.
Worst case, you'll need to contact SAS Tech Support to get more help on the issue. I would recommend calling, as you'll have a much faster response.

You can schedule %INDHD_RUN_MODEL, but you can't schedule the MapReduce jobs directly. The SAS EP (Embedded Process) translates the .ds2 code into MapReduce jobs and then executes them. The whole process is intended to be a black box for end users.

Related

How can I debug a pig script

If a simple GROUP BY script in Pig, running over terabytes of data, gets stuck at, say, 70%, what can be done to diagnose the problem?
There are several ways to debug a Pig script. The simplest is to execute it relation by relation and verify the result at each step. The following commands are useful for debugging a Pig script.
DUMP - Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen.
ILLUSTRATE - Use the ILLUSTRATE operator to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE allows you to test your programs on small datasets and get faster turnaround times.
EXPLAIN - Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relation.
DESCRIBE - Use the DESCRIBE operator to view the schema of a relation. You can view outer relations as well as relations defined in a nested FOREACH statement.
More detail about these commands is available in the Pig documentation; also refer to its material on developing and testing a Pig script. A short example that exercises these operators is sketched below.
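For instance, a minimal sketch that runs these operators over a small, hypothetical tab-delimited log file (the path, field names, and types here are only illustrative):
-- load a small sample; the path and schema are assumptions for this sketch
logs = LOAD 'input/logs.txt' USING PigStorage('\t') AS (level:chararray, msg:chararray);
-- view the schema of the relation
DESCRIBE logs;
-- review the logical, physical, and MapReduce plans without running the job
EXPLAIN logs;
-- execute the statements so far and print the results to the screen
DUMP logs;
grouped = GROUP logs BY level;
counts = FOREACH grouped GENERATE group, COUNT(logs) AS n;
-- trace how a small sample of records flows through the statements
ILLUSTRATE counts;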
If you want to run the whole script in debug mode, add the following lines at the top of your script:
-- set the debug mode on
SET debug 'on'
-- set a job name of your job.
SET job.name 'my job'
This will run your script in debug mode. More detail about the SET command is available in the Pig documentation.
When you say the script is stuck at 70%, I assume you mean the MR job is 70% complete.
It's best to look at MR and YARN logs (and if needed, HDFS logs) at that point for more information about what MR/YARN is doing. Logs can be typically found under /var/log/hadoop-mapreduce and /var/log/hadoop-hdfs in Cloudera Manager managed clusters. You may need to examine logs from multiple nodes in the cluster where YARN NodeManagers are running.
In case your script is stuck with a Pig issue (i.e. issue in Pig code, not MR/HDFS code), it is useful to increase the log4j logging level in Pig:
For example, pig -d DEBUG is the command-line option that sets the logging level to DEBUG.

Hadoop Spark (MapR) - how does addFile work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files that I want to process with Spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this: will Spark create a copy of the file on each node?
What I want is for each node to read the files that are present in a directory, if that directory is present on that node.
Sorry, I am a bit confused about how to handle this scenario in Spark.
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

Hadoop Load and Store

When I try to run a Pig script that has two "store" statements writing to the same location, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; it does not proceed after showing 50% done.
Is this wrong? Can't we store both results in the same file (folder)?
HDFS does not have an append mode, so in most cases where you run MapReduce programs, the output file is opened once, data is written, and the file is closed. With this approach you cannot have two jobs writing to the same file at the same time.
Try writing to separate files and check whether the MapReduce programs still hang. If they do, there is some other issue.
You can examine the results and the MapReduce logs to analyze what went wrong.
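For example, a minimal sketch of the separate-files approach, reusing the relation names from the question (the output directory names are only illustrative, and neither directory may already exist):
-- each STORE writes to its own output directory instead of the shared 'out'
STORE Alert_Message_Count INTO 'alert_count_out' USING PigStorage('\t');
STORE Warning_Message_Count INTO 'warning_count_out' USING PigStorage('\t');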
[Edit:]
You cannot write to the same file or append to an existing file; the HDFS append feature is still a work in progress.
To work on this you can do two things:
1) If you have the same schema content in both Alert_Message_Count and Warning_Message_Count, you could use union as suggested by Chris.
2) Do post-processing when the schemas are not the same; that is, write a MapReduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of MapReduce steps) - but I would expect some form of error message rather than the job just hanging.
If you open the cluster job tracker, and look at the logs for the task, does the log yield anything of note which can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

Pig non-aggregated warnings output location?

Pig: 0.8.1-cdh3u2
Hadoop: 0.20.2-cdh3u0
I am debugging FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings, but I can't seem to get the individual warnings printed anywhere. Disabling aggregation via the -w or aggregate.warnings=false switch removes the summary messages, but it removes the actual warnings too, so I can't see which type conversion failed.
There is nothing written in Pig's log for this run, and I can't locate any logs containing the individual warnings. Did I miss something obvious, or does it simply not work?
Hadoop job logs are recorded locally on each compute node, so you first need to set up your Hadoop cluster manager to collect the log files onto the distributed file system so that you can analyse them. If you use Hadoop On Demand (http://hadoop.apache.org/docs/r0.17.0/hod.html) you should be able to do that by specifying something like:
log-destination-uri = hdfs://host123:45678/user/hod/logs
See the HOD documentation at http://hadoop.apache.org/docs/r0.17.0/hod_user_guide.html#Collecting+and+Viewing+Hadoop+Logs
After you have the logs on HDFS you can run a simple PIG query to find the offending conversion. Something like the following should do the trick:
a1= LOAD '*.log' USING PigStorage(']') ;
a2= FILTER a1 by ($1 MATCHES ' WARN.*Unable to interpret value.*');
dump a2;
It's difficult to find which data value is causing the issue, but at least you can find which column is creating it. Once you find the column, you can use a dynamic invoker, which may help with the type conversion.
How to use a dynamic invoker:
DEFINE ConvertToDouble InvokeForDouble('java.lang.Double.parseDouble', 'String');
ConvertToDouble(column_name);
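A minimal, self-contained sketch of using the invoker inside a script (the input path, relation, and field names are assumptions for illustration):
-- wrap java.lang.Double.parseDouble as a Pig function
DEFINE ConvertToDouble InvokeForDouble('java.lang.Double.parseDouble', 'String');
-- load the suspect column as chararray and convert it explicitly
raw = LOAD 'input/data.txt' USING PigStorage('\t') AS (id:chararray, amount:chararray);
converted = FOREACH raw GENERATE id, ConvertToDouble(amount) AS amount_d;
DUMP converted;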

How do you deal with empty or missing input files in Apache Pig?

Our workflow uses an AWS Elastic MapReduce cluster to run a series of Pig jobs that manipulate a large amount of data into aggregated reports. Unfortunately, the input data is potentially inconsistent, and can result in no input files, or 0-byte files, being given to the pipeline, or even being produced by some stages of the pipeline.
During a LOAD statement, Pig fails spectacularly if it either doesn't find any input files or any of the input files are 0 bytes.
Is there any good way to work around this (hopefully within the Pig configuration or script or the Hadoop cluster configuration, without writing a custom loader...)?
(Since we're using AWS elastic map reduce, we're stuck with Pig 0.6.0 and Hadoop 0.20.)
(For posterity, a sub-par solution we've come up with:)
To deal with the 0-byte problem, we've found that we can detect the situation and instead insert a file with a single newline. This causes a message like:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 13 time(s).
but at least Pig doesn't crash with an exception.
Alternatively, we could produce a line with the appropriate number of '\t' characters for that file which would avoid the warning, but it would insert garbage into the data that we would then have to filter out.
These same ideas could be used to solve the no input files condition by creating a dummy file, but it has the same downsides as are listed above.
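If we go down the dummy-line route, the garbage rows can be filtered back out at the top of the script. A minimal sketch, assuming tab-delimited input and illustrative field names (a dummy line typically loads as null or empty fields):
-- drop rows whose key field is missing, which is what the dummy lines produce
raw = LOAD 'input/part*' USING PigStorage('\t') AS (id:chararray, value:chararray);
clean = FILTER raw BY (id IS NOT NULL) AND (id != '');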
The approach I've been using is to run Pig scripts from a shell script. I have one job that gets data from six different input directories, so I've written a fragment for each input.
The shell script checks for the existence of each input and assembles a final Pig script from the applicable fragments.
It then executes the final Pig script. I know it's a bit of a Rube Goldberg approach, but so far so good. :-)
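For what it's worth, a minimal sketch of what such an assembled script might end up looking like, assuming two of the input directories exist; all paths, relation names, and the schema are illustrative:
-- fragment for the first input directory (the shell includes it only if the directory exists)
in_dir1 = LOAD 'input/dir1' USING PigStorage('\t') AS (id:chararray, value:chararray);
-- fragment for the second input directory
in_dir2 = LOAD 'input/dir2' USING PigStorage('\t') AS (id:chararray, value:chararray);
-- the final assembled script combines whichever relations the shell included
all_data = UNION in_dir1, in_dir2;
STORE all_data INTO 'combined_out';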
