Profiling Hadoop

UPDATE:
I had emailed Shevek, founder of Karmasphere, for help. He had given a presentation on Hadoop profiling at ApacheCon 2011. He advised me to catch Throwable. The catch block for Throwable shows:
localhost: java.lang.IncompatibleClassChangeError: class com.kannan.mentor.sourcewalker.ClassInfoGatherer has interface org.objectweb.asm.ClassVisitor as super class
localhost: at java.lang.ClassLoader.defineClass1(Native Method)
localhost: at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
Hadoop ships with the ASM 3.2 jar and I am using ASM 5.0. In 5.0, ClassVisitor is an abstract super class, and in 3.2 it is an interface. I am planning to change my profiler to 3.2. Is there a better way to fix this issue?
BTW, Shevek is super cool. A founder and CEO, responding to some anonymous guy's emails. Imagine that.
END UPDATE
I am trying to profile Hadoop (JobTracker, NameNode, DataNode, etc.). I created a profiler using ASM 5 and tested it on Spring, where everything works fine.
I then tested the profiler on Hadoop in pseudo-distributed mode.
@Override
public byte[] transform(ClassLoader loader, String className,
        Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
        byte[] classfileBuffer) throws IllegalClassFormatException {
    try {
        /*1*/ System.out.println(" inside transformer " + className);
        ClassReader cr = new ClassReader(classfileBuffer);
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS);
        /* c-start */ // CheckClassAdapter cxa = new CheckClassAdapter(cw);
        ClassVisitor cv = new ClassInfoGatherer(cw);
        /* c-end */ cr.accept(cv, ClassReader.EXPAND_FRAMES);
        byte[] b = cw.toByteArray();
        /*2*/ System.out.println(" inside transformer - returning " + b.length);
        return b;
    } catch (Exception e) {
        System.out.println(" class might not be found " + e.getMessage());
        try {
            throw new ClassNotFoundException(className, e);
        } catch (ClassNotFoundException e1) {
            // TODO Auto-generated catch block
            e1.printStackTrace();
        }
    }
    return null;
}
I can see the first sysout statement printed but not the second one, and there is no error either. If I comment out the section from /* c-start */ to /* c-end */ and replace cw with classfileBuffer, I can see the second sysout statement. The moment I uncomment the line
ClassVisitor cv = new ClassInfoGatherer(cw);
the second sysout statement disappears. The ClassInfoGatherer constructor is:
public ClassInfoGatherer(ClassVisitor cv) {
    super(ASM5, cv);
}
What am I doing wrong here? Is Hadoop swallowing my sysouts? I tried System.err too. Even if that is the case, why can I see the first sysout statement?
Any suggestion would be helpful. I think I am missing something simple and obvious here, but I can't figure it out.
The following lines were added to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-javaagent:path to jar $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-javaagent:path to jar $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-javaagent:path to jar $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-javaagent:path to jar $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-javaagent:path to jar $HADOOP_JOBTRACKER_OPTS"
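For context, the agent jar referenced by the -javaagent options above is registered through a standard premain entry point; a minimal sketch (the class names here are illustrative, not the exact ones from my profiler, and the jar's MANIFEST.MF must declare the Premain-Class):
import java.lang.instrument.Instrumentation;

public class ProfilerAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        // ProfilingTransformer is a placeholder for the ClassFileTransformer
        // whose transform() method is shown above
        inst.addTransformer(new ProfilingTransformer());
    }
}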

Hadoop ships with ASM 3.2 and I was using ASM 5. In ASM 5, ClassVisitor is an abstract super class, while in 3.2 it is an interface. The error turned out to be a Throwable (credit to Shevek), and my catch block was only catching Exception. The Throwable was not captured in any of the Hadoop logs, so it was very tough to debug.
I used Jar Jar Links to repackage ASM and fix the version clash, and everything works fine now.
If you are using Hadoop, something is not working, and there are no logs showing any errors, then try catching Throwable.
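For illustration, the widened catch in the transform method above would look roughly like this (a sketch; only the error handling changes):
try {
    ClassReader cr = new ClassReader(classfileBuffer);
    ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS);
    ClassVisitor cv = new ClassInfoGatherer(cw);
    cr.accept(cv, ClassReader.EXPAND_FRAMES);
    return cw.toByteArray();
} catch (Throwable t) {
    // Errors such as IncompatibleClassChangeError are not Exceptions, so a
    // catch (Exception e) block silently misses them inside a transformer
    t.printStackTrace();
    // Returning null tells the JVM to keep the original, untransformed bytes
    return null;
}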
Arun

Related

Runtime.exec() method always returns empty in Spring environment

I tried to execute the Runtime.exec(…) method in my Spring application and I always get an empty result from getInputStream(). But it works well if I execute it as a core Java application. Is anything else needed to execute this in a Spring environment? Thanks in advance.
Spring Version : 4.2.1.RELEASE
commons-io Version : 2.4
try {
    Process p = Runtime.getRuntime().exec(new String[] {"sh", "-c", "spamc < abcd.txt"});
    String spamResponsexx = IOUtils.toString(p.getInputStream());
    log.debug("Spam Response : " + spamResponsexx);
} catch (Exception e) {
    e.printStackTrace();
    log.error("", e);
}
This may be a user-privileges issue; I have worked on a similar kind of scenario.
In this case both services should run as the same user (XYZ).
Otherwise, the 'spamc' library files' location should be set in that user's (XYZ) home '.bashrc', and '.bashrc' re-sourced.
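Either way, it helps to read the process's error stream and exit code so that permission or PATH problems actually show up in the logs; a sketch that drops into the same try/catch as the code in the question (fine for spamc's small output; large output would need the streams drained on separate threads):
Process p = Runtime.getRuntime().exec(new String[] {"sh", "-c", "spamc < abcd.txt"});
String stdout = IOUtils.toString(p.getInputStream());
String stderr = IOUtils.toString(p.getErrorStream());
int exitCode = p.waitFor();
log.debug("Spam Response : " + stdout);
if (exitCode != 0) {
    // Typically: spamc not on the PATH for this user, or abcd.txt not readable
    log.error("spamc exited with code " + exitCode + " : " + stderr);
}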

Accessing Distributed Cache in Pig StoreFunc

I have looked at all the other threads on this topic and still have not found an answer...
Put simply, I want to access the Hadoop distributed cache from a Pig StoreFunc, and NOT from within a UDF directly.
Relevant Pig code:
DEFINE CustomStorage KeyValStorage('param1','param2','param3');
...
STORE BLAH INTO /path/ using CustomStorage();
Relevant Java Code:
public class KeyValStorage<M extends Message> extends BaseStoreFunc /* ElephantBird storage which inherits from StoreFunc */ {
    ...
    public KeyValStorage(String param1, String param2, String param3) {
        ...
        try {
            InputStream is = new FileInputStream(configName);
            try {
                prop.load(is);
            } catch (IOException e) {
                System.out.println("PROPERTY LOADING FAILED");
                e.printStackTrace();
            }
        } catch (FileNotFoundException e) {
            System.out.println("FILE NOT FOUND");
            e.printStackTrace();
        }
    }
    ...
}
configName is the name of the LOCAL file that I should be able to read from the distributed cache; however, I am getting a FileNotFoundException. When I use the EXACT same code from within a Pig UDF directly, the file is found, so I know the file is being shipped via the distributed cache. I set the appropriate property to make sure this happens:
<property><name>mapred.cache.files</name><value>/path/to/file/file.properties#configName</value></property>
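(For reference, the same cache entry can also be registered from the job driver with the classic DistributedCache API; a sketch, with conf being the job's Configuration and assuming the driver method declares or handles the checked URI exception:)
// The fragment after '#' becomes the symlink name on the task's local disk
DistributedCache.addCacheFile(new URI("/path/to/file/file.properties#configName"), conf);
DistributedCache.createSymlink(conf);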
Any ideas how I can get around this?
Thanks!
StoreFunc's constructor is called both on the frontend and on the backend. When it is called from the frontend (before the job is launched), you'll get a FileNotFoundException because at that point the files from the distributed cache have not yet been copied to the nodes' local disks.
You may check whether you are on the backend (when the job is being executed) and load the file only in that case, e.g.:
DEFINE CustomStorage KeyValStorage('param1','param2','param3');
set mapreduce.job.cache.files hdfs://host/user/cache/file.txt#config
...
STORE BLAH INTO /path/ using CustomStorage();
public KeyValStorage(String param1, String param2, String param3) {
    ...
    try {
        if (!UDFContext.getUDFContext().isFrontend()) {
            InputStream is = new FileInputStream("./config");
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            ...
            ...
        }
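Filled out, and using the Properties loading from the question instead of the BufferedReader, the backend-only load might look roughly like this (a sketch under the same assumptions):
public KeyValStorage(String param1, String param2, String param3) {
    // On the frontend the cache file is not on local disk yet, so skip loading there
    if (!UDFContext.getUDFContext().isFrontend()) {
        try {
            InputStream is = new FileInputStream(configName);
            prop.load(is);
            is.close();
        } catch (IOException e) {
            // FileNotFoundException is an IOException, so this covers both failure cases
            System.out.println("PROPERTY LOADING FAILED");
            e.printStackTrace();
        }
    }
}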

BIRT csv emitter and EJB

I want to use the CSV emitter plugin in a Java EE application. Is it possible? I get the following error:
org.eclipse.birt.report.engine.api.UnsupportedFormatException: The output format csv is not supported.
at org.eclipse.birt.report.engine.api.impl.EngineTask.setupRenderOption(EngineTask.java:2047)
at org.eclipse.birt.report.engine.api.impl.RunAndRenderTask.doRun(RunAndRenderTask.java:96)
at org.eclipse.birt.report.engine.api.impl.RunAndRenderTask.run(RunAndRenderTask.java:77)
My code:
protected String generateReportFile(IRunAndRenderTask task, IReportRunnable design, IReportEngine engine,
        String reportType, String reportPrefix, String baseDir) throws BirtReportGenerationFault {
    CSVRenderOption csvOptions = new CSVRenderOption();
    csvOptions.setOutputFormat(CSVRenderOption.OUTPUT_FORMAT_CSV);
    csvOptions.setOutputFileName("C:/birt/logs/csvTestW.csv");
    csvOptions.setShowDatatypeInSecondRow(false);
    csvOptions.setExportTableByName("data");
    csvOptions.setDelimiter("\t");
    csvOptions.setReplaceDelimiterInsideTextWith("-");
    task.setRenderOption(csvOptions);
    task.setEmitterID("org.eclipse.birt.report.engine.emitter.csv");
    try {
        task.run(); // Error here
    } catch (EngineException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    task.close();
    return "C:/birt/logs/csvTestW.csv"; // fileName
}
Same code works in a Java SE app.
I had the same problem, but with the PDF format. I solved it by adding org.eclipse.birt.report.engine.emitter.pdf to the plugin dependencies.
I think the problem here is the case of the output format passed to CSVRenderOption.
Instead of using csvOptions.setOutputFormat(CSVRenderOption.OUTPUT_FORMAT_CSV);
try using csvOptions.setOutputFormat("CSV");
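Applied to the code in the question, that suggestion is a one-line change (a sketch, shown with the surrounding option setup for context):
CSVRenderOption csvOptions = new CSVRenderOption();
// Upper-case format name, in case the emitter is registered as "CSV" rather than "csv"
csvOptions.setOutputFormat("CSV");
csvOptions.setOutputFileName("C:/birt/logs/csvTestW.csv");
task.setRenderOption(csvOptions);
task.setEmitterID("org.eclipse.birt.report.engine.emitter.csv");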

How to prevent a Hadoop job from failing on a corrupted input file

I'm running a Hadoop job on many input files.
But if one of the files is corrupted, the whole job fails.
How can I make the job ignore the corrupted file?
Maybe it could write some counter/error log for me, but not fail the whole job.
It depends on where your job is failing: if a line is corrupt, and somewhere in your map method an exception is thrown, then you should just be able to wrap the body of your map method in a try/catch and log the error:
protected void map(LongWritable key, Text value, Context context) {
    try {
        // parse value to a long
        int val = Integer.parseInt(value.toString());
        // do something with key and val..
    } catch (NumberFormatException nfe) {
        // log error and continue
    }
}
But if the error is thrown by your InputFormat's RecordReader then you'll need to amend the mapper's run(..) method, whose default implementation is as follows:
public void run(Context context) {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
So you could amend this to try and catch the exception on the context.nextKeyValue() call, but you have to be careful about just ignoring any errors thrown by the reader: an IOException, for example, may not be 'skippable' by simply ignoring the error.
If you have written your own InputFormat / RecordReader, and you have a specific exception which denotes record failure but will allow you to skip over and continue parsing, then something like this will probably work:
public void run(Context context) {
    setup(context);
    while (true) {
        try {
            if (!context.nextKeyValue()) {
                break;
            } else {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } catch (SkippableRecordException sre) {
            // log error
        }
    }
    cleanup(context);
}
But just to reiterate: your RecordReader must be able to recover on error, otherwise the above code could send you into an infinite loop.
For your specific case, if you just want to ignore a file upon the first failure, then you can update the run method to something much simpler:
public void run(Context context) {
    setup(context);
    try {
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    } catch (Exception e) {
        // log error
    }
}
Some final words of warning:
You need to make sure that it isn't your mapper code which is causing the exception to be thrown, otherwise you'll be ignoring files for the wrong reason.
Files that are supposed to be GZip compressed but are not valid GZip will actually fail during initialization of the record reader, so the above will not catch this type of error (you'd need to write your own record reader implementation). The same is true for any file error thrown during record reader creation.
This is what Failure Traps are used for in Cascading:
Whenever an operation fails and throws an exception, if there is an associated trap, the offending Tuple is saved to the resource specified by the trap Tap. This allows the job to continue processing without any data loss.
This will essentially let your job continue and let you check your corrupt files later
If you are somewhat familiar with Cascading, in your flow definition statement:
new FlowDef().addTrap( String branchName, Tap trap );
Failure Traps
There is also another possible way: you could use the mapred.max.map.failures.percent configuration option. Of course, this way of solving the problem could also hide some other problems occurring during the map phase.
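For example, with the old mapred API the threshold can be set on the JobConf (a sketch; the driver class name and the 10% value are just placeholders):
JobConf conf = new JobConf(MyJobDriver.class);
// Allow up to 10% of map tasks to fail without failing the whole job;
// equivalent to setting mapred.max.map.failures.percent=10
conf.setMaxMapTaskFailuresPercent(10);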

Selenium RC - Getting Null Pointer Exception after executing my test script

I am using an assertTrue statement in my web app test; even though the text is present on the page, it returns false. How can I fix this?
public void Test1() throws Exception {
    Selenium selenium = new DefaultSelenium("localhost", 1777, "*chrome", "http://www.mortgagecalculator.org/");
    selenium.start();
    try {
        selenium.open("/");
        Thread.sleep(3000);
        selenium.windowMaximize();
        selenium.type("name=param[homevalue]", "400000");
        selenium.type("name=param[principal]", "7888800");
        selenium.select("name=param[rp]", "label=New Purchase");
        selenium.type("name=param[interest_rate]", "8");
        selenium.type("name=param[term]", "35");
        selenium.select("name=param[start_month]", "label=May");
        selenium.select("name=param[start_year]", "label=2006");
        selenium.type("name=param[property_tax]", "7.5");
        selenium.type("name=param[pmi]", "0.8");
        selenium.click("css=input[type=\"submit\"]");
        assertTrue(selenium.isTextPresent("$58,531.06"));
        System.out.println("Assert Statement executed");
        selenium.stop();
    } catch (Exception e) {
        System.out.println("In Mortgage Calculator App exception Happened");
    }
}
I've given your test a run and it seems to be fine for me; however, I suspect this has to do with the fact that here:
selenium.click("css=input[type=\"submit\"]");
assertTrue(selenium.isTextPresent("$58,531.06"));
you don't have any kind of wait, so it will only successfully find "$58,531.06" if the page/result loads quickly enough for the test not to have already bombed out by the time it moves on to the assertTrue.
I'd suggest using a 'waitForTextPresent' here instead, or putting something in there to ensure that the result has loaded before you check for it.
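A simple way to do that with Selenium RC is to poll isTextPresent before asserting; a sketch (the timeout and polling interval are arbitrary):
selenium.click("css=input[type=\"submit\"]");
// Poll for up to ~30 seconds for the result to appear
for (int second = 0; ; second++) {
    if (second >= 60) fail("timed out waiting for the result");
    if (selenium.isTextPresent("$58,531.06")) break;
    Thread.sleep(500);
}
assertTrue(selenium.isTextPresent("$58,531.06"));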
Hope that helps.
