Accessing Distributed Cache in Pig StoreFunc - hadoop

I have looked at all the other threads on this topic and still have not found an answer...
Put simply, I want to access hadoop distributed cache from a Pig StoreFunc, and NOT from within a UDF directly.
Relevant PIG code lines:
DEFINE CustomStorage KeyValStorage('param1','param2','param3');
...
STORE BLAH INTO '/path/' USING CustomStorage();
Relevant Java Code:
public class KeyValStorage<M extends Message> extends BaseStoreFunc /* ElephantBird Storage which inherits from StoreFunc */ {
    ...
    public KeyValStorage(String param1, String param2, String param3) {
        ...
        try {
            InputStream is = new FileInputStream(configName);
            try {
                prop.load(is);
            } catch (IOException e) {
                System.out.println("PROPERTY LOADING FAILED");
                e.printStackTrace();
            }
        } catch (FileNotFoundException e) {
            System.out.println("FILE NOT FOUND");
            e.printStackTrace();
        }
    }
    ...
}
configName is the name of the LOCAL file that I should be able to read from the distributed cache; however, I am getting a FileNotFoundException. When I use the EXACT same code from within a Pig UDF directly, the file is found, so I know the file is being shipped via the distributed cache. I set the appropriate param to make sure this happens:
<property><name>mapred.cache.files</name><value>/path/to/file/file.properties#configName</value></property>
Any ideas how I can get around this?
Thanks!

StoreFunc's constructor is called on both the frontend and the backend. When it is called from the frontend (before the job is launched), you'll get a FileNotFoundException because at that point the files from the distributed cache have not yet been copied to the nodes' local disks.
You may check whether you are on the backend (when the job is being executed) and load the file only in that case, e.g.:
DEFINE CustomStorage KeyValStorage('param1','param2','param3');
set mapreduce.job.cache.files hdfs://host/user/cache/file.txt#config
...
STORE BLAH INTO '/path/' USING CustomStorage();
public KeyValStorage(String param1, String param2, String param3) {
    ...
    try {
        // Only on the backend: the cached file has been localized next to the task
        if (!UDFContext.getUDFContext().isFrontend()) {
            InputStream is = new FileInputStream("./config");
            BufferedReader br = new BufferedReader(new InputStreamReader(is));
            ...
        }
    ...
    ...
}
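For reference, here is a minimal self-contained sketch of that backend-only loading pattern, pulled out into a helper so it compiles on its own (the class name CachedConfigLoader and the method name loadIfBackend are illustrative, not part of the original answer; the StoreFunc would simply call it from its constructor):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;
import org.apache.pig.impl.util.UDFContext;

public class CachedConfigLoader {
    /**
     * Loads a properties file shipped through the distributed cache, but only when
     * running on the backend (inside a task), where the file has been localized.
     * @param symlink the symlink name given after '#' in mapreduce.job.cache.files, e.g. "config"
     */
    public static Properties loadIfBackend(String symlink) throws IOException {
        Properties prop = new Properties();
        if (!UDFContext.getUDFContext().isFrontend()) {
            // The localized file sits in the task's working directory under the symlink name
            try (BufferedReader br = new BufferedReader(new FileReader("./" + symlink))) {
                prop.load(br);
            }
        }
        return prop;
    }
}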

Related

apache VFS2 uriStyle - root absolute path ends with double slash

While working on an FTP server with the VFS2 library, I noticed that I had to enable VFS.setUriStyle(true) so the library would change the working directory to the parent directory of the target file I am operating on (cwd directoryName).
But if UriStyle is enabled, everything is resolved relative to the root, which would not be a problem if the root were not "//".
The class GenericFileName sets the absolutePath of the root to "/", which makes the method getPath() return "/" + getUriTrailer(), which in the case of the root always returns "//". Everything that is resolved relative to "//" has two dots preceding its path.
Which means if I execute the following code:
public class RemoteFileTest {
public static void main(String[] args) {
// Options for a RemoteFileObject connection
VFS.setUriStyle(true);
FileSystemOptions options = new FileSystemOptions();
// we are doing an FTP connection, hence we use the FtpFileSystemConfigBuilder
// we want to work in passive mode
FtpFileSystemConfigBuilder.getInstance().setPassiveMode(options, true);
FtpFileSystemConfigBuilder.getInstance().setUserDirIsRoot(options, false);
// DefaultFileSystemConfigBuilder.getInstance().setRootURI(options, "/newRoot/");
// System.out.println(DefaultFileSystemConfigBuilder.getInstance().getRootURI(options));
// ftp://localhost:21/
StaticUserAuthenticator auth = new StaticUserAuthenticator("", "user", "pass");
try {
DefaultFileSystemConfigBuilder.getInstance().setUserAuthenticator(options, auth);
} catch (FileSystemException e) {
e.printStackTrace();
return;
}
// A FileSystemManager creates an abstract FileObject linked to our desired remote file.
// That link is just simulated and not yet real.
FileSystemManager manager;
try {
manager = VFS.getManager();
} catch (FileSystemException e) {
e.printStackTrace();
return;
}
try (FileObject remoteFile = manager.resolveFile("ftp://localhost:21/sub_folder/test.txt", options)) {
System.out.println("Is Folder " + remoteFile.isFolder());
System.out.println("Is File " + remoteFile.isFile());
} catch (FileSystemException e) {
// TODO Auto-generated catch block
e.printStackTrace();
return;
}
}}
I receive this interaction with the ftp server:
USER user
PASS ****
TYPE I
CWD //
SYST
PASV
LIST ..sub_folder/
PWD
CWD ..sub_folder/
I want the interaction to be just like this, but without the two dots in front of the directory.
Kind regards
Barry
Fixed it as described below:
Disabled uriStyle again.
Wrote my own VFS class which creates my custom-written manager.
That manager overrides the FtpFileProvider with my custom one, which simply sets the root to a custom-selected one, causing the desired behaviour.
import org.apache.commons.vfs2.FileName;
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystem;
import org.apache.commons.vfs2.FileSystemException;
import org.apache.commons.vfs2.FileSystemOptions;
import org.apache.commons.vfs2.impl.DefaultFileSystemConfigBuilder;
import org.apache.commons.vfs2.provider.ftp.FtpFileProvider;
public class AdvancedFtpFileProvider extends FtpFileProvider {
public AdvancedFtpFileProvider() {
super();
// setFileNameParser(AdvancedFtpFileNameParser.getInstance());
}
@Override
protected FileObject findFile(FileName name, FileSystemOptions fileSystemOptions) throws FileSystemException {
// Check in the cache for the file system
// getContext().getFileSystemManager().resolveName... resolves the configured RootUri relative to the selected root (name.getRoot()). This calls cwd to the selected root and operates from there with relative URLs towards the new root!
final FileName rootName = getContext().getFileSystemManager().resolveName(name.getRoot(), DefaultFileSystemConfigBuilder.getInstance().getRootURI(fileSystemOptions));
final FileSystem fs = getFileSystem(rootName, fileSystemOptions);
// Locate the file
// return fs.resolveFile(name.getPath());
return fs.resolveFile(name);
}
}
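For completeness, a minimal sketch of how such a custom provider could be plugged into a manually built manager (the class name AdvancedVFS and the decision to construct a DefaultFileSystemManager by hand are my assumptions, not part of the original fix):
import org.apache.commons.vfs2.FileSystemException;
import org.apache.commons.vfs2.impl.DefaultFileSystemManager;

public class AdvancedVFS {
    // Builds a manager that routes "ftp" URIs through the custom provider
    // instead of the stock FtpFileProvider registered by VFS.getManager().
    public static DefaultFileSystemManager createManager() throws FileSystemException {
        DefaultFileSystemManager manager = new DefaultFileSystemManager();
        manager.addProvider("ftp", new AdvancedFtpFileProvider());
        manager.init();
        return manager;
    }
}
Calls to manager.resolveFile(uri, options) then resolve against the custom-selected root described above.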
Came across this question because I was having the same issue with the following
ftp://user:pass@host//home/user/file.txt
becoming... (note the single slash before 'home')
ftp://user:pass@host/home/user/file.txt
I did this to solve the issue...
// Setup some options, add as many as you need
FileSystemOptions opts = new FileSystemOptions( );
// This line tells VFS to treat the URI as the absolute path and not relative
FtpsFileSystemConfigBuilder.getInstance( ).setUserDirIsRoot( opts, false );
// Retrieve the file from the remote FTP server
FileObject realFileObject = fileSystemManager.resolveFile( fileSystemUri, opts );
I hope this can help someone, if not then provide a reference for the next time this stumps me.

Log all methods called in an app by Xposed

As the title says, I want to use Xposed to log all methods called in an app from when it starts until I stop it. I only want to log the class name and method name; I don't want to hook every method individually.
I tried this code, but I get an error that getMethod is not found.
findAndHookMethod("java.lang.Class", lpparam.classLoader, "getMethod", String.class, Object.class, new XC_MethodHook()
Thanks in advance!
There is no one-line solution like the one you seem to be searching for.
Hooking all methods will let you log which methods were called by the app from start to stop (sort of - see below), but if (for some reason) you don't want to hook all methods, the only solution I can think of is modifying the Java VM itself (NOT something I would recommend).
A solution that (sort of) works
What I did was first use apktool to decompile my APK and get the names of all the methods in all the classes.
Then I used Xposed to hook into every single method of every class and print the current function name to the debug log.
Why it only sort of works
Xposed has an overhead whenever it hooks a method. For general usage of Xposed apps, it isn't much. But when you start hooking each and every method of an app, the overhead very quickly becomes ridiculously large - so much so that while the above method works for small apps, for any large app it very quickly causes the app to hang and then crash.
An alternative that also sort-of works
FRIDA is a way to inject JavaScript into native apps. Here they show you how to log all function calls. While in the above link they log all function calls in a piece of Python code, the same code also works for Android.
There is a way to log all Java methods: modify XposedBridge.
Xposed hooks Java methods through XposedBridge.java's method
"handleHookedMethod(Member method, int originalMethodId, Object additionalInfoObj, Object thisObject, Object[] args)"
Add a logging line there, for example:
Log.v(TAG, "className " + method.getDeclaringClass().getName() + ",methodName " + method.getName());
As mentioned before, Xposed is not the way to go in this situation due to its overhead.
The simplest solution is just to use dmtracedump as provided by Google. Most x86 Android images and emulators come with the debuggable flag on (ro.debuggable), so you can even use it for closed-source apps.
Additionally, other tools such as Emma are known to work with Android as well, but these might need modifications to the source code.
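If you can modify (or hook) the app's entry points, the trace files that dmtracedump and traceview consume can be produced with the standard android.os.Debug API; a rough sketch (the activity and trace names are illustrative):
import android.app.Activity;
import android.os.Bundle;
import android.os.Debug;

public class TracedActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        // Start writing all_calls.trace (location depends on the Android version,
        // typically under external storage or the app's external files directory)
        Debug.startMethodTracing("all_calls");
    }

    @Override
    protected void onDestroy() {
        // Stop tracing; feed the resulting .trace file to dmtracedump/traceview
        Debug.stopMethodTracing();
        super.onDestroy();
    }
}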
I found a solution.
See this code snippet below.
package com.kyunggi.logcalls;
import android.content.pm.*;
import android.util.*;
import dalvik.system.*;
import de.robv.android.xposed.*;
import de.robv.android.xposed.callbacks.XC_LoadPackage.*;
import java.io.*;
import java.lang.reflect.*;
import java.util.*;
import static de.robv.android.xposed.XposedHelpers.findAndHookMethod;
import android.app.*;
public class Main implements IXposedHookLoadPackage {
private String TAG = "LogCall";
public void handleLoadPackage(final LoadPackageParam lpparam) throws Throwable {
if (!lpparam.packageName.equals("com.android.bluetooth")) {
Log.i(TAG, "Not: " + lpparam.packageName);
return;
}
Log.i(TAG, "Yes " + lpparam.packageName);
//Modified https://d3adend.org/blog/?p=589
ApplicationInfo applicationInfo = AndroidAppHelper.currentApplicationInfo();
if (applicationInfo.processName.equals("com.android.bluetooth")) {
Set<String> classes = new HashSet<>();
DexFile dex;
try {
dex = new DexFile(applicationInfo.sourceDir);
Enumeration entries = dex.entries();
while (entries.hasMoreElements()) {
String entry = (String) entries.nextElement();
classes.add(entry);
}
dex.close();
} catch (IOException e) {
Log.e("HookDetection", e.toString());
}
for (String className : classes) {
boolean obex = false;
if (className.startsWith("com.android.bluetooth") || (obex = className.startsWith("javax.obex"))) {
try {
final Class clazz = lpparam.classLoader.loadClass(className);
for (final Method method : clazz.getDeclaredMethods()) {
if (obex) {
if (!Modifier.isPublic(method.getModifiers())) {
continue; //on javax.obex package, hook only public APIs
}
}
XposedBridge.hookMethod(method, new XC_MethodHook() {
final String methodNam = method.getName();
final String classNam = clazz.getName();
final StringBuilder sb = new StringBuilder("[");
final String logstr = "className " + classNam + ",methodName " + methodNam;
@Override
protected void beforeHookedMethod(MethodHookParam param) throws Throwable {
//Method method=(Method)param.args[0];
sb.setLength(0);
sb.append(logstr);
//Log.v(TAG,logstr);
for (Object o : param.args) {
String typnam = "";
String value = "null";
if (o != null) {
typnam = o.getClass().getName();
value = o.toString();
}
sb.append(typnam).append(" ").append(value).append(", ");
}
sb.append("]");
Log.v(TAG, sb.toString());
}
});
}
} catch (ClassNotFoundException e) {
Log.wtf("HookDetection", e.toString());
}
}
}
}
// ClassLoader rootcl=lpparam.classLoader.getSystemClassLoader();
//findAndHookMethod("de.robv.android.xposed.XposedBridge", rootcl, "handleHookedMethod", Member.class, int.class, Object.class, Object.class, Object[].class, );
}
}

Write data to local disk in each datanode

I want to store some values from a map task to the local disk on each data node. For example,
public void map (...) {
//Process
List<Object> cache = new ArrayList<Object>();
//Add value to cache
//Serialize cache to local file in this data node
}
How can I store this cache object to the local disk on each data node? If I store the cache inside the map function like above, won't the performance be terrible because of the I/O?
I mean, is there any way to wait until the map task on a data node has run completely and only then store the cache to local disk? Or does Hadoop have a function to solve this issue?
Please see the example below; the created file will be somewhere under the directories used by the NodeManager for containers. This is the configuration property yarn.nodemanager.local-dirs in yarn-site.xml, or the default inherited from yarn-default.xml, which is under /tmp.
Please see @Chris Nauroth's answer, which says that this is just for debugging purposes and is not recommended as a permanent production configuration. It clearly describes why it is not recommended.
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
// do some hadoop stuff, like counting words
String path = "newFile.txt";
try {
File f = new File(path);
f.createNewFile();
} catch (IOException e) {
System.out.println("Message easy to look up in the logs.");
System.err.println("Error easy to look up in the logs.");
e.printStackTrace();
throw e;
}
}
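Regarding the question's worry about doing the write inside map() itself: a common pattern (my own sketch using the standard new-API Mapper, not part of the answer above) is to buffer values in memory during map() and write them once in cleanup(), which runs after the last input record of the task:
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private final List<String> cache = new ArrayList<String>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Normal map work goes here; just remember whatever needs to be written later
        cache.add(value.toString());
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once per map task, after all records: one write instead of one per record.
        // The file lands in the container's working directory under yarn.nodemanager.local-dirs.
        try (PrintWriter out = new PrintWriter(new File("mapTaskCache.txt"))) {
            for (String line : cache) {
                out.println(line);
            }
        }
    }
}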

Profiling Hadoop

UPDATE:
I had mailed Shevek, the founder of Karmasphere, for help. He had given a presentation on Hadoop profiling at ApacheCon 2011. He advised me to look for Throwable. The catch block for Throwable shows:
localhost: java.lang.IncompatibleClassChangeError: class com.kannan.mentor.sourcewalker.ClassInfoGatherer has interface org.objectweb.asm.ClassVisitor as super class
localhost: at java.lang.ClassLoader.defineClass1(Native Method)
localhost: at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
Hadoop has the ASM 3.2 jar and I am using 5.0. In 5.0, ClassVisitor is a superclass and in 3.2 it is an interface. I am planning to change my profiler to 3.2. Is there any better way to fix this issue?
BTW, Shevek is super cool. A founder and CEO, responding to some anonymous guy's emails. Imagine that.
END UPDATE
I am trying to profile Hadoop (JobTracker, NameNode, DataNode, etc.). I created a profiler using ASM 5 and tested it on Spring, where everything works fine.
Then I tested the profiler on Hadoop in pseudo-distributed mode.
@Override
public byte[] transform(ClassLoader loader, String className,
Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
byte[] classfileBuffer) throws IllegalClassFormatException {
try {
/*1*/ System.out.println(" inside transformer " + className);
ClassReader cr = new ClassReader(classfileBuffer);
ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_MAXS);
/* c-start */ // CheckClassAdapter cxa = new CheckClassAdapter(cw);
ClassVisitor cv = new ClassInfoGatherer(cw);
/* c-end */ cr.accept(cv, ClassReader.EXPAND_FRAMES);
byte[] b = cw.toByteArray();
/*2*/System.out.println(" inside transformer - returning" + b.length);
return b;
} catch (Exception e) {
System.out.println( " class might not be found " + e.getMessage()) ;
try {
throw new ClassNotFoundException(className, e);
} catch (ClassNotFoundException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
}
return null;
}
I can see the first sysout statement printed but not the second one. There is no error either. If I comment out the lines from /* c-start */ to /* c-end */ and replace cw with classfileBuffer, I can see the second sysout statement. The moment I uncomment the line
ClassVisitor cv = new ClassInfoGatherer(cw);
ClassInfoGatherer constructor:
public ClassInfoGatherer(ClassVisitor cv) {
super(ASM5, cv);
}
I am not seeing the second sysout statement.
What am I doing wrong here? Is Hadoop swallowing my sysouts? I tried System.err too. Even if that is the case, why can I see the first sysout statement?
Any suggestion would be helpful. I think I am missing something simple and obvious here... but can't figure it out.
The following lines were added to hadoop-env.sh:
export HADOOP_NAMENODE_OPTS="-javaagent:path to jar $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-javaagent:path to jar $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-javaagent:path to jar $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-javaagent:path to jar $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-javaagent:path to jar $HADOOP_JOBTRACKER_OPTS"
Hadoop had ASM 3.2 and I was using ASM 5. In ASM 5, ClassVisitor is a superclass and in 3.2 it is an interface. For some reason, the error was a Throwable (credits to Shevek) and the catch block was only catching Exceptions. The Throwable error wasn't captured in any of the Hadoop logs, so it was very tough to debug.
I used Jar Jar Links to repackage ASM and fix the version issues, and everything works fine now.
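The Jar Jar Links rule for that kind of repackaging looks roughly like this (the target package myprofiler.shaded.asm is an assumption; it depends on how you want to name the relocated classes):
rule org.objectweb.asm.** myprofiler.shaded.asm.@1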
If you are using Hadoop, something is not working, and there are no logs showing any errors, then please try to catch Throwable.
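For example, the transformer from the question could widen its catch block along these lines (a sketch only; the ASM instrumentation itself stays as posted above, and the class name SafeTransformer is illustrative):
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.IllegalClassFormatException;
import java.security.ProtectionDomain;

public class SafeTransformer implements ClassFileTransformer {
    @Override
    public byte[] transform(ClassLoader loader, String className,
            Class<?> classBeingRedefined, ProtectionDomain protectionDomain,
            byte[] classfileBuffer) throws IllegalClassFormatException {
        try {
            // ... run the ASM ClassReader/ClassWriter instrumentation here ...
            return classfileBuffer;
        } catch (Throwable t) {
            // Catch Throwable, not Exception: errors such as IncompatibleClassChangeError
            // would otherwise be swallowed without ever reaching the Hadoop logs.
            System.err.println("transform failed for " + className + ": " + t);
            t.printStackTrace();
            return null; // null keeps the original, uninstrumented bytes
        }
    }
}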
Arun

BIRT csv emitter and EJB

I want to use the CSV emitter plugin in a Java EE application. Is it possible? I get the following error:
org.eclipse.birt.report.engine.api.UnsupportedFormatException: The output format csv is not supported.
at org.eclipse.birt.report.engine.api.impl.EngineTask.setupRenderOption(EngineTask.java:2047)
at org.eclipse.birt.report.engine.api.impl.RunAndRenderTask.doRun(RunAndRenderTask.java:96)
at org.eclipse.birt.report.engine.api.impl.RunAndRenderTask.run(RunAndRenderTask.java:77)
My code:
protected String generateReportFile(IRunAndRenderTask task, IReportRunnable design, IReportEngine engine, String reportType, String reportPrefix, String baseDir) throws BirtReportGenerationFault {
CSVRenderOption csvOptions = new CSVRenderOption();
csvOptions.setOutputFormat(CSVRenderOption.OUTPUT_FORMAT_CSV);
csvOptions.setOutputFileName("C:/birt/logs/csvTestW.csv");
csvOptions.setShowDatatypeInSecondRow(false);
csvOptions.setExportTableByName("data");
csvOptions.setDelimiter("\t");
csvOptions.setReplaceDelimiterInsideTextWith("-");
task.setRenderOption(csvOptions);
task.setEmitterID("org.eclipse.birt.report.engine.emitter.csv");
try {
task.run();// Error here
} catch (EngineException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
task.close();
return "C:/birt/logs/csvTestW.csv";//fileName;
}
Same code works in a Java SE app.
I had the same problem, but with the PDF format. I solved it by adding org.eclipse.birt.report.engine.emitter.pdf to the plugin dependencies.
I think the problem here is the case of the output format passed to CSVRenderOption.
Instead of using csvOptions.setOutputFormat(CSVRenderOption.OUTPUT_FORMAT_CSV);
try using csvOptions.setOutputFormat("CSV");
