Unable to load OpenNLP sentence model in Hadoop map-reduce job - hadoop

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:
public AnalysisFile analyze(String content) {
InputStream modelIn = null;
String[] sentences = null;
// references an absolute path to en-sent.bin
logger.info("sentenceModelPath: " + sentenceModelPath);
try {
modelIn = getClass().getResourceAsStream(sentenceModelPath);
SentenceModel model = new SentenceModel(modelIn);
SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
sentences = sentenceBreaker.sentDetect(content);
} catch (FileNotFoundException e) {
logger.error("Unable to locate sentence model.");
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (modelIn != null) {
try {
modelIn.close();
} catch (IOException e) {
}
}
}
logger.info("number of sentences: " + sentences.length);
<snip>
}
When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:
I've verified that the model file exists in the location sentenceModelPath refers to.
I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
Hadoop is just running on my local machine.
Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?

Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This will try to load a file from the classpath - neither the file in HDFS nor on the client local file system is part of the classpath at mapper / reducer runtime, so this is why you're seeing the Null error (the getResourceAsStream() returns null if the resource cannot be found).
To get around this you have a number of options:
Amend your code to load the file from HDFS:
modelIn = FileSystem.get(context.getConfiguration()).open(
new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies to file from the local file system to HDFS, and back down to the local directory of the running mapper / reducer):
modelIn = new FileInputStream("en-sent.bin");
Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:
modelIn = getClass().getResourceAsStream("/en-sent.bin");</li>

Related

How to get rid of NullPointerException in Flume Interceptor?

I have an interceptor written for Flume code is below:
public Event intercept(Event event) {
byte[] xmlstr = event.getBody();
InputStream instr = new ByteArrayInputStream(xmlstr);
//TransformerFactory factory = TransformerFactory.newInstance(TRANSFORMER_FACTORY_CLASS,TRANSFORMER_FACTORY_CLASS.getClass().getClassLoader());
TransformerFactory factory = TransformerFactory.newInstance();
Source xslt = new StreamSource(new File("removeNs.xslt"));
Transformer transformer = null;
try {
transformer = factory.newTransformer(xslt);
} catch (TransformerConfigurationException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
Source text = new StreamSource(instr);
OutputStream ostr = new ByteArrayOutputStream();
try {
transformer.transform(text, new StreamResult(ostr));
} catch (TransformerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
event.setBody(ostr.toString().getBytes());
return event;
}
I'm removing NameSpace from my source xml with removeNs.xslt file. So that I can store that data into HDFS and later put into hive. When my interceptor run it throw below error :
ERROR org.apache.flume.source.jms.JMSSource: Unexpected error processing events
java.lang.NullPointerException
at test.intercepter.App.intercept(App.java:59)
at test.intercepter.App.intercept(App.java:82)
at org.apache.flume.interceptor.InterceptorChain.intercept(InterceptorChain.java:62)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:146)
at org.apache.flume.source.jms.JMSSource.doProcess(JMSSource.java:258)
at org.apache.flume.source.AbstractPollableSource.process(AbstractPollableSource.java:54)
at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
at java.lang.Thread.run(Thread.java:745)*
Can you suggest me what and where is the problem?
I found the solution. The problem was not anything else than new File("removeNs.xslt"). It was not able to find the location as I's not sure where to keep this file but later I get the flume agent path but as soon as I restart the flume agent it deletes all files which I kept in the flume agent dir. So I changed the code and kept the file material into my java code.

How to fetch data from Hbase table which is running on linux system and java progamme which is run on window Could not locate executable null\bin\

How to fetch data from Hbase table which is running on linux system and java progamme which is run on window Could not locate executable null\bin\
//
this is my code to connect
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.zookeeper.quorum", "192.168.20.129");
conf.set("hbase.zookeeper.property.clientPort", "2181");
conf.set("hbase.master", "192.168.20.129:60010");
Just add this method , call before connecting.
private static void workaround() {
//workaround for = java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
File workaround = new File(".");
System.getProperties().put("hadoop.home.dir", workaround.getAbsolutePath());
new File("./bin").mkdirs();
try {
new File("./bin/winutils.exe").createNewFile();
} catch (IOException e) {
logger.error(e);
}
}

File Overwrite issue when trying to transfer file using FTP

I have a FTP server with only List and Put permissions. But not having delete, overwrite and Rename permissions.
Now when I try to transfer a file using simple FTP using the following implementation
private boolean sendFileStreamHelper(InputStream inputStream, String nameOfFileToStore, String filetransferDestFolder) throws FileTransferException {
Log.info("Inside SendFile inputstream method to trasport the input stream of file " + nameOfFileToStore + " data to " + filetransferDestFolder);
BufferedOutputStream os = null;
FileObject fo = null;
try {
fo = getFileObject(nameOfFileToStore, filetransferDestFolder, ftpAuthDetails.getServerName(), ftpAuthDetails.getUsername(), ftpAuthDetails
.getPassword(), ftpAuthDetails.getPort());
fo.createFile();// create a file in the remote to transfer the file
os = new BufferedOutputStream(fo.getContent().getOutputStream());
FileUtil.readStream(inputStream, os);
return true;
} catch (Exception ex) {
Log.error("File transfer exception occurred while transferrig the file " + nameOfFileToStore + " to " + filetransferDestFolder, ex);
throw new FileTransferException(ex);
} finally {
if (os != null) {
try {
os.flush();
os.close();
} catch (IOException e) {
Log.warn(getClass(), " Error while closing the buffer output stream", e);
}
}
if (fo != null) {
try {
fo.close();
} catch (IOException e) {
Log.warn(getClass(), " Error while closing the File object", e);
}
}
closeCache(); // Close the VFS Manager instance
}
}
In the above code as the File is created in the remote using the File Object instance. Later to that I am trying to write the file with the Buffered stream. Here the systems acts as if it is writing to a file which is already created and as my server is not having any overwrite permission, throwing following error.
29 Jul 2012 21:03:06 [ERROR] FC_ClusteredScheduler_Worker-2(1) com.abc.filetransfer.FileTransferClient - .sendFileStreamHelper(FileTransferClient.java:170) - File transfer exception occurred while transferrig the file *******.txt to / ex-org.apache.commons.vfs2.FileSystemException: Could not write to "ftp://******:***#***.***.***.***/*********.txt"
org.apache.commons.vfs2.FileSystemException: Could not write to "ftp://******:***#***.***.***.***/*********.txt".
at org.apache.commons.vfs2.provider.AbstractFileObject.getOutputStream(AbstractFileObject.java:1439)
at org.apache.commons.vfs2.provider.DefaultFileContent.getOutputStream(DefaultFileContent.java:461)
at org.apache.commons.vfs2.provider.DefaultFileContent.getOutputStream(DefaultFileContent.java:441)
at com.abc.filetransfer.FileTransferClient.sendFileStreamHelper(FileTransferClient.java:164)
at com.abc.filetransfer.FileTransferClient.sendFile(FileTransferClient.java:131)
at com.abc.filetransfer.FileTransferClient.sendFile(FileTransferClient.java:103)
at com.abc.filetransfer.client.FTPTransferClient.sendFile(FTPTransferClient.java:65)
Caused by: org.apache.commons.vfs2.FileSystemException: Cant open output connection for file "ftp://******:***#***.***.***.***/*********.txt".
Reason: "**550 File unavailable. Overwrite not allowed by user profile**^M
at org.apache.commons.vfs2.provider.ftp.FtpFileObject.doGetOutputStream(FtpFileObject.java:648)
at org.apache.commons.vfs2.provider.AbstractFileObject.getOutputStream(AbstractFileObject.java:1431)
Please let me know how can I handle the file transfer using file Object, such that both File creation and writing the stream should happen at once.
I have resolved the issue.
Its pretty straight forward. In the below code
fo = getFileObject(nameOfFileToStore, filetransferDestFolder, ftpAuthDetails.getServerName(), ftpAuthDetails.getUsername(), ftpAuthDetails
.getPassword(), ftpAuthDetails.getPort());
fo.createFile();// create a file in the remote to transfer the file
os = new BufferedOutputStream(fo.getContent().getOutputStream());
FileUtil.readStream(inputStream, os);
I am creating a file first using the FileObject and then trying to write the BOS into the file.
Here while writing BOS to file system considers that we are trying to add data to a already existing file (as I am doing it in two separate steps, Creating a file and writing Data to the same) and returns the Error
**550 File unavailable. Overwrite not allowed by user profile*
I have removed the
fo.createFile()
as the BOS while writing the data will any way create a file if not available.
Thanks for your time.
Purushotham Reddy

Copying files from HDFS to local file system with JAVA

I am trying to copy files from HDFS to local filesystem for preprocessing. The below code should work according to the documentation. Although it doesn't give any error messages and the mapreduce job runs smoothly I can not see any output on my local hard drive. What do you think the problem is? Thanks.
try {
Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/"+value.toString());
Path plocal_input = new Path("/home/hduser/Desktop/"+avlue.toString());
FileSystem fs = FileSystem.get(context.getConfiguration());
fs.copyToLocalFile(phdfs_input, plocal_input);
/* String localoutput_file = "/home/hduser/Destop/output/"+value.toString();
String cmd1[] = {"mafia", "-mfi", ".5", "-ascii", "~/Desktop/"+value.toString(), localoutput_file };
File mafia_dir = new File("/home/hduser/");
ShellCommandExecutor s = new ShellCommandExecutor(cmd1, mafia_dir);*/
} catch (Exception e) {
e.printStackTrace();
}
Try using /user/hduser/conninput/"+value.toString() in the Path constructor instead of providing the master:54310 part. It should figure out master:54310 from the Configuration.

How to design my mapper?

I have to write a mapreduce job but I dont know how to go about it,
I have jar MARD.jar through which I can instantiate MARD objects.
Using which I call the mard.normalize file meathod on it i.e. mard.normaliseFile(bunch of arguments).
This inturn creates certain output file.
For the normalise meathod to run it needs a folder called myMard in the working directory.
So I thought that I would give the myMard folder as the in input path to hadoop job, but m not sure if that would help beacuse mard.normaliseFile(bunch of arguments) will search for the myMard folder in the working directory but it will not find it as (**this is what I think) the Mapper will only be able to access the content of files through the "values" obtained from the fileSplit, it cannot give direct access to the files in the myMard folder.
In short I have to execute the follwing code through the MapReduce
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
MARD mard = new MARD(setupFolder);
Text valuz = new Text();
IntWritable intval = new IntWritable();
File original = new File("Vca1652.txt");
File mardedxml = new File("Vca1652-mardedxml.txt");
File marded = new File("Vca1652-marded.txt");
mardedxml.createNewFile();
marded.createNewFile();
NormalisationStats stats;
try {
stats = mard.normaliseFile(original,mardedxml,marded,50.0);
//This meathod requires access to the myMardfolder
System.out.println(stats);
} catch (MARDException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Please help

Resources