Sentence detection using OpenNLP on Hadoop

I want to do sentence detection using OpenNLP and Hadoop. I have implemented it successfully in plain Java and now want to run the same thing on the MapReduce platform. Can anyone help me out?

I have done this two different ways.
One way is to push your sentence detection model out to each node in a standard directory (e.g. /opt/opennlpmodels/), then, at the class level in your mapper class, read in the serialized model and use it as needed in your map or reduce function.
Another way is to put the model in a database or the distributed cache (as a blob or something; I have used Accumulo to store document categorization models this way). Then, at the class level, make the connection to the database and read the model into a ByteArrayInputStream.
I have used Puppet to push out the models, but use whatever you typically use to keep files up to date on your cluster.
Depending on your Hadoop version, you may be able to sneak the model in as a property at job setup, so that only the master (or wherever you launch jobs from) needs to have the actual model file on it. I've never tried this.
If you need to know how to actually use the OpenNLP sentence detector let me know and I'll post an example.
HTH
import java.io.File;
import java.io.FileInputStream;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetection {

    SentenceDetector sd;

    public Span[] getSentences(String docTextFromMapFunction) throws Exception {
        if (sd == null) {
            // lazily load the serialized model from the standard path on each node
            sd = new SentenceDetectorME(new SentenceModel(
                    new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))));
        }
        // this gives you the actual sentences as a string array
        // String[] sentences = sd.sentDetect(docTextFromMapFunction);

        // this gives you the spans (the char indexes of the start and end of each
        // sentence in the doc)
        Span[] sentenceSpans = sd.sentPosDetect(docTextFromMapFunction);

        // you can also do this to get the actual sentence strings based on the spans
        // String[] spansToStrings = Span.spansToStrings(sentenceSpans, docTextFromMapFunction);

        return sentenceSpans;
    }
}
HTH... just make sure the file is in place. There are more elegant ways of doing this, but this works and it's simple.
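If you would rather ship the model with the job instead of pre-pushing it to every node, a minimal sketch using the Hadoop distributed cache (assuming the Hadoop 2.x mapreduce API) might look like the following. The HDFS path, the en-sent.bin symlink name, and the mapper's key/value types are illustrative assumptions, not part of the original answer:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private SentenceDetector sd;

    // Driver side (hypothetical HDFS path): make the model available to every task,
    // symlinked into the task working directory as "en-sent.bin":
    //   job.addCacheFile(new URI("hdfs:///models/en-sent.bin#en-sent.bin"));

    @Override
    protected void setup(Context context) throws IOException {
        // load the cached model once per task, not once per record
        try (InputStream in = new FileInputStream("en-sent.bin")) {
            sd = new SentenceDetectorME(new SentenceModel(in));
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit one record per detected sentence
        for (String sentence : sd.sentDetect(value.toString())) {
            context.write(key, new Text(sentence));
        }
    }
}
Either way, the idea is the same as above: load the model once in setup (or lazily at the class level) and reuse it for every record.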

Related

JMeter Java API - How to collect the result of a test

I know how to prepare a test plan and run it in JMeter using the Java API; there are quite a number of examples of how to do that. What's missing is a way to collect the results directly. I know it's possible to save the results to a .jtl file, but that would require me to open the file after saving it and parse it (depending on its format). I have seen that the API provides quite a number of Result classes, but I wasn't able to figure out how to use them. I have also tried debugging to figure out which classes are involved and to understand the execution model.
Any help would be really appreciated
Right, I am not sure this is the right answer; I think there's no single right answer because it really depends on your needs. At least I managed to understand a bit more by debugging a test execution.
Basically, what I ended up doing was to extend ResultCollector and add it to the TestPlan instance. The collector stores the events as they are received and prints them at the end of the test (but at that point you can do whatever you want with them).
If you have better approaches please let me know (I guess a more generic approach would be to implement SampleListener and TestStateListener without relying on the specific ResultCollector implementation).
import java.util.LinkedList;

import org.apache.jmeter.reporters.ResultCollector;
import org.apache.jmeter.samplers.SampleEvent;

public class RealtimeResultCollector extends ResultCollector {

    LinkedList<SampleEvent> collectedEvents = new LinkedList<>();

    /**
     * When a test result is received, store it internally.
     *
     * @param event the sample event that was received
     */
    @Override
    public void sampleOccurred(SampleEvent event) {
        collectedEvents.add(event);
    }

    /**
     * When the test ends, print the response code for all the events collected.
     *
     * @param host the host where the test was running
     */
    @Override
    public void testEnded(String host) {
        for (SampleEvent e : collectedEvents) {
            System.out.println("TEST_RESULT: Response code = " + e.getResult().getResponseCode()); // or do whatever you want ...
        }
    }
}
And in your main method, or wherever you create your test plan:
...
HashTree ht = new HashTree();
...
TestPlan tp = new TestPlan("MyPlan");
RealtimeResultCollector rrc = new RealtimeResultCollector();
// after a lot of configuration, before executing the test plan ...
ht.add(tp);
ht.add(ht.getArray()[0], rrc);
For the details about the code above you can find a number of examples on zgrepcode.com
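For the more generic approach mentioned above (implementing the listener interfaces directly instead of extending ResultCollector), a rough, untested sketch is below. The class name is hypothetical, and I am assuming the element still needs to be a TestElement (hence extending AbstractTestElement) so it can live in the test plan tree like the collector above:
import java.io.Serializable;
import java.util.LinkedList;

import org.apache.jmeter.samplers.SampleEvent;
import org.apache.jmeter.samplers.SampleListener;
import org.apache.jmeter.testelement.AbstractTestElement;
import org.apache.jmeter.testelement.TestStateListener;

public class GenericResultListener extends AbstractTestElement
        implements SampleListener, TestStateListener, Serializable {

    private final LinkedList<SampleEvent> collectedEvents = new LinkedList<>();

    @Override
    public void sampleOccurred(SampleEvent event) {
        collectedEvents.add(event); // store each result as it arrives
    }

    @Override
    public void sampleStarted(SampleEvent event) { }

    @Override
    public void sampleStopped(SampleEvent event) { }

    @Override
    public void testStarted() { }

    @Override
    public void testStarted(String host) { }

    @Override
    public void testEnded() {
        testEnded("local");
    }

    @Override
    public void testEnded(String host) {
        // same idea as the ResultCollector variant: process everything at the end
        for (SampleEvent e : collectedEvents) {
            System.out.println("TEST_RESULT: Response code = " + e.getResult().getResponseCode());
        }
    }
}
You would add it to the HashTree exactly as the RealtimeResultCollector is added above.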

Is it possible to access the underlying org.apache.hadoop.mapreduce.Job from a Scalding job?

In my Scalding job, I have code like this:
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class MyJob(args: Args) extends Job(args) {
  FileInputFormat.setInputPathFilter(???, classOf[MyFilter])
  // ... rest of job ...
}

class MyFilter extends PathFilter {
  def accept(path: Path): Boolean = true
}
My problem is that the first argument of the FileInputFormat.setInputPathFilter method needs to be of type org.apache.hadoop.mapreduce.Job. How can I access the Hadoop job object in my Scalding job?
Disclaimer
There is no way to extract the Job class. But you can (though you should never do it!) extract the JobConf. After that you will be able to use FileInputFormat.setInputPathFilter from the mapred (v1) API (org.apache.hadoop.mapred.JobConf), which will let you achieve the filtering.
But I suggest you not do this. Read the end of the answer.
How can you do this?
Override the stepStrategy method of scalding.Job to provide a FlowStepStrategy. For example, this implementation changes the name of each MapReduce job:
override def stepStrategy: Option[FlowStepStrategy[_]] = Some(new FlowStepStrategy[AnyRef] {
  override def apply(flow: Flow[AnyRef], predecessorSteps: util.List[FlowStep[AnyRef]], step: FlowStep[AnyRef]): Unit =
    step.getConfig match {
      case conf: JobConf =>
        // here you can modify the JobConf of each job
        conf.setJobName(...)
      case _ =>
    }
})
Why should one not do this?
Accessing the JobConf to add path filtering will only work if you are using specific Sources, and will break with others. You will also be mixing different levels of abstraction. And that's before getting into how you are supposed to know which JobConf you actually need to modify (most Scalding jobs I have seen are multi-step).
How should one resolve this problem?
I suggest you look closely at the type of Source you are using. I am pretty sure there is a way to apply path filtering there, during or before Pipe (or TypedPipe) construction.

How to allow spark to ignore missing input files?

I want to run a Spark job (Spark v1.5.1) over some generated S3 paths containing Avro files. I'm loading them with:
val avros = paths.map(p => sqlContext.read.avro(p))
Some of the paths will not exist, though. How can I get Spark to ignore those empty paths? Previously I've used this answer, but I'm not sure how to use that approach with the new DataFrame API.
Note: I'm ideally looking for a similar approach to the linked answer that just makes input paths optional. I don't particularly want to have to explicitly check for the existence of paths in S3 (since that's cumbersome and may make development awkward), but I guess that's my fallback if there's no clean way to implement this now.
I would use the Scala Try type to handle the possibility of failure when reading a directory of Avro files. With Try we can make the possibility of failure explicit in our code and handle it in a functional manner:
object Main extends App {
  import scala.util.{Success, Try}
  import org.apache.spark.{SparkConf, SparkContext}
  import com.databricks.spark.avro._

  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("example"))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  // the first path exists, the second one doesn't
  val paths = List("/data/1", "/data/2")

  // Wrap the attempt to read each path in a Try, then use collect to filter
  // and map with a single partial function.
  val avros =
    paths
      .map(p => Try(sqlContext.read.avro(p)))
      .collect {
        case Success(df) => df
      }

  // Do whatever you want with your list of DataFrames
  avros.foreach { df =>
    println(df.collect())
  }

  sc.stop()
}

Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce

I'm writing an M/R job that processes large time-series data files written in a binary format that looks something like this (newlines added here for readability; the actual data is continuous, obviously):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
Where the timestamp is simply an 8-byte struct, identifiable as such by its first 2 bytes. The actual data is bounded between duplicate timestamp values, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit these key/value pairs to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader; for example, a certain reader might receive the following split:
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Is there an easy way to access the file offsets so that my custom RecordReader can sync between splits and not lose data? I feel I have some conceptual gaps in how splits are handled, so perhaps an explanation of these may help. Thanks.
In general it is not simple to create an input format that supports splits, since you need to be able to find where to move to from the split boundary in order to get consistent records. XmlInputFormat is a good example of a format that does this.
I would suggest first considering whether you really need splittable input. You can define your input format as not splittable and avoid all of these issues.
If your files are generally not much larger than the block size, you lose nothing. If they are, you will lose some data locality.
You can subclass a concrete subclass of FileInputFormat, for example SequenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;

public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // return your customized record reader here
    }
}
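If you do go the non-splittable route, one option is a record reader that hands each mapper the whole file as a single record and leaves the timestamp-based splitting to the mapper. A rough sketch using the old mapred API is below; the class name and the key/value choice (file path as key, raw bytes as value) are illustrative assumptions, and it assumes each file fits in memory:
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

public class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {

    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    public WholeFileRecordReader(FileSplit split, JobConf job) {
        this.split = split;
        this.job = job;
    }

    @Override
    public boolean next(Text key, BytesWritable value) throws IOException {
        if (processed) {
            return false; // the single record (the whole file) has already been emitted
        }
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(job);
        // assumes the file fits in memory; with isSplitable == false the split is the whole file
        byte[] contents = new byte[(int) split.getLength()];
        FSDataInputStream in = fs.open(file);
        try {
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        key.set(file.toString());
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public BytesWritable createValue() {
        return new BytesWritable();
    }

    @Override
    public long getPos() throws IOException {
        return processed ? split.getLength() : 0;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // nothing is held open between calls
    }
}
From getRecordReader() above you would then return something like new WholeFileRecordReader((FileSplit) split, job), and scan the byte array for your duplicated timestamps inside the reader or the mapper.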

Hadoop Custom Input format with the new API

I'm a newbie to Hadoop and I'm stuck with the following problem. What I'm trying to do is to map a shard of the database (please don't ask why I need to do that, etc.) to a mapper, then do a certain operation on this data, output the results to reducers, and use that output again to do a second-phase map/reduce job on the same data, using the same shard format.
Hadoop does not provide any input method to send a shard of the database. You can only send data line by line using LineInputFormat and LineRecordReader. NLineInputFormat doesn't help in this case either. I need to extend the FileInputFormat and RecordReader classes to write my own InputFormat. I have been advised to use LineRecordReader, since the underlying code already deals with FileSplits and all the problems associated with splitting the files.
All I need to do now is to override the nextKeyValue() method, which I don't exactly know how to do.
for (int i = 0; i < shard_size; i++) {
    if (lineRecordReader.nextKeyValue()) {
        lineValue.append(lineRecordReader.getCurrentValue().getBytes(), 0,
                lineRecordReader.getCurrentValue().getLength());
    }
}
The above code snippet is the one I wrote, but somehow it doesn't work well.
I would suggest putting connection strings and other indications of where to find the shard into your input files.
The mapper will take this information, connect to the database, and do its job. I would not suggest converting result sets to Hadoop's Writable classes; it will hinder performance.
The problem I see that needs to be addressed is having enough splits of this relatively small input.
You can simply create enough small files with a few shard references each, or you can tweak the input format to build smaller splits. The second way is more flexible.
What I did is something like this: I wrote my own record reader to read n lines at a time and send them to the mappers as input.
public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 5; i++) {
        if (!lineRecordReader.nextKeyValue()) {
            return false;
        }
        lineKey = lineRecordReader.getCurrentKey();
        lineValue = lineRecordReader.getCurrentValue();
        sb.append(lineValue.toString());
        sb.append(eol);
    }
    lineValue.set(sb.toString());
    //System.out.println(lineValue.toString());
    return true;
    // throw new UnsupportedOperationException("Not supported yet.");
}
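For context, a sketch of the surrounding classes that a nextKeyValue() like the one above would live in, using the new mapreduce API, might look like the following. The class names, the NUM_LINES constant (mirroring the 5 in the snippet), and the key/value types are illustrative assumptions; the essential point is delegating to a wrapped LineRecordReader:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MultiLineInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new MultiLineRecordReader();
    }

    public static class MultiLineRecordReader extends RecordReader<LongWritable, Text> {

        private static final int NUM_LINES = 5;   // how many lines make up one record
        private static final String EOL = "\n";

        private final LineRecordReader lineRecordReader = new LineRecordReader();
        private LongWritable lineKey = new LongWritable();
        private final Text lineValue = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineRecordReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // concatenate NUM_LINES underlying lines into a single value
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < NUM_LINES; i++) {
                if (!lineRecordReader.nextKeyValue()) {
                    return false;
                }
                lineKey = lineRecordReader.getCurrentKey();
                sb.append(lineRecordReader.getCurrentValue().toString());
                sb.append(EOL);
            }
            lineValue.set(sb.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return lineKey;
        }

        @Override
        public Text getCurrentValue() {
            return lineValue;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineRecordReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineRecordReader.close();
        }
    }
}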
