Apache Storm performance issues running StanfordNLP bolts - stanford-nlp

So we have a bolt that takes data and tries to parse it using StanfordNLP. The main objective is to identify entities, classify words in a sentence, and find mentions. Here is the setup of the StanfordCoreNLP object. Notice that I am adding the Twitter POS model here as well.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
props.setProperty("pos.model", "gate-EN-twitter.model");
props.setProperty("dcoref.score", "true"); // use setProperty with String values; Properties.getProperty() does not return non-String entries
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
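For reference, a stripped-down sketch of how the bolt is wired up (the class and field names below are illustrative, not the exact code): the pipeline is built once per executor in prepare() and reused for every tuple, since construction is the expensive part.
import java.util.Map;
import java.util.Properties;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NlpBolt extends BaseRichBolt {
    private transient StanfordCoreNLP pipeline; // heavy and not serializable, so it is built in prepare()
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.setProperty("pos.model", "gate-EN-twitter.model");
        props.setProperty("dcoref.score", "true");
        pipeline = new StanfordCoreNLP(props); // one-time cost per executor
    }

    @Override
    public void execute(Tuple tuple) {
        Annotation document = new Annotation(tuple.getString(0));
        pipeline.annotate(document);
        // ... extract entities/mentions from the annotated document and emit them ...
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("entities"));
    }
}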
At first, it took a while to start up so we increased supervisor.worker.start.timeout.secs to 300 inside of conf/storm.yaml.
Now, while it runs, it is just so slow. Plus, we are getting strange exceptions like this one:
java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:403) ~[na:1.8.0_05]
at java.util.ArrayList.get(ArrayList.java:416) ~[na:1.8.0_05]
at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.funkyFindLeafWithApproximateSpan(RuleBasedCorefMentionFinder.java:418) ~[stormjar.jar:na]
at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.findSyntacticHead(RuleBasedCorefMentionFinder.java:346) ~[stormjar.jar:na]
at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.findHead(RuleBasedCorefMentionFinder.java:274) ~[stormjar.jar:na]
at edu.stanford.nlp.dcoref.RuleBasedCorefMentionFinder.extractPredictedMentions(RuleBasedCorefMentionFinder.java:100) ~[stormjar.jar:na]
at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:107) ~[stormjar.jar:na]
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:67) ~[stormjar.jar:na]
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:881) ~[stormjar.jar:na]
Any best practices out there on how to set up StanfordNLP bolts inside of Apache Storm?
Thanks!

Related

ConcurrentModificationException using Collectors.toSet()

I have a Set<Object> that I want to filter by class to obtain a Set<Foo> (i.e., the subset of elements that are instanceof Foo). To do this with Java 8 I wrote
Set<Foo> filtered = initialSet.parallelStream().filter(x -> (x instanceof Foo)).map(x -> (Foo) x).collect(Collectors.toSet());
This is throwing a ConcurrentModificationException:
java.util.ConcurrentModificationException
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
The problem is apparently the collector but I have no clue as to why.
Found (and fixed) the problem. The issue is the initialSet, which is obtained from a call to a third-party library (in this case a JGraphT DirectedPseudoGraph instance). The underlying graph was being modified on another thread, and since initialSet is returned by reference, the result is the ConcurrentModificationException.
The real problem is therefore not the stream processing but using a graph that hands out a Set that isn't thread-safe. The solution was to use AsSynchronizedGraph.
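For illustration, a minimal sketch of that fix, assuming a recent JGraphT release (where the graph class is spelled DirectedPseudograph and the wrapper lives in org.jgrapht.graph.concurrent); the vertex set is taken from the synchronized wrapper instead of the raw graph:
import java.util.Set;
import java.util.stream.Collectors;
import org.jgrapht.Graph;
import org.jgrapht.graph.DefaultEdge;
import org.jgrapht.graph.DirectedPseudograph;
import org.jgrapht.graph.concurrent.AsSynchronizedGraph;

Graph<Object, DefaultEdge> base = new DirectedPseudograph<>(DefaultEdge.class);
// Wrap the graph so the sets it hands out can be read safely while another thread mutates it.
Graph<Object, DefaultEdge> graph = new AsSynchronizedGraph<>(base);

Set<Object> initialSet = graph.vertexSet();
Set<Foo> filtered = initialSet.stream()
        .filter(x -> x instanceof Foo)
        .map(x -> (Foo) x)
        .collect(Collectors.toSet());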

How to write output in parquet fileformat in a MapReduce job?

I am looking to write MapReduce output in the Parquet file format using the parquet-mr library, something like this:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(ParquetOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[1]));
ParquetOutputFormat.setOutputPath(job, new Path(args[2]));
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
job.submit();
However, I keep getting errors like this one:
2018-02-23 09:32:58,325 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.NullPointerException: writeSupportClass should not be null
at org.apache.parquet.Preconditions.checkNotNull(Preconditions.java:38)
at org.apache.parquet.hadoop.ParquetOutputFormat.getWriteSupport(ParquetOutputFormat.java:350)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:293)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:548)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:622)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
I understand that writeSupportClass needs to be passed/set as something like
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
but how can I specify the schema, implement ProtoWriteSupport, or use any of the other WriteSupport classes out there? What methods do I need to implement, and are there any examples of doing this correctly?
If it helps, my MR job's output should look like the following (a Text key and an IntWritable value) and be stored in Parquet format:
Text    IntWritable
a       100
Try ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoWriteSupport<T extends MessageOrBuilder>
Implementation of WriteSupport for writing Protocol Buffers.
Check the Javadoc for the list of nested default classes available.
See also the CDH tutorial on using the Parquet file format with MapReduce, Hive, HBase, and Pig.
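For example, here is a rough sketch of wiring ProtoWriteSupport into the driver, assuming the output record is described by a generated Protocol Buffers class (MyRecord and its fields below are hypothetical; ProtoWriteSupport.setSchema is the parquet-protobuf way to declare which message class defines the schema):
// Driver: choose the Parquet output format, the protobuf write support, and the schema class.
job.setOutputFormatClass(ParquetOutputFormat.class);
ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
ProtoWriteSupport.setSchema(job.getConfiguration(), MyRecord.class); // MyRecord is a generated protobuf Message
ParquetOutputFormat.setOutputPath(job, new Path(args[2]));
ParquetOutputFormat.setCompression(job, CompressionCodecName.GZIP);

// Reducer: emit the protobuf message as the value (the key can be null/Void), e.g.
// context.write(null, MyRecord.newBuilder().setWord("a").setCount(100).build());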

Creating an index/type mapping raises an internal error with JEST

I'm using JEST to access Elasticsearch and so far it's working fine. Now I want to manage index/type mappings from my application, so I followed an example from the JEST website, but I'm getting the error below.
RootObjectMapper.Builder rootObjectMapperBuilder = new RootObjectMapper.Builder("person_mapping").add(
new StringFieldMapper.Builder("lastname").store(true));
Builder builder = new DocumentMapper.Builder("indexName", null, rootObjectMapperBuilder);
The error is raised on the last line, the one starting with new DocumentMapper.Builder .... It's rather something internal, but I'm not sure how to fix it.
java.lang.NullPointerException: null
at org.elasticsearch.Version.indexCreated(Version.java:481) ~[elasticsearch-1.7.2.jar:na]
at org.elasticsearch.index.mapper.core.NumberFieldMapper.<init>(NumberFieldMapper.java:206) ~[elasticsearch-1.7.2.jar:na]
at org.elasticsearch.index.mapper.core.IntegerFieldMapper.<init>(IntegerFieldMapper.java:132) ~[elasticsearch-1.7.2.jar:na]
at org.elasticsearch.index.mapper.internal.SizeFieldMapper.<init>(SizeFieldMapper.java:104) ~[elasticsearch-1.7.2.jar:na]
at org.elasticsearch.index.mapper.internal.SizeFieldMapper.<init>(SizeFieldMapper.java:99) ~[elasticsearch-1.7.2.jar:na]
at org.elasticsearch.index.mapper.DocumentMapper$Builder.<init>(DocumentMapper.java:182) ~[elasticsearch-1.7.2.jar:na]
Does anyone have a working example of maintaining mappings for Elasticsearch with JEST?
EDIT #1: Integration tests are not helping me :-(
I have looked at the JEST integration test focused on mapping here https://github.com/searchbox-io/Jest/blob/master/jest/src/test/java/io/searchbox/indices/PutMappingIntegrationTest.java#L46 and it doesn't help. I don't know where client() comes from... based on other searches it seems it's something from the native Java API and not REST? Any idea how to use it, or where client() comes from?
GetSettingsResponse getSettingsResponse =
client().admin().indices().getSettings(new GetSettingsRequest().indices(INDEX_NAME)).actionGet();
DocumentMapper documentMapper = new DocumentMapper
.Builder(INDEX_NAME, getSettingsResponse.getIndexToSettings().get(INDEX_NAME), rootObjectMapperBuilder).build(null);
SOLVED!
DocumentMapper.Builder requires a Settings parameter; null doesn't work here. Settings can be created manually like this:
Settings indexSettings = ImmutableSettings.settingsBuilder()
.put("number_of_shards", 1)
.put("number_of_replicas", 1)
.put("index.version.created",99999)
.build();
Builder builder = new DocumentMapper.Builder("indexName",indexSettings, rootObjectMapperBuilder);
Now I no longer see the null pointer error.
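For completeness, a rough sketch of pushing the generated mapping to Elasticsearch through Jest, mirroring the linked integration test (the index/type names and the jestClient instance are assumptions):
DocumentMapper documentMapper = builder.build(null);
String mappingSource = documentMapper.mappingSource().toString();

PutMapping putMapping = new PutMapping.Builder("indexName", "typeName", mappingSource).build();
JestResult result = jestClient.execute(putMapping);
if (!result.isSucceeded()) {
    // inspect result.getErrorMessage()
}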

How to incorporate Storm component-specific configuration data?

I have a Storm topology containing spouts/bolts.
There is some configuration data that is specific to a particular spout and also to a particular bolt that I would like to use (i.e. read from a config file) so that it is not hard-coded. Examples of config data are a filename that the spout is to read from and a filename that a bolt is to write to.
I think config data is passed into the open and prepare methods.
How can I incorporate the component-specific data from a configuration file?
There are at least two ways to do this:
1) Include application-specific configuration in the Storm config, which will be available during the IBolt.prepare() and ISpout.open() method calls. One strategy you could use is to add an application prefix to the configuration keys to avoid potential conflicts.
Config conf = new backtype.storm.Config();
// Storm-specific configuration
// ...
// ..
// .
conf.put("my.application.configuration.foo", "foo");
conf.put("my.application.configuration.bar", "foo");
StormSubmitter.submitTopology(topologyName, conf, topology);
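The component then reads those values back from the stormConf map, for example in the bolt's prepare() (the field names here are illustrative):
@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
    // Anything put into the topology Config shows up in this map on every worker.
    this.foo = (String) stormConf.get("my.application.configuration.foo");
}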
2) Pass the component configuration in via the Spout/Bolt constructor.
Properties properties = new java.util.Properties();
properties.load(new FileReader("config-file"));
BaseComponent bolt = new MyBoltImpl(properties);
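A sketch of the matching bolt, for illustration: whatever is passed to the constructor is serialized with the bolt when the topology is submitted, so it must be Serializable (java.util.Properties is); the config key shown is hypothetical.
public class MyBoltImpl extends BaseRichBolt {
    private final Properties properties; // serialized with the bolt at submit time

    public MyBoltImpl(Properties properties) {
        this.properties = properties;
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        String outputFile = properties.getProperty("bolt.output.file"); // hypothetical key
        // ...
    }

    @Override
    public void execute(Tuple tuple) {
        // ...
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
    }
}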

Use MRUnit and AVRO together

I have created a Mapper & Reducer which use AVRO for the input, map output, and reduce output. When creating an MRUnit test I get the following stack trace:
java.lang.NullPointerException
at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.deepCopy(MockOutputCollector.java:74)
at org.apache.hadoop.mrunit.mock.MockOutputCollector.collect(MockOutputCollector.java:110)
at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper$MockMapContext.write(MockMapContextWrapper.java:119)
at org.apache.avro.mapreduce.AvroMapper.writePair(AvroMapper.java:22)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:29)
at com.bol.searchrank.phase.day.DayMapper.doMap(DayMapper.java:1)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:16)
at org.apache.avro.mapreduce.AvroMapper.map(AvroMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:200)
at org.apache.hadoop.mrunit.mapreduce.MapReduceDriver.run(MapReduceDriver.java:207)
at com.bol.searchrank.phase.day.DayMapReduceTest.shouldProduceAndCountTerms(DayMapReduceTest.java:39)
The driver is initialized as follows (I have created an Avro MapReduce API implementation):
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer());
Adding a configuration object with io.serializations doesn't help:
Configuration configuration = new Configuration();
configuration.setStrings("io.serializations", new String[] {
AvroSerialization.class.getName()
});
driver = new MapReduceDriver<AvroWrapper<Pair<Utf8, LiveTrackingLine>>, NullWritable, AvroKey<Utf8>, AvroValue<Product>, AvroWrapper<Pair<Utf8, Product>>, NullWritable>().withMapper(new DayMapper()).withReducer(new DayReducer()).withConfiguration(configuration);
I use Hadoop & MRUnit 0.20.2-cdh3u2 from Cloudera and Avro MapRed 1.6.3.
You are getting an NPE because the SerializationFactory is not finding an acceptable class implementing Serialization in io.serializations.
MRUnit had several bugs related to serializations other than Writable, including MRUNIT-45, MRUNIT-70, MRUNIT-77, and MRUNIT-86 (see https://issues.apache.org/jira/browse/MRUNIT). These bugs involved the conf not getting passed to the SerializationFactory constructor correctly, or the code requiring a default constructor on the key or value, which all Writables have. All of these fixes appear in Apache MRUnit 0.9.0-incubating, which will be released sometime this week.
Cloudera's 0.20.2-cdh3u2 MRUnit is close to Apache MRUnit 0.5.0-incubating. I think your code may still be a problem even in 0.9.0-incubating; please email your full code example to mrunit-user@incubator.apache.org and the Apache MRUnit project will be happy to take a look at it.
This will compile now that MRUNIT-99 relaxes the restriction on the K2 type parameter so that it no longer has to be Comparable.
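For reference, on a release with those fixes the serialization setup would look roughly like this (a sketch: WritableSerialization is kept registered because NullWritable is among the key/value types, and AvroSerialization here is the org.apache.avro.mapred one; whether the Avro key/value schemas also need to be configured depends on the Avro version):
Configuration conf = new Configuration();
conf.setStrings("io.serializations",
        org.apache.hadoop.io.serializer.WritableSerialization.class.getName(),
        org.apache.avro.mapred.AvroSerialization.class.getName());
driver.withConfiguration(conf);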
