Apache Storm Trident .each() function explanation - apache-storm

I want to use Apache Storm's TridentTopology in a project. I am finding it difficult to understand the .each() function from the storm.trident.Stream class. Below is the example code given in their tutorial for reference:
TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        .parallelismHint(6);
I don't understand the signature of the .each() method. Below is what I have understood so far. Please correct me if I am wrong, and also add any further details.
.each()
The first parameter takes the fields, which correspond to the keys of the values emitted by the spout, as returned by the spout's getOutputFields() method. I still don't know what that parameter is used for.
The second parameter is a class extending BaseFunction. It processes the tuple.
My understanding of the third parameter is similar to that of the first.

The first parameter is a projection on the input tuples. In your example, only the field named "sentence" is provided to Split. If your source emits tuples with schema Fields("first", "sentence", "third"), you can only access "sentence" within Split. Furthermore, "sentence" will have index zero (not one) within Split. Pay attention that this is not a projection on the output -- all fields will remain in the output tuples! It is just a limited view of the whole tuple within Split.
The last parameter is the schema of the Values given to emit() within Split. These field names are appended as new attributes to the output tuples. Thus, the output tuple's schema is the input tuple's schema (the original one, not the projection from the first parameter) plus the fields of this last parameter.
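For illustration, here is a minimal Split along the lines of the one in the Trident tutorial (a sketch; the class in the docs may differ in details):
// Package names as of Storm 0.10.x; releases from 1.0 on use org.apache.storm.* instead
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;
import backtype.storm.tuple.Values;

public class Split extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        // Only the projected field "sentence" is visible here, at index 0,
        // even if the source tuple carries more fields.
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            // Each emitted Values matches the third parameter of .each(),
            // new Fields("word"), and is appended to the full input tuple.
            collector.emit(new Values(word));
        }
    }
}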
See section "Function" in the documentation: https://storm.apache.org/releases/0.10.0/Trident-API-Overview.html

Related

Iterate through tokens and find the entity for a token

Problem
After running CoreNLP over some text, I want to reconstruct a sentence, adding the POS tag for each token and grouping the tokens that form an entity.
This could easily be done if there were a way to see which entity a token belongs to.
Approach
One option I was considering was going through sentence.tokens() and finding each token's index in a list containing only the tokens from all the CoreEntityMentions for that sentence. Then I could see which CoreEntityMention the token belongs to, so I can group them.
Another option could be to look at the offsets of each token in the sentence and compare them to the offsets of each CoreEntityMention.
I think this question is similar to what was asked here, but since that was a while ago, the API may have changed since then.
This is the setup:
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
String text = "Some text with entities goes here";
CoreDocument coreDoc = new CoreDocument(text);
// annotate the document
pipeline.annotate(coreDoc);
for (CoreSentence sentence : coreDoc.sentences()) {
    // Code goes here
    List<CoreEntityMention> em = sentence.entityMentions();
}
Each token in an entity mention carries the index of the entity mention in the document that it corresponds to:
cl.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
(where cl is the token's CoreLabel). I'll make a note to add a convenience method for this in future versions.
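Building on that, a minimal sketch of the grouping (assumption: EntityMentionIndexAnnotation indexes into the document-level coreDoc.entityMentions() list, and tokens outside any mention carry no index):
import java.util.List;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.CoreSentence;

for (CoreSentence sentence : coreDoc.sentences()) {
    for (CoreLabel token : sentence.tokens()) {
        // POS tag comes from the pos annotator configured above
        String tagged = token.word() + "/" + token.tag();
        Integer mentionIdx = token.get(CoreAnnotations.EntityMentionIndexAnnotation.class);
        if (mentionIdx != null) {
            // assumed: the index points into the document-level mention list
            CoreEntityMention em = coreDoc.entityMentions().get(mentionIdx);
            System.out.println(tagged + " -> entity mention: " + em.text());
        } else {
            System.out.println(tagged);
        }
    }
}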

Print Probabilities from CoreNLP Pipeline

I'm aware of the functionality of using printProbs from a classifier to print the probabilities that a particular token is a particular NER type. However, how can I access the CRFClassifier used by the CoreNLP pipeline in the code below so that I can actually call the printProbs method?
// create an empty Annotation just with the given text
Annotation document = new Annotation(text);
// run all Annotators on this text
pipeline.annotate(document);
One possible approach is to write your own custom ner annotator that very closely resembles the stock one, and add that to your pipeline instead of ner. So basically, start with a copy of that code; from within there you can access the CRFClassifier methods which contain the cliques and the probabilities.
One could also use reflection (hahaha):
NERCombinerAnnotator nerAnnotator =
    (NERCombinerAnnotator) StanfordCoreNLP.getExistingAnnotator(Annotator.STANFORD_NER);
Field field = nerAnnotator.getClass().getDeclaredField("ner");
field.setAccessible(true);
NERClassifierCombiner classifier = (NERClassifierCombiner) field.get(nerAnnotator);
Field field2 = classifier.getClass().getSuperclass().getDeclaredField("baseClassifiers");
field2.setAccessible(true);
// one of these will be the CRFClassifier used
List baseClassifiers = (List) field2.get(classifier);
You should also realize that a number of the usual reflection exceptions could be raised using this code.
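For completeness, a hedged sketch of handling those checked exceptions (standard java.lang.reflect types; the wrapper and message are illustrative only):
try {
    // ... the reflection code from above ...
} catch (NoSuchFieldException | IllegalAccessException e) {
    // field name or access modifiers may differ across CoreNLP versions
    throw new RuntimeException("Could not extract classifier via reflection", e);
}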

Multiple parallel Increments on Parse.Object

Is it acceptable to perform multiple increment operations on different fields of the same object on Parse Server?
e.g., in Cloud Code:
node.increment('totalExpense', cost);
node.increment('totalLabourCost', cost);
node.increment('totalHours', hours);
return node.save(null,{useMasterKey: true});
It seems like MongoDB supports it, based on this answer, but does Parse?
Yes. One thing you can't do is both add and remove something from the same array within the same save. You can only do one of those operations. But, incrementing separate keys shouldn't be a problem. Incrementing a single key multiple times might do something weird but I haven't tried it.
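To make the array constraint concrete, a small hedged sketch (field names are hypothetical; add and remove are the standard Parse.Object array operations):
// Fine: multiple increments on different keys within one save
node.increment('totalExpense', cost);
node.increment('totalHours', hours);
// Not fine within one save: adding and removing on the same array
node.add('tags', 'urgent');
node.remove('tags', 'stale'); // do this in a separate save instead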
FYI you can also use the .increment method on a key for a shell object. I.e., this works:
var node = new Parse.Object("Node");
node.id = request.params.nodeId;
node.increment("myKey", value);
return node.save(null, { useMasterKey: true });
Even though we didn't fetch the object, we don't need to know the previous value in order to increment it on the database. Note, however, that since the data was never fetched, you can't access any of the object's other fields here.

DOORS DXL issue looping through a filtered dataset

I have a script in which I filter the data in a module by a certain attribute value. When I loop through the resulting objects, I display each object's absolute number in an infoBox. However, the script is displaying absolute numbers of objects that are not in the dataset. Upon further investigation, I found that the extra absolute numbers belonged to the tables within the module. I can't figure out why the script includes these tables when they are not in the filtered data. I have even tried manually filtering the module on the attribute value and then using "Tools -> Edit DXL" to loop through the resulting items, and it still displays the numbers for the tables that are not included. Why would it do this?
Here's my code:
bm2 = moduleVar
Filter fltr = contains(attribute "RCR_numbers", sRCRNum, false);
filtering on;
set(bm2, fltr);
for oObj in document(bm2) do {
    absNum = oObj."Absolute Number";
    infoBox("Object #" absNum ".");
}
I have also tried removing the document cast so that it reads "for oObj in bm2 do" instead, but this doesn't change the output. Why is the code giving me objects that are not in the filter? Any help would be greatly appreciated, since this is a high-priority issue for my project and I'm out of ideas.
Chris
In the DOORS 9.6.1 DXL Reference Manual you can see that:
for object in document
Assigns the variable o to be each successive object in the module. It is equivalent to the for object in module loop, except that it includes table header objects, but not the row header objects nor the cells.
So, you must either use for object in module or, within your existing loop, test the hidden attribute TableType: it is set to TableNone only for objects that are not part of any table, and table headers count as part of a table, so the test skips them as well. A sketch of that test follows below.
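A hedged sketch of the TableType test applied to the loop from the question (attribute and value names as given in the DXL manual; adjust to your DOORS version):
// Show only objects that are not part of any table
for oObj in document(bm2) do {
    if ((oObj."TableType" "") == "TableNone") {
        absNum = oObj."Absolute Number";
        infoBox("Object #" absNum ".");
    }
}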

Split node.js event stream based on unique id in event body

I have a node.js-based location service that produces events as users move about, and I want to use RxJS to process these events and look for arbitrarily complex patterns, such as "a user enters a region and visits 2 points of interest within 1 hour".
To begin with, I need to split the stream of events based on the unique user ids (from the event body), but I am not finding any stock RxJS functions that will do this.
filter() would require that all uuids be known beforehand, which is not desirable.
groupBy() looks like it would need to process the entire sequence prior to returning the grouped observables, which is not possible.
I'm thinking that maybe I need to build a custom observable that maintains a map of uuids to observables and instantiates new observables as required. Each of these observables would then need to undergo identical processing in search of the pattern match, and ultimately trigger some action when a user's movements match the pattern. One of the obvious challenges here is that the map of observables keeps growing as users enter the system and move about.
Any ideas how something like this could be achieved with RxJS?
I think you are misunderstanding how groupBy() works. It will generate a new Observable every time a new key is encountered; if the key already exists, the event will just be pushed into the existing group's Observable.
So for your problem it should look something like this:
var source = getLocationEvents();
var disposable = new Rx.CompositeDisposable();
disposable.add(
    source.groupBy(function (x) { return x.userid; })
        .map(function (x) {
            return x.map(function (ev) { /* Process the event */ });
        })
        .subscribe(function (group) {
            disposable.add(group.subscribe(/* Do something with the event */));
        })
);
