TokensRegex: Tokens are null after retokenization

I'm experimenting with Stanford NLP's TokensRegex and trying to find dimensions (e.g. 100x120) in a text. My plan is to first retokenize the input to split such tokens further (using the example provided in retokenize.rules.txt) and then search for the new pattern.
After retokenization, however, only null values are left in place of the original strings:
The top level annotation
[Text=100x120 Tokens=[null-1, null-2, null-3] Sentences=[100x120]]
The retokenization itself seems to work fine (3 tokens in the result), but the values are lost. What can I do to keep the original values in the token list?
My retokenize.rules.txt file (as in the demo) is:
tokens = { type: "CLASS", value:"edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
options.matchedExpressionsAnnotationKey = tokens;
options.extractWithTokens = TRUE;
options.flatten = TRUE;
ENV.defaults["ruleType"] = "tokens"
ENV.defaultStringPatternFlags = 2
ENV.defaultResultAnnotationKey = tokens
{ pattern: ( /\d+(x|X)\d+/ ), result: Split($0[0], /x|X/, TRUE) }
The main method:
public static void main(String[] args) throws IOException {
    //...
    String text = "100x120";
    Properties properties = new Properties();
    properties.setProperty("tokenize.language", "de");
    properties.setProperty("annotators", "tokenize,retokenize,ssplit,pos,lemma,ner");
    properties.setProperty("customAnnotatorClass.retokenize", "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
    properties.setProperty("retokenize.rules", "retokenize.rules.txt");
    StanfordCoreNLP stanfordPipeline = new StanfordCoreNLP(properties);
    runPipeline(stanfordPipeline, text);
}
And the pipeline:
public static void runPipeline(StanfordCoreNLP pipeline, String text) {
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    System.out.println();
    System.out.println("The top level annotation");
    System.out.println(annotation.toShorterString());
    //...
}

Thanks for letting us know. The CoreAnnotations.ValueAnnotation is not being populated, and we'll update TokensRegex to populate the field.
Regardless, you should be able to use TokensRegex to retokenize as you have planned. Most of the pipeline does not depend on the ValueAnnotation and uses the CoreAnnotations.TextAnnotation instead. You can use the CoreAnnotations.TextAnnotation to get the text for the new tokens (each token is a CoreLabel, so you can access it using token.word() as well).
See TokensRegexRetokenizeDemo for example code on how to get the different annotations out.
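For instance, a minimal sketch (assuming the annotation built in runPipeline above and the usual edu.stanford.nlp imports) that reads the retokenized tokens through their TextAnnotation:
List<CoreLabel> tokens = annotation.get(CoreAnnotations.TokensAnnotation.class);
for (CoreLabel token : tokens) {
    // token.word() reads CoreAnnotations.TextAnnotation, which is populated
    // even while ValueAnnotation is not
    System.out.println(token.word());
}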

Related

How to filter object lists and create other filtered lists from them

I receive a List of MediaDTO, and this object has two attributes:
String sizeType and String liveloUrl.
'sizeType' holds the image's size: small, medium, large, or thumbnail.
So I have to filter these objects by sizeType and create 4 new lists based on it.
This is how I get the List<MediaDTO> mediaDTO:
medias=[
  MediaDTO(sizeType=THUMBNAIL, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/thumbnail/celular-iphone-11-azul.png),
  MediaDTO(sizeType=SMALL, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/small/celular-iphone-11-azul.png),
  MediaDTO(sizeType=SMALL, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/medium/celular-iphone-11-azul.png),
  MediaDTO(sizeType=LARGE, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/large/celular-iphone-11-azul.png),
  MediaDTO(sizeType=THUMBNAIL, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/thumbnail/celular-iphone-11-vermelho.png),
  MediaDTO(sizeType=SMALL, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/small/celular-iphone-11-vermelho.png),
  MediaDTO(sizeType=MEDIUM, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/medium/celular-iphone-11-vermelho.png),
  MediaDTO(sizeType=LARGE, liveloUrl=https://s3.sao01.cloud-object-storage.appdomain.cloud/catalog-media-storage/id-source/productId/skuseller2/large/celular-iphone-11-vermelho.png)
]
I managed to filter for one of the sizes, and that works.
However, I couldn't figure out how to filter over the 4 sizes and create 4 new lists from it.
If I fix one error, another appears... so I'm really stuck.
By the way, I've been searching for a solution on the internet and in the forum for a couple of days but didn't find anything that fits.
If someone can help, I'd be really grateful.
I was thinking about using a forEach to filter, but even that way I could only filter one size.
Thanks in advance.
This is what I have so far:
public class ProcessProductDTO {
    String processId;
    OperationProcess operation;
    String categoryId;
    ProductDTO productDTO;
}

public class ProductDTO {
    String id;
    Boolean active;
    String displayName;
    String longDescription;
    List<MediaDTO> medias;
    List<AttributeDTO> attributes;
}

public class MediaDTO {
    String sizeType;
    String liveloUrl;
}
public Properties toOccProductPropertiesDTO(ProcessProductDTO processProductDTO) throws JsonProcessingException {
    String pSpecs = convertAttributes(processProductDTO.getProductDTO().getAttributes());
    //List<String> medias = convertMedias(processProductDTO.getProductDTO().getMedias());
    return Properties.builder()
            .id(processProductDTO.getProductDTO().getId())
            .active(processProductDTO.getProductDTO().getActive())
            .listPrices(new HashMap())
            .p_specs(pSpecs)
            //.medias(medias)
            .displayName(processProductDTO.getProductDTO().getDisplayName())
            .longDescription(processProductDTO.getProductDTO().getLongDescription())
            .build();
}

private String convertAttributes(List<AttributeDTO> attributes) throws JsonProcessingException {
    Map<String, String> attribs = attributes.stream()
            .collect(Collectors.toMap(AttributeDTO::getName, AttributeDTO::getValue));
    return objectMapper.writeValueAsString(attribs);
}
private List<MediaDTO> convertMedias(ProcessProductDTO processProduct, List<MediaDTO> mediaDTO) {
    List<MediaDTO> filteredList = processProduct.getProductDTO().getMedias();
    Set<String> filterSet = mediaDTO.stream().map(MediaDTO::getSizeType).collect(Collectors.toSet());
    return filteredList.stream().filter(url -> filterSet.contains("SMALL")).collect(Collectors.toList());
}
UPDATE
I got the following result:
private Properties toOccProductPropertiesDTO(ProcessProductDTO processProductDTO) throws JsonProcessingException {
    String pSpecs = convertAttributes(processProductDTO.getProductDTO().getAttributes());
    MediaOccDTO medias = convertMedias(processProductDTO.getProductDTO().getMedias());
    return Properties.builder()
            .id(processProductDTO.getProductDTO().getId())
            .active(processProductDTO.getProductDTO().getActive())
            .listPrices(new HashMap())
            .p_specs(pSpecs)
            .medias(medias)
            .displayName(processProductDTO.getProductDTO().getDisplayName())
            .longDescription(processProductDTO.getProductDTO().getLongDescription())
            .build();
}
private MediaOccDTO convertMedias(List<MediaDTO> mediaDTOs) {
    String smallImageUrls = generateOccUrl(mediaDTOs, ImageSizeType.SMALL);
    String mediumImageUrls = generateOccUrl(mediaDTOs, ImageSizeType.MEDIUM);
    String largeImageUrls = generateOccUrl(mediaDTOs, ImageSizeType.LARGE);
    String thumbImageUrls = generateOccUrl(mediaDTOs, ImageSizeType.THUMB);
    return MediaOccDTO.builder()
            .p_smallImageUrls(smallImageUrls)
            .p_mediumImageUrls(mediumImageUrls)
            .p_largeImageUrls(largeImageUrls)
            .p_thumbImageUrls(thumbImageUrls)
            .build();
}

private String generateOccUrl(List<MediaDTO> mediaDTOs, ImageSizeType imageSizeType) {
    return mediaDTOs.stream()
            .filter(m -> m.getSizeType().equals(imageSizeType))
            .map(MediaDTO::getLiveloUrl)
            .reduce(",", String::concat);
}
The problem is that the comparison
m.getSizeType().equals(imageSizeType)
is always false, so the list is created empty...
Though the question is laborious, I take the requirement to be: you need to create 4 new lists based on sizeType.
A stream collector can collect the results into a single data structure: a List, Set, Map, etc.
Since you need 4 lists based on sizeType, you would have to pass through the stream 4 times to create the 4 lists.
An alternative is to create a Map<SizeType, List<MediaDTO>>.
This can be achieved with:
mediaDTO.stream().collect(Collectors.toMap(i -> i.getSizeType(), i -> i));
Note, though, that toMap doesn't collect the values into a list; we need groupingBy instead:
mediaDTO.stream()
        .collect(Collectors.groupingBy(MediaDTO::getSizeType));
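Putting it together, a minimal sketch assuming the MediaDTO class from the question (the variable name mediaDTOs is illustrative):
// One pass groups all medias by size type; the map keys are the raw
// String values ("SMALL", "MEDIUM", "LARGE", "THUMBNAIL").
Map<String, List<MediaDTO>> bySize = mediaDTOs.stream()
        .collect(Collectors.groupingBy(MediaDTO::getSizeType));
List<MediaDTO> small = bySize.getOrDefault("SMALL", Collections.emptyList());
This also explains the always-false comparison in generateOccUrl: getSizeType() returns a String while imageSizeType is an ImageSizeType enum, so equals can never match. Comparing like with like, e.g. m.getSizeType().equals(imageSizeType.name()), fixes it.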

Change the scalar style used for all multi-line strings when serialising a dynamic model using YamlDotNet

I am using the following code snippet to serialise a dynamic model of a project to a string (which is eventually exported to a YAML file).
dynamic exportModel = exportModelConvertor.ToDynamicModel(project);
var serializerBuilder = new SerializerBuilder();
var serializer = serializerBuilder.EmitDefaults().DisableAliases().Build();
using (var sw = new StringWriter())
{
    serializer.Serialize(sw, exportModel);
    string result = sw.ToString();
}
Any multi-line strings such as the following:
propertyName = "One line of text
followed by another line
and another line"
are exported in the following format:
propertyName: >
  One line of text

  followed by another line

  and another line
Note the extra (unwanted) line breaks.
According to this YAML Multiline guide, the format used here is the folded block scalar style. Is there a way using YamlDotNet to change the style of this output for all multi-line string properties to literal block scalar style or one of the flow scalar styles?
The YamlDotNet documentation shows how to apply ScalarStyle.DoubleQuoted to a particular property using WithAttributeOverride but this requires a class name and the model to be serialised is dynamic. This also requires listing every property to change (of which there are many). I would like to change the style for all multi-line string properties at once.
To answer my own question, I've now worked out how to do this by deriving from the ChainedEventEmitter class and overriding void Emit(ScalarEventInfo eventInfo, IEmitter emitter); see the code sample below. To plug the emitter in, register it on the builder, e.g. serializerBuilder.WithEventEmitter(next => new MultilineScalarFlowStyleEmitter(next)).
public class MultilineScalarFlowStyleEmitter : ChainedEventEmitter
{
    public MultilineScalarFlowStyleEmitter(IEventEmitter nextEmitter)
        : base(nextEmitter) { }

    public override void Emit(ScalarEventInfo eventInfo, IEmitter emitter)
    {
        if (typeof(string).IsAssignableFrom(eventInfo.Source.Type))
        {
            string value = eventInfo.Source.Value as string;
            if (!string.IsNullOrEmpty(value))
            {
                bool isMultiLine = value.IndexOfAny(new char[] { '\r', '\n', '\x85', '\x2028', '\x2029' }) >= 0;
                if (isMultiLine)
                {
                    eventInfo = new ScalarEventInfo(eventInfo.Source)
                    {
                        Style = ScalarStyle.Literal
                    };
                }
            }
        }
        nextEmitter.Emit(eventInfo, emitter);
    }
}

Learning Java streams: how to pass a value from the outer loop to the nested loop in a functional way

I have a map of maps of strings. This map is the parsing of a JSON object and represents the criteria entered by the user to filter a list in the UI.
In the REST service I want to populate an object with data that comes from this map. Unfortunately I cannot change the QueryModel object. The QueryModel object has a list of filters; each filter has a list of fields and a list of operations to be applied to the fields. My goal is to convert the following code to a Java 8 stream.
for (Map.Entry<String, Map<String, String>> entry : filters.entrySet()) {
    Filter filter = new Filter();
    filter.setFields(new ArrayList<String>());
    filter.getFields().add(entry.getKey());
    filter.setValues(new ArrayList<String>());
    filter.setOperators(new ArrayList<String>());
    if (entry.getValue() != null) {
        for (String key : entry.getValue().keySet()) {
            if (key.equals("value")) {
                filter.getValues().add(entry.getValue().get(key));
            } else if (key.equals("matchMode")) {
                filter.getOperators().add(entry.getValue().get(key));
            }
        }
        queryModel.getFilters().add(filter);
    }
}
As you can see, I first set the name of the field in the fields list, and then for that field I loop over the values to get the entered value and the match mode. In a functional style I don't know how to carry the field from the outer loop so I can set it on the filter object created in the inner loop.
This was my attempt:
public static Filter getFilter(Map.Entry<String, String> entry) {
    Filter filter = new Filter();
    filter.setFields(new ArrayList<String>());
    filter.getFields().add(entry.getKey());
    filter.setValues(new ArrayList<String>());
    filter.setOperators(new ArrayList<String>());
    if (entry.getKey().equals("value")) {
        filter.getValues().add(entry.getValue());
    } else if (entry.getKey().equals("matchMode")) {
        filter.getOperators().add(entry.getValue());
    }
    return filter;
}
List<Filter> filterList = filters.entrySet().stream()
        .filter(stringMapEntry -> stringMapEntry.getValue() != null)
        .flatMap(entry -> entry.getValue().entrySet().stream())
        .map(innerEntry -> QueryModelAdapter.getFilter(innerEntry))
        .collect(Collectors.toList());
queryModel.setFilters(filterList);
In QueryModelAdapter.getFilter I need the outer entry from before the flatMap. How can I do that?
Before I say anything: be polite when asking questions. Nobody gets paid for answering questions here; everyone does it for pleasure.
So be nice to them, at least with your words.
Alright, I think your question is more suitable for Code Review than Stack Overflow.
One thing to note: you can't rewrite your legacy Java projects to have every single line use lambdas and streams.
Sometimes the old-fashioned way is better than the new features.
You don't need to iterate over a Map to retrieve a matching value; you can remove that inner loop.
Let's take your current class (whatever class you copied the code from) and name it RespectOthers.java:
private static Filter getEmptyFilter() {
    Filter filter = new Filter();
    filter.setFields(new ArrayList<String>());
    filter.setValues(new ArrayList<String>());
    filter.setOperators(new ArrayList<String>());
    return filter;
}

private static Filter setKeyAndValues(Filter inputFilterObj, Map.Entry<String, Map<String, String>> entry, QueryModel queryModel) {
    inputFilterObj.setFields(new ArrayList<String>());
    inputFilterObj.getFields().add(entry.getKey());
    if (entry.getValue() != null) {
        inputFilterObj.getValues().add(entry.getValue().get("value"));
        inputFilterObj.getOperators().add(entry.getValue().get("matchMode"));
        queryModel.getFilters().add(inputFilterObj);
    }
    return inputFilterObj;
}
List<Filter> finalOutput = filters.entrySet().stream()
        .map(e -> RespectOthers.setKeyAndValues(RespectOthers.getEmptyFilter(), e, myQueryModel))
        .collect(Collectors.toList());
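A minimal side-effect-free variant of the same idea (assuming the Filter and QueryModel types from the question): build each Filter entirely inside map() and add the collected list to the query model once, rather than mutating myQueryModel from inside the stream.
List<Filter> filterList = filters.entrySet().stream()
        .filter(e -> e.getValue() != null)
        .map(e -> {
            Filter filter = RespectOthers.getEmptyFilter();
            filter.getFields().add(e.getKey());
            // direct lookups replace the inner loop over keySet()
            filter.getValues().add(e.getValue().get("value"));
            filter.getOperators().add(e.getValue().get("matchMode"));
            return filter;
        })
        .collect(Collectors.toList());
myQueryModel.getFilters().addAll(filterList);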

Concatenating multiple processing-instruction results in a for loop in XQuery/XPath

I need to read all processing instructions with NAME="CONTENTTYPE", read each VALUE, and concatenate all the values and return them in XQuery/XPath.
My XML:
<REG>
  <MARKER MRKEID="SLREG:7.1" MRKTYPE="LD DU" MRKDATE="20130909" MRKTIME="10402688"/>
  <?METADATA NAME="CONTENTTYPE" VALUE="STATUTE"?>
  <?METADATA NAME="CONTENTTYPE" VALUE="LEGISLATIVEDOCUMENT"?>
  <?METADATA NAME="CONTENTTYPE" VALUE="PRIMARYSOURCE"?>
  <?METADATA NAME="SLTAXTYPE" VALUE="PRIMARYSOURCE"?>
</REG>
Expected output:
STATUTE
LEGISLATIVEDOCUMENT
PRIMARYSOURCE
Appreciate your help in writing the XQuery/XPath to get the output above.
Thanks in advance.
Regards,
Hari
//processing-instruction('METADATA')[matches(., 'NAME="CONTENTTYPE" VALUE="[^"]*"')]/replace(substring-after(., 'VALUE="'), '"', ''). That's XPath 2.0. To get a single concatenated string, wrap the expression in string-join(..., '&#10;').
Tagging with JDOM helped me find this.
Long answer coming... XPath does not have the native ability to parse the 'standard' way of adding 'attributes' to processing instructions, so if you want to do the concatenation of the values as part of a single XPath expression, I think you are out of luck. Actually, Martin's answer looks promising, but it will return a number of String values, not ProcessingInstructions; with JDOM 2.x you will need a Filters.string() on the XPath.compile(...) and you will get a List<String> result from path.evaluate(doc). I think it's simpler to do it outside of the XPath, especially given that there's only limited support for XPath 2.0 (by using the Saxon library with JDOM 2.x).
As for doing it programmatically, JDOM 2.x helps a fair amount. Taking your example XML, I did it two ways. The first way uses a custom Filter on the XPath result set; the second does effectively the same thing but restricts the PIs further in the loop.
public static void main(String[] args) throws Exception {
    SAXBuilder saxb = new SAXBuilder();
    Document doc = saxb.build(new File("data.xml"));

    // This custom filter will return PI's that have the NAME="CONTENTTYPE" 'pseudo' attribute...
    @SuppressWarnings("serial")
    Filter<ProcessingInstruction> contenttypefilter = new AbstractFilter<ProcessingInstruction>() {
        @Override
        public ProcessingInstruction filter(Object obj) {
            // because we know the XPath expression selects Processing Instructions
            // we can safely cast here:
            ProcessingInstruction pi = (ProcessingInstruction) obj;
            if ("CONTENTTYPE".equals(pi.getPseudoAttributeValue("NAME"))) {
                return pi;
            }
            return null;
        }
    };

    XPathExpression<ProcessingInstruction> xp = XPathFactory.instance().compile(
            // search for all METADATA PI's.
            "//processing-instruction('METADATA')",
            // The XPath will return ProcessingInstruction content, which we
            // refine with our custom filter.
            contenttypefilter);

    StringBuilder sb = new StringBuilder();
    for (ProcessingInstruction pi : xp.evaluate(doc)) {
        sb.append(pi.getPseudoAttributeValue("VALUE")).append("\n");
    }
    System.out.println(sb);
}
This second way uses the simpler, pre-defined Filters.processinginstruction() but then does the additional filtering manually:
public static void main(String[] args) throws Exception {
    SAXBuilder saxb = new SAXBuilder();
    Document doc = saxb.build(new File("data.xml"));

    XPathExpression<ProcessingInstruction> xp = XPathFactory.instance().compile(
            // search for all METADATA PI's.
            "//processing-instruction('METADATA')",
            // Use the pre-defined filter to set the generic type.
            Filters.processinginstruction());

    StringBuilder sb = new StringBuilder();
    for (ProcessingInstruction pi : xp.evaluate(doc)) {
        if (!"CONTENTTYPE".equals(pi.getPseudoAttributeValue("NAME"))) {
            continue;
        }
        sb.append(pi.getPseudoAttributeValue("VALUE")).append("\n");
    }
    System.out.println(sb);
}

Problem using Solr WordDelimiterFilter

I am doing some tests using the WordDelimiterFilter in Solr, but it doesn't preserve the protected list of words that I pass to it. Would you please inspect the code and the example output and suggest which part is missing or used incorrectly?
Running this code:
private static Analyzer getWordDelimiterAnalyzer() {
    return new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
            WordDelimiterFilterFactory wordDelimiterFilterFactory = new WordDelimiterFilterFactory();
            HashMap<String, String> args = new HashMap<String, String>();
            args.put("generateWordParts", "1");
            args.put("generateNumberParts", "1");
            args.put("catenateWords", "1");
            args.put("catenateNumbers", "1");
            args.put("catenateAll", "0");
            args.put("luceneMatchVersion", Version.LUCENE_32.name());
            args.put("language", "English");
            args.put("protected", "protected.txt");
            wordDelimiterFilterFactory.init(args);
            ResourceLoader loader = new SolrResourceLoader(null, null);
            wordDelimiterFilterFactory.inform(loader);
            /*
            List<String> protectedWords = new ArrayList<String>();
            protectedWords.add("good bye");
            protectedWords.add("hello world");
            wordDelimiterFilterFactory.inform(new LinesMockSolrResourceLoader(protectedWords));
            */
            return wordDelimiterFilterFactory.create(stream);
        }
    };
}
input text:
hello world
good bye
what is your plan for future?
protected strings:
good bye
hello world
output:
(hello,startOffset=0,endOffset=5,positionIncrement=1,type=<ALPHANUM>)
(world,startOffset=6,endOffset=11,positionIncrement=1,type=<ALPHANUM>)
(good,startOffset=12,endOffset=16,positionIncrement=1,type=<ALPHANUM>)
(bye,startOffset=17,endOffset=20,positionIncrement=1,type=<ALPHANUM>)
(what,startOffset=21,endOffset=25,positionIncrement=1,type=<ALPHANUM>)
(is,startOffset=26,endOffset=28,positionIncrement=1,type=<ALPHANUM>)
(your,startOffset=29,endOffset=33,positionIncrement=1,type=<ALPHANUM>)
(plan,startOffset=34,endOffset=38,positionIncrement=1,type=<ALPHANUM>)
(for,startOffset=39,endOffset=42,positionIncrement=1,type=<ALPHANUM>)
(future,startOffset=43,endOffset=49,positionIncrement=1,type=<ALPHANUM>)
You are using a StandardTokenizer, which tokenizes at least on whitespace, so "hello world" will always be split into "hello" and "world".
TokenStream stream = new StandardTokenizer(Version.LUCENE_32, reader);
See the Lucene documentation:
public final class StandardTokenizer extends Tokenizer
A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
The WordDelimiterFilter's protected-word list is meant for cases like these:
ISBN2345677 gets split into ISBN 2345677;
text2html does not get split into text 2 html (because text2html was added to the protected words).
If you really want to do something like you mentioned, you may use the KeywordTokenizer, but then you have to do the complete splitting yourself.
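A minimal sketch of that direction (reusing the wordDelimiterFilterFactory configured above; treat this as an assumption about your setup, not a drop-in fix):
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    // KeywordTokenizer emits the entire input as a single token, so nothing
    // is split on whitespace before the WordDelimiterFilter runs; a phrase
    // like "hello world" therefore reaches it intact.
    TokenStream stream = new KeywordTokenizer(reader);
    return wordDelimiterFilterFactory.create(stream);
}
Note that this only helps if each phrase is analyzed separately; for input containing several phrases you would have to split the input into phrases yourself first.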
