How to make a leading token optional in a RegexNER expression? - stanford-nlp

I have a very simple use case where I need to add an NER annotation to a sequence of two words where the first word is optional.
For example, I need to annotate both "net income" and "income" phrases as the same NE type.
With ordinary regular expressions the following expression works:
([Nn]et\s)?[Ii]ncome
However, in RegexNER it does not work.
The effect that the above regex has in RegexNER is that the word "income" is annotated in both sequences, but the word "net" is not annotated in the sequence "net income", which is not the result that I need.
That is sort of expected, knowing that RegexNER matches a sequence of regular expressions over a sequence of tokens, not a single regular expression over a single string.
However, the following syntax does not work either:
([Nn]et)? [Ii]ncome
The effect that this expression has is that the sequence "net income" is annotated entirely, but just "income" is not annotated at all.
This is unexpected, since this seems like a very simple use case.
I tried different ways to denote the initial token as a group and also tried different quantifiers - it still does not work.
Any help with making the first token optional will be appreciated.

Let me answer my own question. This is not a direct solution, it's a workaround.
The following expression will work, but only with TokensRegex, not with RegexNER:
/[Nn]et/? /[Ii]ncome/
I am not sure why this is the case; maybe RegexNER does not support quantifiers at the token level the same way TokensRegex does.
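To illustrate what an optional leading token means at the token level, here is a minimal Python sketch. It is not CoreNLP code and makes no claim about how RegexNER or TokensRegex work internally; the function names and the (regex, optional) pattern format are invented for this illustration. Each pattern element is matched against one whole token, and an element marked optional may be skipped.

```python
import re

def _match_at(tokens, pos, pattern, k):
    """Return the end index of a match of pattern[k:] starting at pos, or None."""
    if k == len(pattern):
        return pos
    regex, optional = pattern[k]
    # Greedy choice: try to consume the current token with this element first.
    if pos < len(tokens) and re.fullmatch(regex, tokens[pos]):
        end = _match_at(tokens, pos + 1, pattern, k + 1)
        if end is not None:
            return end
    # Otherwise skip the element, but only if it is optional.
    if optional:
        return _match_at(tokens, pos, pattern, k + 1)
    return None

def find_token_matches(tokens, pattern):
    """Return (start, end) token spans, scanning left to right without overlap."""
    spans = []
    i = 0
    while i < len(tokens):
        end = _match_at(tokens, i, pattern, 0)
        if end is not None and end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans

# Hypothetical pattern format: one (regex, optional) pair per token position.
NET_INCOME = [(r"[Nn]et", True), (r"[Ii]ncome", False)]

tokens = "Net income rose while income tax fell".split()
for start, end in find_token_matches(tokens, NET_INCOME):
    print(" ".join(tokens[start:end]))
```

With this sample sentence the sketch reports both the two-token span "Net income" and the one-token span "income", which is the behavior the question asks for.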

Related

Getting an exact match to the string `#deprecated` in Kibana/ELK

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it also matches strings that contain the word "deprecated" without the # sign.
I tried escaping the # according to the Lucene documentation, i.e. message:"\\#deprecated", but the results did not change.
How can I write a query that matches only the exact text #deprecated?
Why is this happening?
Your problem isn't an issue with query syntax, which is what escaping is for; it's with analysis. Your analyzer removes punctuation because it parses the field as full text. It removes #, in much the same way that it removes periods and commas.
So, after analysis (assuming standard analysis) of something like "Class is #deprecated", the token stream generated will have the following tokens: "class", "deprecated" ("is" is a stop word). The indexed forms of "#deprecated" and "deprecated" are identical, so it is impossible to have a query that can differentiate between them as the text is currently indexed.
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful that you aren't doing more harm than good. If you use WhitespaceAnalyzer, you will have to contend with other punctuation as well, and a search for "sentence" would not find "match at the end of this sentence.", because of the period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of the UAX #29 word boundary rules.
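The trade-off described above can be sketched in a few lines of Python. This is only a rough approximation and not Lucene: standard_like and whitespace_like are hypothetical helpers, the stop-word set is a small assumed subset, and real standard analysis follows UAX #29 rather than a simple \w+ split.

```python
import re

# Assumed, abbreviated stop-word list; used only for this illustration.
STOP_WORDS = {"is", "a", "the", "of", "at", "this"}

def standard_like(text):
    """Rough stand-in for standard analysis: lowercase, drop punctuation and stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]

def whitespace_like(text):
    """Rough stand-in for whitespace analysis: split on whitespace only."""
    return text.split()

log_line = "Class is #deprecated"
print(standard_like(log_line))    # the '#' is gone, so '#deprecated' and 'deprecated' look the same
print(whitespace_like(log_line))  # '#deprecated' survives as its own token

# The downside of whitespace-only splitting: trailing punctuation sticks to the token.
print("sentence" in whitespace_like("match at the end of this sentence."))
```

The last line shows why whitespace-only tokenization hurts ordinary full-text search: the token is "sentence." with the period attached, so a search for "sentence" finds nothing.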

how to implement complex pattern matching in Spring batch using PatternMatchingCompositeLineMapper

How can we implement pattern matching in Spring Batch? I am using org.springframework.batch.item.file.mapping.PatternMatchingCompositeLineMapper.
I learned that I can only use ? or * to create my pattern.
My requirement is as follows:
I have a fixed-length record file, and in each record the two characters at the 35th and 36th positions give the record type.
For example, in the record below "05" is the record type at the 35th and 36th positions, and the total length of the record is 400.
0000001131444444444444445589868444050MarketsABNAKKAAAAKKKA05568551456...........
I tried to write a regular expression but it does not work; I learned that only two special characters can be used, * and ?.
In that case I can only write something like this:
??????????????????????????????????05?????????????..................
but it does not seem to be a good solution.
Please suggest how I can solve this. Thanks a lot in advance for your help.
The PatternMatchingCompositeLineMapper uses an instance of org.springframework.batch.support.PatternMatcher to do the matching. It's important to note that PatternMatcher does not use true regular expressions. It uses something closer to ant patterns (the code is actually lifted from AntPathMatcher in Spring Core).
That being said, you have the following options:
1. Use a pattern like the one you are referring to (since there is no shorthand for specifying the number of ? to check, as there is in regular expressions).
2. Create your own composite LineMapper implementation that uses regular expressions to do the mapping.
For the record, if you choose option 2, contributing it back would be appreciated!
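Both options can be sketched in Python to show the idea. This is not Spring Batch code: fnmatch is used here only as a stand-in because it also treats ? as exactly one character and * as any run of characters, the helper names are hypothetical, and the sample record is made up (34 filler characters, the record type, then padding to 400).

```python
import fnmatch
import re

RECORD_LENGTH = 400  # total record length stated in the question

def wildcard_pattern(record_type, position=35):
    """Build a '?'/'*' pattern that pins record_type at a 1-based position."""
    return "?" * (position - 1) + record_type + "*"

def regex_pattern(record_type, position=35):
    """Regex equivalent: skip position-1 characters, then require record_type."""
    return re.compile(".{%d}%s" % (position - 1, re.escape(record_type)))

# Made-up record for illustration only.
record = ("X" * 34 + "05").ljust(RECORD_LENGTH, "Y")

pattern = wildcard_pattern("05")
print(pattern)                              # 34 question marks, then 05*
print(fnmatch.fnmatchcase(record, pattern))
print(bool(regex_pattern("05").match(record)))
```

Generating the wildcard string from the position avoids counting question marks by hand, and the trailing * means the rest of the record does not have to be spelled out.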

How to get first character that is causing reg expression not to match

We have one quite complex regular expression which checks a string's structure.
I wonder if there is an easy way to find out which character in the string is causing the regular expression not to match.
For example,
string.match(reg_exp).get_position_which_fails
Basically, the idea is to get the "position" of the state machine when it gave up.
Here is an example of regular expression:
%q^[^\p{Cc}\p{Z}]([^\p{Cc}\p{Zl}\p{Zp}]{0,253}[^\p{Cc}\p{Z}])?$
The short answer is: No.
The long answer is that a regular expression is a complicated finite state machine that may be in a state trying to match several different possible paths simultaneously. There's no way of getting a partial match out of a regular expression without constructing a regular expression that allows partial matches.
If you want to allow partial matches, either re-engineer your expression to support them, or write a parser that steps through the string using a more manual method.
You could try generating one of these automatically with Ragel if you have a particularly difficult expression to solve.
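As a sketch of the "more manual method", the check below steps through the string and reports the first offending index. It is written in Python and is based on my reading of the example expression (1 to 255 characters, no control characters or line/paragraph separators anywhere, and no space separators in the first or last position); first_failure is a hypothetical helper, not an existing API.

```python
import unicodedata

MAX_LENGTH = 255  # 1 + 253 + 1 characters allowed by the example expression

def first_failure(s):
    """Return the index of the first offending character, or None if s is valid."""
    if not s:
        return 0  # at least one character is required
    last = len(s) - 1
    for i, ch in enumerate(s):
        if i >= MAX_LENGTH:
            return i  # string is too long
        category = unicodedata.category(ch)
        if category in ("Cc", "Zl", "Zp"):
            return i  # control character or line/paragraph separator
        if category == "Zs" and i in (0, last):
            return i  # space separator at the start or the end
    return None

print(first_failure("hello world"))   # None: interior spaces are allowed
print(first_failure("hello\tworld"))  # 5: the tab is a control character
print(first_failure(" hello"))        # 0: leading space
```

Unlike the regular expression, a hand-written check like this has one well-defined place where it gives up, at the cost of keeping two descriptions of the same rule in sync.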

Regular expression for google keyword

I am trying to construct a regular expression to detect a keyword in a Google search string. For example, the URL from Google for the search term "amazing car" is
https://www.google.pl/#hl=pl&output=search&sclient=psy-ab&q=amazing+car&oq=amazing+car&aq=f& ... etc
I tried this regular expression to detect the keyword car:
(google\.).+(&|\?)q=(car)
But this does not seem to work correctly. Am I missing something?
Thank you very much for advice
Your expression would match only if the query started with "car". If you use ".*" in the group, the greedy .+ will make the "q=" match the "oq=" later in the URL.
This may work for you:
(google\.).+(&|\?)q=([^&]*car)
Or, safer though more complex, apply this regexp which will capture the keyword in the only capture group:
https?://(?:[^/]+\.)?google\.[^/]+/[^?]*[?#](?:.*&)?q=([^&]*)
Or, if your regexp engine doesn't support non-capturing groups, use this:
https?://([^/]+\.)?google\.[^/]+/[^?]*[?#](.*&)?q=([^&]*)
and read your keyword in the third group.
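The expressions above can be tried out with a short Python sketch. The URL is the visible part of the one in the question (the elided tail is left out), and search_terms is a hypothetical helper added here to turn the captured q value into individual words.

```python
import re
from urllib.parse import unquote_plus

# Visible part of the URL from the question; the elided remainder is omitted.
URL = ("https://www.google.pl/#hl=pl&output=search&sclient=psy-ab"
       "&q=amazing+car&oq=amazing+car&aq=f")

ORIGINAL = re.compile(r"(google\.).+(&|\?)q=(car)")
LOOSE = re.compile(r"(google\.).+(&|\?)q=([^&]*car)")
STRICT = re.compile(r"https?://(?:[^/]+\.)?google\.[^/]+/[^?]*[?#](?:.*&)?q=([^&]*)")

def search_terms(url):
    """Return the decoded words of the q parameter, or an empty list."""
    match = STRICT.search(url)
    if match is None:
        return []
    return unquote_plus(match.group(1)).split()

print(ORIGINAL.search(URL))        # None: the query does not start with "car"
print(LOOSE.search(URL).group(3))  # amazing+car
print(search_terms(URL))           # ['amazing', 'car']
print("car" in search_terms(URL))  # True
```

Capturing the whole q value and comparing decoded words afterwards also avoids false hits such as "scar", which a pattern ending in car would accept.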

How can I write a regex to repeatedly capture group within a larger match?

I'm getting a regex headache, so hopefully someone can help me here. I'm doing some file syntax conversion and I've got this situation in the files:
OpenMarker
keyword some expression
keyword some expression
keyword some expression
keyword some expression
keyword some expression
CloseMarker
I want to match all instances of "keyword" inside the markers. The marker areas are repeated and the keyword can appear in other places, but I don't want to match outside of the markers. What I don't seem to be able to work out is how to get a regex to pull out all the matches. I can get one to do the first or the last, but not to get all of them. I believe it should be possible and it's something to do with repeated capture groups -- can someone show me the light?
I'm using grepWin, which seems to support all the bells and whistles.
You could use:
(?<=OpenMarker((?!CloseMarker).)*)keyword(?=.*CloseMarker)
this will match the keyword inside OpenMarker and CloseMarker (using the option "dot matches newline").
sed -n -e '/OpenMarker/,/CloseMarker/p' /path/to/file | grep keyword should work, since the range address prints every line from OpenMarker through CloseMarker. Not sure if grep alone could do this.
There are only a few regex engines that support separate captures of a repeated group (.NET for example). So your best bet is to do this in two steps:
First match the section you're interested in: OpenMarker(.*?)CloseMarker (using the option "dot matches newline").
Then apply another regex to the match repeatedly: keyword (.*) (this time without the option "dot matches newline").
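The two-step approach can be sketched in Python as follows. The sample text and the keyword_expressions helper are made up for illustration; re.DOTALL plays the role of the "dot matches newline" option.

```python
import re

# Step 1: lazily capture everything between the markers, across lines.
SECTION = re.compile(r"OpenMarker(.*?)CloseMarker", re.DOTALL)
# Step 2: within one section, '.' stops at line ends, so each line is one hit.
KEYWORD_LINE = re.compile(r"keyword (.*)")

def keyword_expressions(text):
    """Return the expression after every 'keyword' found inside marker sections."""
    results = []
    for section in SECTION.findall(text):
        results.extend(KEYWORD_LINE.findall(section))
    return results

sample = (
    "keyword outside\n"
    "OpenMarker\n"
    "keyword first expression\n"
    "keyword second expression\n"
    "CloseMarker\n"
    "keyword also outside\n"
)
print(keyword_expressions(sample))
# ['first expression', 'second expression']
```

Splitting the work this way keeps each expression simple and does not rely on variable-length lookbehind or on engine-specific support for repeated captures.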
