TokensRegex pattern with negation of a user-defined "macro"

I am constructing a TokensRegex expression which is meant to capture text in the form of "N Maple St" or "W Mullholland Drive". The current expression is this:
{
  ruleType: "tokens",
  pattern: (/[A-Z]/ ([{ner:PERSON}|{tag:NNP}])+),
  result: Concat($$0.text, "=", "STREET")
}
However, this also captures text like "A Honda Accord". I have defined a macro for all the different car brands, similar to:
$VEHICLES = "/[Hh]onda|[Tt]oyota/"
I want to incorporate the negation of the $VEHICLES macro into the TokensRegex expression, i.e. the pattern section above checks to see if the text captured by the {ner:PERSON} or {tag:NNP} tokens matches the $VEHICLES macro, and if it does, it is NOT a valid match.
Visually,
{
  ruleType: "tokens",
  pattern: (/[A-Z]/ ((([{ner:PERSON}|{tag:NNP}])&(!$VEHICLES))+),
  // Matches the letter and the tokens and NOT anything in the macro.
  // This pattern causes a ParseException when running CoreMapExpressionExtractor.createExtractorFromFile
  // in my pipeline code.
  result: Concat($$0.text, "=", "STREET")
}
Is there support for this functionality in TokensRegex?

Related

Is there a way to remove ALL special characters using Lucene filters?

Standard Analyzer removes special characters, but not all of them (eg: '-'). I want to index my string with only alphanumeric characters but referring to the original document.
Example: 'doc-size type' should be indexed as 'docsize' and 'type' and both should point to the original document: 'doc-size type'
It depends what you mean by "special characters", and what other requirements you may have. But the following may give you what you need, or point you in the right direction.
The following examples all assume Lucene version 8.4.1.
Basic Example
Starting with the very specific example you gave, where doc-size type should be indexed as docsize and type, here is a custom analyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import java.util.regex.Pattern;

public class MyAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new WhitespaceTokenizer();
        TokenStream tokenStream = source;
        Pattern p = Pattern.compile("\\-");
        boolean replaceAll = Boolean.TRUE;
        tokenStream = new PatternReplaceFilter(tokenStream, p, "", replaceAll);
        return new TokenStreamComponents(source, tokenStream);
    }
}
This splits on whitespace, and then removes hyphens, using a PatternReplaceFilter. It works as shown below (I use 「 and 」 as delimiters to show where whitespaces may be part of the inputs/outputs):
Input text:
「doc-size type」
Output tokens:
「docsize」
「type」
NOTE - this will remove all hyphens which are standard keyboard hyphens - but not things such as em-dashes, en-dashes, and so on. It will remove these standard hyphens regardless of where they appear in the text (word starts, word ends, on their own, etc).
A Set of Punctuation Marks
You can change the pattern to cover more punctuation, as needed - for example:
Pattern p = Pattern.compile("[$^-]");
This does the following:
Input text:
「doc-size type $foo^bar」
Output tokens:
「docsize」
「type」
「foobar」
Everything Which is Not a Character or Digit
You can use the following to remove everything which is not a character or digit:
Pattern p = Pattern.compile("[^A-Za-z0-9]");
This does the following:
Input text:
「doc-size 123 %^&*{} type $foo^bar」
Output tokens:
「docsize」
「123」
「」
「type」
「foobar」
Note that this has one empty string in the resulting tokens (the %^&*{} token is reduced to nothing).
WARNING: Whether the above will work for you depends very much on your specific, detailed requirements. For example, you may need to perform extra transformations to handle upper/lowercase differences - i.e. the usual things which typically need to be considered when indexing text.
Note on the Standard Analyzer
The StandardAnalyzer actually does remove hyphens in words (with some obscure exceptions). In your question you mentioned that it does not remove them. The standard analyzer uses the standard tokenizer, and the standard tokenizer implements the word break rules from the Unicode Text Segmentation algorithm (UAX #29). That specification includes a section discussing how hyphens in words are handled.
So, the Standard analyzer will do this:
Input text:
「doc-size type」
Output tokens:
「doc」
「size」
「type」
That should work with searches for doc as well as doc-size - it's just a question of whether it works well enough for your needs.
I understand that may not be what you want. But if you can avoid needing to build a custom analyzer, life will probably be much simpler.

Stanford TokensRegex: how to set normalized annotation using normalized output of NER annotation?

I am creating a TokensRegex annotator to extract the number of floors a building has (just an example to illustrate my question). I have a simple pattern that will recognize both "4 floors" and "four floors" as instances of my custom entity "FLOORS".
I would also like to add a NormalizedNER annotation, using the normalized value of the number entity used in the expression, but I can't get it to work the way I want to:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
ENV.defaults["ruleType"] = "tokens"
{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )
}
The rules above only set the NormalizedNER fields in the output to the text value of the number, "4" and "four" for the above examples respectively. Is there a way to use the NUMBER entity's normalized value ("4.0" both for "4" and "four") as the normalized value for my "FLOORS" entity?
Thanks in advance.
Try changing
action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.text) )
to
action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $$1.normalized) )
Annotate takes three arguments
arg1 = object to annotate (typically the matched tokens indicated by $0)
arg2 = annotation field
arg3 = value (in this case you want the NormalizedNER field instead of the text field)
With $$1.normalized as you suggested, running on the input "The building has seven floors" yields the following error message:
Annotating file test.txt { Error extracting annotation from seven floors }
It might be because the NamedEntityTagAnnotation key is not already present for the token represented by $$1. I suppose, before running TokensRegex, you'd want to make sure that your numeric tokens - either "four" or "4" in this case - have the corresponding normalized value - "4.0" in this case - set to their NamedEntityTagAnnotation key.
Also, could you please direct me to where I can find more information on the possible 3rd arguments of Annotate()? Your Javadoc page for TokensRegex expressions doesn't list $$n.normalized (perhaps it needs updating?)
I believe, that what $$n.normalized would do, would be to retrieve the value which, in Java code, would be the equivalent of coreLabel.get(edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation.class) where coreLabel is of type CoreLabel and corresponds with $$n in TokensRegex.
This is because of the following line in your TokensRegex: normalized = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NormalizedNamedEntityTagAnnotation" }
The correct answer is based on @AngelChang's answer and comment; I'm just posting it here for the sake of orderliness.
The rule has to be modified so the 2nd Annotate() action's 3rd parameter is $1[0].normalized:
{
  pattern: ( ( [ { ner:NUMBER } ] ) /floor(s?)/ ),
  action: ( Annotate($0, ner, "FLOORS"), Annotate($0, normalized, $1[0].normalized) )
}
According to @Angel's comment:
$1[0].normalized is the "normalized" field of the 0th token of the 1st capture group (as a CoreLabel). The $$1 gives you back the MatchedGroupInfo, which has the "text" field but not the normalized field (since that is on the actual token).

Ruby Regex Group Replacement

I am trying to perform regular expression matching and replacement on the same line in Ruby. I have some libraries that manipulate strings in Ruby and add special formatting characters to it. The formatting can be applied in any order. However, if I would like to change the string formatting, I want to keep some of the original formatting. I'm using regex for that. I have the regular expression matching correctly what I need:
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, 'New Text')
However, what I really want is the matching from the first grouping found in:
(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))
to be appended to New Text and replaced as opposed to just New Text. I'm trying to reference the match in the form of
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, '\1' + 'New Text')
but my understanding is that \1 only works when using \d or \k. Is there any way to reference that specific capturing group in my replacement string? Additionally, since I am using an asterisk for the [], I know that this grouping could occur more than once. Therefore, I would like to have the last matching occurrence yielded.
My expected input/output with a sample is:
Input: "\e[1mHello there\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
Input: "\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
So the last grouping is found and appended.
You can use the following regex with back-reference \\1 in the replacement:
reg = /(\\e\[(?:[0-9]{1,2}|[3,9][0-8])m)+Text/
mystring = "\\e[1mHello there\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
mystring = "\\e[1mHello there\\e[44m\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
Output of the IDEONE demo:
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
Mind that your input has a backslash \ that needs escaping in a regular string literal. To match it inside the regex, we use a double backslash, as we are looking for a literal backslash.
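The escaping point is easy to verify in Ruby itself. A quick sketch, using only standard String and Regexp behaviour, of how "\e" and "\\e" differ:

```ruby
# "\e" in a double-quoted Ruby string is the single ESC control
# character; "\\e" is two characters: a backslash then the letter e.
puts "\e".length    # => 1
puts "\\e".length   # => 2

# Inside a regex, \\ matches one literal backslash, so /\\e/ matches
# the backslash-escaped form but not the real ESC character.
p("\\e[1m" =~ /\\e/)   # => 0 (match at index 0)
p("\e[1m" =~ /\\e/)    # => nil (no literal backslash present)
```

This is why the demo above writes the input strings with double backslashes: they contain literal backslashes, not terminal escape characters.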

Ruby regex - using optional named backreferences

I am trying to write a Ruby regex that will return a set of named matches. If the first element (defined by slashes) is found anywhere later in the string then I want the match to return that 2nd match onward. Otherwise, return the whole string. The closest I've gotten is (?<p1>top_\w+).*?(?<hier>\k<p1>.*) which doesn't work for the 3rd item. I've tried regex if-then-else constructs but Rubular says they're invalid. I've tried (?<p1>[\w\/]+?)(?<hier>\k<p1>.*) which correctly splits the 1st and 4th lines but doesn't work for the others. Please note: I want all results to return as the same named reference so I can iterate through "hier".
Input:
top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
top_bat/car[0]
top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
Output:
hier = top_cat/mouse/dog/elephant/horse
hier = top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
hier = top_bat/car[0]
hier = top_2/top_1/top_3/top_4/dog
Problem
The reason it does not match the second line is because the second instance of hat does not end with a slash, but the first instance does.
Solution
Specify that there is a slash between the first and second match
Regex
(top_.*)/(\1.*$)|(^.*$)
Replacement
hier = \2\3
Example
Regex101 Permalink
More info on the Alternation token
To explain how the | token works in regex, see the example: abc|def
What this regex means in plain english is:
Match either the regex below (attempting the next alternative only if this one fails)
Match the characters abc literally
Or match the regex below (the entire match attempt fails if this one fails to match)
Match the characters def literally
Example
Regex: alpha|alphabet
If we had a phrase "I know the alphabet", only the word alpha would be matched.
However, if we changed the regex to alphabet|alpha, we would match alphabet.
So you can see, alternation works in a left-to-right fashion.
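The left-to-right behaviour is easy to confirm in Ruby itself; a quick irb-style check using the same phrase and patterns as the example above:

```ruby
phrase = "I know the alphabet"

# The first alternative that matches at a given position wins,
# so "alpha" succeeds before "alphabet" is ever tried.
puts phrase[/alpha|alphabet/]    # => "alpha"

# Reordering the alternatives lets the longer word match first.
puts phrase[/alphabet|alpha/]    # => "alphabet"
```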
paths = %w(
  top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
  top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
  top_bat/car[0]
  top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
  test/test
)

paths.each do |path|
  md = path.match(/^([^\/]*).*\/(\1(\/.*|$))/)
  hier = md ? md[2] : path
  puts hier
end
Output:
top_cat/mouse/dog/elephant/horse
top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
top_bat/car[0]
top_2/top_1/top_3/top_4/dog
test

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To leave the original unaltered and only return the result, drop the exclamation point, i.e. use sub. In the general case, if a pattern can match more than one instance and you want all of them replaced, use gsub! or gsub; without the g only the first match is replaced (which is what you want here, and in any case the ^ can only match the start of the string once).
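A small sketch of the sub vs. sub! difference (the strings are just illustrative):

```ruby
original = "h3. My Title Goes Here"

# sub returns a new string and leaves the receiver untouched.
stripped = original.sub(/^h3\. /, '')
puts stripped    # => "My Title Goes Here"
puts original    # => "h3. My Title Goes Here"

# sub! mutates in place, and returns nil when nothing matched --
# worth remembering if you chain calls on its return value.
no_prefix = "Plain title"
p no_prefix.sub!(/^h3\. /, '')   # => nil
puts no_prefix                   # => "Plain title"
```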
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match from the start of the string.
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
A space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide, for example here.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub /^h3\. /, '' #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and constitutes of:
^ means beginning of the string
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
' ' (the trailing space) is matched literally
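As an aside, if the prefix you want to strip is a fixed string rather than a pattern, Ruby 2.5+ offers String#delete_prefix, which removes the prefix only when it is present. A minimal sketch:

```ruby
# delete_prefix removes the leading substring only if it is there,
# so strings without "h3. " pass through unchanged.
puts "h3. My Title Goes Here".delete_prefix("h3. ")  # => "My Title Goes Here"
puts "No prefix here".delete_prefix("h3. ")          # => "No prefix here"
```

For the original question, where every prefix to remove is literally "h3. ", this avoids regular expressions entirely.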