Tokens regex: expression for last token/no more tokens - stanford-nlp

I am writing a tokens regex where I need to check that there are no more tokens following. I am using []{0} to do that, but it does not work.
Specifically, for a phrase like this, "on Tuesday or after", my tokens regex is
/on|at|for/ [ner:/DATE|TIME/] /and|or/ /after|later/ []{0}
But, this expression also matches "on Tuesday or after Thursday", which is semantically different from "on Tuesday or after". Any ideas how to check for no tokens following, or to re-write the regex to match the first phrase and not the second? Thanks.

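[]{0} matches an empty sequence, which succeeds at any position, so it does not assert that nothing follows. Try anchoring the pattern with $, which in TokensRegex matches the end of the token sequence: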
/on|at|for/ [ner:/DATE|TIME/] /and|or/ /after|later/ $

Related

How to tokenize a sentence based on maximum number of words in Elasticsearch?

I have a string like "This is a beautiful day"
What tokenizer or what combination between tokenizer and token filter should I use to produce output that contains terms that have a maximum of 2 words? Ideally, the output should be:
"This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day"
So far, I have tried all built-in tokenizer, the 'pattern' tokenizer seems the one I can use, but I don't know how to write a regex pattern for my case. Any help?
It seems you're looking for the shingle token filter; it does exactly what you want.
As @Oleksii said.
In your case, max_shingle_size = 2 (which is the default). min_shingle_size cannot go below its default of 2; the single-word terms come from output_unigrams, which defaults to true.
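To make the expected output concrete, here is a minimal plain-Ruby sketch that emulates what shingling with sizes 1 to 2 produces for whitespace-separated words. It is only an illustration of the idea, not Elasticsearch configuration; `shingles` is a hypothetical helper name.

```ruby
# Emulates shingle output for whitespace-separated words: every run of
# 1..max_size consecutive words, grouped by the position where it starts.
def shingles(text, max_size = 2)
  words = text.split
  words.each_index.flat_map do |start|
    (1..max_size)
      .select { |size| start + size <= words.length }
      .map { |size| words[start, size].join(" ") }
  end
end

puts shingles("This is a beautiful day").join(", ")
# prints: This, This is, is, is a, a, a beautiful, beautiful, beautiful day, day
```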

Negative lookahead to ignore boilerplate text in an email thread

I'm trying to write a negative lookahead for a regular expression that will ignore lines of boilerplate in the text of an email, specifically the bit that goes like:
> On Sat, Apr 27, 2013 at 11:39 PM, Jane Smith <jane.smith@example.com> wrote:
I want to match all the digits that are not in my negative lookahead. I tried this:
(?!(?:^>?*\sOn\s.*wrote:\s?)$)\d
But that always matches inside that line. I'm particularly confused because this regex:
(?:^>?*\sOn\s.*wrote:\s?)$
matches that entire line. Obviously I'm missing something, but I have no idea what it is. Thanks for any help.
Try this pattern, but don't forget to remove the empty matches:
> On .*+\n> wrote:|(\d++)
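A likely reason the lookahead attempt fails: a lookahead is tested at the position of the digit itself, and the ^ inside it cannot match there, so the negative lookahead always succeeds. The alternation trick sidesteps this by letting the first branch consume the whole attribution line and capturing digits only in the second branch. Below is a minimal Ruby sketch of that idea; the pattern is a simplified single-line variant written for illustration (not the exact pattern above), and the names are hypothetical.

```ruby
# First alternative: consume a whole "On ... wrote:" attribution line.
# Second alternative: capture a run of digits anywhere else.
ATTRIBUTION_OR_DIGITS = /^>?[ \t]*On\s.*wrote:[ \t]*$|(\d+)/

def digits_outside_attribution(text)
  # scan yields [nil] whenever the first alternative matched,
  # so flatten and drop the nils (the "empty matches").
  text.scan(ATTRIBUTION_OR_DIGITS).flatten.compact
end

sample = "> On Sat, Apr 27, 2013 at 11:39 PM, Jane Smith <jane.smith@example.com> wrote:\n" \
         "> Call 555 after 6\n"
p digits_outside_attribution(sample)
# prints: ["555", "6"]
```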

What does `(?:| ...)` mean in a Ruby regular expression?

While reading Engineering long-lasting software: an Agile approach using SaaS and cloud computing I came across the following regex (Chapter 5, Section 5.3 Introducing Cucumber and Capybara):
/^(?:|I )am on (.+)$/
I know about the non-capturing (?: ...) syntax, but what I don’t understand is the meaning of the first pipe character after the colon. Is it a typo? Does it serve any particular purpose?
The pipe in regex means alternative. In this case, it is expressing alternation between an empty string "" and the string "I ".
It is just an "or". The group can match either nothing or "I " (with a trailing space). The rest is a non-capturing group, as you mention.
The regex matches something like "I am on a diet" and also "am on a diet"; in both examples it captures "a diet" in the first group.
Try it out on Rubular - http://rubular.com/r/q3RFEoxj1e
(?:|something)
("nothing / empty string or the match")
Is exactly the same thing as:
(?:something)?
("the match, once or none")
In other words: the non-capturing subpattern is optional.
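A quick Ruby check of that equivalence, as a sketch: the constant names and sample strings are made up for illustration, and the two patterns are only claimed to agree on these inputs.

```ruby
WITH_EMPTY_ALTERNATIVE = /^(?:|I )am on (.+)$/
WITH_OPTIONAL_GROUP    = /^(?:I )?am on (.+)$/

["I am on a diet", "am on a diet", "You am on a diet"].each do |step|
  # String#[] with a regex and a group index returns that capture, or nil.
  a = step[WITH_EMPTY_ALTERNATIVE, 1]
  b = step[WITH_OPTIONAL_GROUP, 1]
  puts "#{step.inspect}: #{a.inspect} / #{b.inspect}"
end
```

Both patterns capture "a diet" for the first two strings and return nil for the third.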

Regular expression for google keyword

I am trying to construct a regular expression to detect a keyword in a Google search string. For example, the URL Google produces for the search term "amazing car" is
https://www.google.pl/#hl=pl&output=search&sclient=psy-ab&q=amazing+car&oq=amazing+car&aq=f& ... etc
I tried with this regular expression to detect a keyword car:
(google\.).+(&|\?)q=(car)
But this does not seem to work correctly. Am I missing something?
Thank you very much for advice
Your expression would match only if the query started with "car". If you use ".*" in the group, the greedy .+ will make the "q=" match the "oq=" later in the URL.
This may work for you:
(google\.).+(&|\?)q=([^&]*car)
Or, safer though more complex, apply this regexp which will capture the keyword in the only capture group:
https?://(?:[^/]+\.)?google\.[^/]+/[^?]*[?#](?:.*&)?q=([^&]*)
Or, if your regexp engine doesn't support non-capturing groups, use this:
https?://([^/]+\.)?google\.[^/]+/[^?]*[?#](.*&)?q=([^&]*)
and read your keyword in the third group.
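As a sketch of how the capturing approach might be used in Ruby: extract the whole q= value with the non-capturing-group pattern, decode it, and then test for the keyword in ordinary code rather than inside the regex. The helper name is hypothetical, and the sample URL is the visible part of the one in the question, cut off before the elided remainder.

```ruby
require "uri"

# The "safer" pattern from above: the only capture group is the q= value.
GOOGLE_QUERY = %r{https?://(?:[^/]+\.)?google\.[^/]+/[^?]*[?#](?:.*&)?q=([^&]*)}

def google_search_terms(url)
  raw = url[GOOGLE_QUERY, 1]
  # "+" encodes a space in the query value; returns nil if the URL did not match.
  raw && URI.decode_www_form_component(raw)
end

url = "https://www.google.pl/#hl=pl&output=search&sclient=psy-ab&q=amazing+car&oq=amazing+car&aq=f"
terms = google_search_terms(url)
puts terms                        # prints: amazing car
puts terms.split.include?("car")  # prints: true
```

Splitting the decoded terms on whitespace avoids accidental hits such as "cart" that a plain substring search for "car" would produce.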

Regular expression match that excludes characters inside parenthesis

I have the following types of strings.
BILL SMITH (USA)
WINTHROP (FR)
LORD AT WAR (GB)
KIM SMITH
With these strings, I have the following constraints:
1. all caps
2. can be 2 to 18 characters long
3. should not have any white spaces or carriage returns at the end
4. the country abbreviation inside the parens should be excluded
5. some of the names will not have the country in parens and they should be matched too
After applying my regular expression I'd like to get the following:
BILL SMITH (USA) => BILL SMITH
WINTHROP (FR) => WINTHROP
LORD AT WAR (GB) => LORD AT WAR
KIM SMITH => KIM SMITH
I came up with the following regular expression but I'm not getting any matches:
String.scan(\([A-Z \s*]{1,18})(^?!(\([A-Z]{1,3}\)))\)
I've been banging my head on this for a while, so if anyone can point out the error I'd appreciate it.
UPDATE:
I've gotten some great responses; however, so far none of the regular expression solutions has met all the constraints. The tricky part seems to be that some of the strings have the country in parentheses and some don't. In one case, strings without the country were not being matched, and in another the result was the correct string along with the country abbreviation without the parentheses. (See the comments on the second response.) One point of clarification: every string I will be matching begins at the start of the input string. Not sure if that helps or not. Thanks again for all your help.
Here's one solution:
^((?:[A-Z]|\s){2,18}+?)(?:\s\([A-Z]+\))?$
See it on Rubular. Note that it counts 18 characters before the parenthesis; I'm not sure how you want it to behave specifically. If you want to make sure the whole line isn't more than 18 characters, I suggest you just do unless line.length < 18 ... Similarly, if you want to make sure there is no whitespace at the end, I recommend you use line.strip. That will greatly reduce the complexity of the Regexp you need and make your code more readable.
Edit: also works when no parentheses are used after the name.
The biggest error is that you wrote (^?!...) where you meant (?=...). The former means "an optional start-of-line anchor, followed by !, followed by ..., inside a capture group"; the latter means "a position in the string that is followed by ...". Fixing that, as well as making a few other tweaks, and adding the requirement that the initial string end with a letter, we get:
([A-Z\s]{1,17}[A-Z])(?=\s*\([A-Z]{1,3}\))
Update based on OP comments: Since this will always match at the start of a string, you can use \A to "anchor" your pattern to the start of the string. You can then get rid of the lookahead assertion. This:
\A[A-Z][A-Z\s]{0,16}[A-Z]
matches start-of-string, followed by an uppercase letter, followed by up to 16 characters that are either uppercase letters or whitespace characters, followed by an uppercase letter.
You can also just use gsub to remove the part(s) you don't want. To remove everything in parentheses you could do:
str.gsub(/\s*\([^)]*\)/, '')
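For comparison, here is a small Ruby sketch that runs the \A-anchored pattern and the gsub approach over the sample strings from the question. The constant names are made up for illustration.

```ruby
NAME_AT_START  = /\A[A-Z][A-Z\s]{0,16}[A-Z]/
COUNTRY_SUFFIX = /\s*\([^)]*\)/

names = ["BILL SMITH (USA)", "WINTHROP (FR)", "LORD AT WAR (GB)", "KIM SMITH"]
names.each do |line|
  anchored = line[NAME_AT_START]            # matched prefix, or nil if no match
  stripped = line.gsub(COUNTRY_SUFFIX, "")  # the line with "(...)" removed
  puts "#{line} => #{anchored} | #{stripped}"
end
```

On these four inputs both approaches give the same name. They differ elsewhere: the anchored pattern also enforces the all-caps and 2-to-18-character constraints (a longer name would be cut at 18 characters, and a lowercase one would not match), while gsub only strips the parenthesized part and validates nothing.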
