Regular expression for Google keyword

I am trying to construct a regular expression to detect a keyword in a Google search URL. For example, the URL Google produces for the search term "amazing car" is
https://www.google.pl/#hl=pl&output=search&sclient=psy-ab&q=amazing+car&oq=amazing+car&aq=f& ... etc
I tried this regular expression to detect the keyword car:
(google\.).+(&|\?)q=(car)
But this does not seem to work correctly. Am I missing something?
Thank you very much for advice

Your expression matches only if the query starts with "car", because q= must be followed immediately by car. If you put ".*" in the group instead, it can run past the end of the q parameter and find "car" in a later parameter such as "oq=".
This may work for you:
(google\.).+(&|\?)q=([^&]*car)
Or, safer though more complex, apply this regexp which will capture the keyword in the only capture group:
https?://(?:[^/]+\.)?google\.[^/]+/[^?]*[?#](?:.*&)?q=([^&]*)
Or, if your regexp engine doesn't support non-capturing groups, use this:
https?://([^/]+\.)?google\.[^/]+/[^?]*[?#](.*&)?q=([^&]*)
and read your keyword in the third group.
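As a rough illustration, the non-capturing pattern above can be tried in Python against the URL from the question. This is only a sketch: the helper name extract_keywords is made up for this example, and the trailing "... etc" of the sample URL is left out.

```python
import re

# Pattern copied from the answer above; group 1 holds the raw q value.
GOOGLE_QUERY = re.compile(
    r"https?://(?:[^/]+\.)?google\.[^/]+/[^?]*[?#](?:.*&)?q=([^&]*)"
)

def extract_keywords(url):
    """Return the raw value of the q parameter, or None if the URL does not match.

    Hypothetical helper, named only for this illustration.
    """
    match = GOOGLE_QUERY.search(url)
    return match.group(1) if match else None

# Sample URL from the question, truncated where the question truncates it.
url = ("https://www.google.pl/#hl=pl&output=search&sclient=psy-ab"
       "&q=amazing+car&oq=amazing+car&aq=f")

keywords = extract_keywords(url)
# Google separates search words with "+", so split on it to test for a word.
contains_car = keywords is not None and "car" in keywords.split("+")
print(keywords, contains_car)
```

Capturing the whole q value first and testing for the word afterwards keeps the regular expression independent of the keyword being looked for.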

Related

Elastic search: Create tokens separated by either <space> or "-" and greater than 3 chars

In my Elasticsearch setup, I would like to create tokens separated by either " " or "-" and longer than 3 characters.
I believe the pattern tokenizer can do this, but I am not able to write the regular expression.
Please help me with the regular expression.
You should be able to use the following regex in the pattern field of your pattern tokenizer:
([^\s-]{3,})
The \s means any whitespace character.
The - means the literal dash character.
Putting the two of them between [^ and ] means match any character that isn't one of those in the list (in this case, anything that is neither whitespace nor a dash).
The {3,} means the previous match has to occur 3 times or more (use {4,} if "greater than 3 chars" is meant strictly).
The parentheses around the entire expression mean you want to capture what is inside; the pattern tokenizer can pull its tokens from a capture group when its group parameter selects one (with the default of -1 it splits on the pattern instead).
You can play with this regex here and see how it will split your string:
https://regex101.com/r/2e9p34/1
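The effect of the pattern can also be checked outside Elasticsearch. The following Python sketch only mimics the extraction of group 1 with re.findall; the sample sentence is invented for the example, and this is not how the tokenizer itself is configured.

```python
import re

# Same pattern as above: runs of 3 or more characters that are
# neither whitespace nor a dash.
TOKEN = re.compile(r"([^\s-]{3,})")

# Made-up sample text; "ab" and "x" are too short to become tokens.
sample = "foo-bar ab quick-brown x"

tokens = TOKEN.findall(sample)
print(tokens)
```

Changing {3,} to {4,} in the pattern would drop three-letter tokens such as "foo" and "bar" as well.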
On a side note, there may be other better ways to do this that will better handle edge cases you aren't thinking of, but I decided to answer your question exactly as you asked it. I highly recommend exploring all of the options ElasticSearch provides for its analyzers for your use case to see which one best fits your needs.
Hope this helps!

What does `(?:| ...)` mean in a Ruby regular expression?

While reading Engineering long-lasting software: an Agile approach using SaaS and cloud computing I came across the following regex (Chapter 5, Section 5.3 Introducing Cucumber and Capybara):
/^(?:|I )am on (.+)$/
I know about the non-capturing (?: ...) syntax, but what I don’t understand is the meaning of the first pipe character after the colon. Is it a typo? Does it serve any particular purpose?
The pipe in regex means alternative. In this case, it is expressing alternation between an empty string "" and the string "I ".
It is just an "or": the group can match either nothing or "I " (with a trailing space). The rest is a non-capturing group, as you mention.
The regex matches something like "I am on a diet" and also "am on a diet"; in both cases it captures "a diet" in the first group.
Try it out on Rubular - http://rubular.com/r/q3RFEoxj1e
(?:|something)
("nothing / empty string or the match")
In this anchored pattern it accepts the same strings as:
(?:something)?
("the match, once or none")
In other words: the non-capturing subpattern is optional.
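A quick way to convince yourself is to run both spellings against the same inputs. The sketch below uses Python rather than Ruby, which should behave the same for this particular pattern; the third input is an invented non-matching case.

```python
import re

# The pattern from the book, and the equivalent spelling with an optional group.
with_empty_alternative = re.compile(r"^(?:|I )am on (.+)$")
with_optional_group = re.compile(r"^(?:I )?am on (.+)$")

steps = ["I am on a diet", "am on a diet", "You am on a diet"]

results = {}
for step in steps:
    first = with_empty_alternative.match(step)
    second = with_optional_group.match(step)
    # Record what each pattern captured in group 1 (None when it does not match).
    results[step] = (
        first.group(1) if first else None,
        second.group(1) if second else None,
    )
    print(repr(step), results[step])
```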

How to discover a date or a number near a word - only with regex within regex

I am still learning the intricacies of regex, and am wondering if it is possible with a single regex to find a number that is at a provided distance from a word.
Consider the following text
DateClient
15-01-20130060 15-01-20140010 15-01-20150020
I want my regex to match just 15-01-2013.
I know I can match the full DateClient 15-01-2013 with DateClient\W+\d{2}-\d{2}-\d{4} and then apply a second regex afterwards, but I'm trying to build a configurable, agnostic system that gives power to the user, so I would like a single regex that matches just 15-01-2013.
Is this even feasible?
Any suggestions?
You can use a capturing group:
DateClient\W+(\d{2}-\d{2}-\d{4})
Example in JavaScript (you didn't specify a language):
var str = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020";
var date = str.match(/DateClient\W+(\d{2}-\d{2}-\d{4})/)[1];
EDIT (following the addition of the Ruby tag) :
In Ruby you can use
(?<=DateClient\W)(\d{2}-\d{2}-\d{4})
Check out lookbehind for matching only the date. However, lookbehind support of your environment can be limited.
Or you could just use a capturing group, which you will be able to extract from the match result.
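For comparison, here is a small Python sketch of both approaches on the text from the question. One assumption to note: Python's re module only accepts fixed-width lookbehind, which is why the lookbehind uses a single \W rather than \W+.

```python
import re

text = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020"

# Approach 1: match the label too, then read the date from capture group 1.
with_group = re.search(r"DateClient\W+(\d{2}-\d{2}-\d{4})", text)
date_from_group = with_group.group(1)

# Approach 2: lookbehind, so the whole match (group 0) is just the date.
# The lookbehind must be fixed-width here, hence \W and not \W+.
with_lookbehind = re.search(r"(?<=DateClient\W)\d{2}-\d{2}-\d{4}", text)
date_from_lookbehind = with_lookbehind.group(0)

print(date_from_group, date_from_lookbehind)
```

The lookbehind version satisfies the "single regex that matches only the date" requirement, at the cost of allowing exactly one separator character between the label and the date.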

(Ruby) parsing a string with RegEx

This is the string that I want to parse: 2 Sep 27 Sep 28 SOME TEXT HERE 35.00
I want to parse it into a list so that the values look like:
list[0] = 'Sep 28'
list[1] = 'SOME TEXT HERE'
list[2] = '35.00'
The RegEx that I've been working on:
^\d{1}\s{1}[a-zA-Z]{3}\s{1}\d{2}\s{1}([a-zA-Z]{3}\s{1}\d{2})\s{1}([a-zA-Z0-9]*\s{1})+(\d+.\d+)
My values are:
list[0] = 'Sep 28'
list[1] = 'HERE'
list[2] = '35.00'
The list[1] value is off. I'm also probably not parsing the spaces right, but I couldn't find any guidance in the "Pickaxe" book or online.
Your problem is in your second capture group:
([a-zA-Z0-9]*\s{1})+
The parenthesized group is repeated, matching each of the words 'SOME', 'TEXT', and 'HERE' individually, leaving your second capture group with only the final match, 'HERE'.
You need to put the + inside the capturing parentheses and use non-capturing parentheses (?:...) to enclose your existing group. Non-capturing parentheses, which start with (?: and end with ), group part of a pattern without capturing it. You can apply a repetition operator (+, *, {n}, or {n,m}) to a non-capturing group and then capture the entire repeated expression:
((?:[a-zA-Z0-9]*\s{1})+)
In total:
/^\d{1}\s{1}[a-zA-Z]{3}\s{1}\d{2}\s{1}([a-zA-Z]{3}\s{1}\d{2})\s{1}((?:[a-zA-Z0-9]*\s{1})+)(\d+.\d+)/
As a side note, this is a pretty clunky regex. You never need to specify {1}, since a single match is the default. Similarly, \d\d is one character shorter than \d{2}. Also, you probably just want \w instead of [a-zA-Z0-9] (note that \w also matches the underscore). Since you don't seem to care about case, you can use the /i option and simplify the letter character classes. The unescaped . before the decimals matches any character, so escape it as \. to require a literal decimal point. Something like this is a more idiomatic regular expression:
/^\d [a-z]{3} \d\d ([a-z]{3} \d\d) ((?:\w* )+)(\d+\.\d+)/i
Finally, though the Ruby documentation for regular expressions is a little thin, Ruby uses somewhat standard Perl-compatible regular expressions, and you can find more information about regular expressions generally at regular-expressions.info.
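The difference between repeating a capturing group and capturing a repeated group is easy to see side by side. The sketch below is in Python rather than Ruby and uses the simplified pattern with the decimal point escaped; the variable names are invented for the example. Note that the captured text ends with a space, so it is stripped before use.

```python
import re

line = "2 Sep 27 Sep 28 SOME TEXT HERE 35.00"

# Repeating a capturing group: only the last repetition is kept.
repeated_capture = re.compile(
    r"^\d [a-z]{3} \d\d ([a-z]{3} \d\d) (\w* )+(\d+\.\d+)", re.IGNORECASE
)

# Capturing a repeated non-capturing group: the whole run of words is kept.
captured_repetition = re.compile(
    r"^\d [a-z]{3} \d\d ([a-z]{3} \d\d) ((?:\w* )+)(\d+\.\d+)", re.IGNORECASE
)

before = list(repeated_capture.match(line).groups())
after = list(captured_repetition.match(line).groups())

# The description group carries a trailing space; strip it to get the wanted value.
fields = [value.strip() for value in after]

print(before)
print(fields)
```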
You may have also been here and tried this tool, but I would highly recommend Rubular. It offers very quick string parsing.
It looks like you already got the specific answer to your question, so I just wanted to drop this in for other people coming by so they can know where to go test their regex or just practice.

How can I write a regex to repeatedly capture group within a larger match?

I'm getting a regex headache, so hopefully someone can help me here. I'm doing some file syntax conversion and I've got this situation in the files:
OpenMarker
keyword some expression
keyword some expression
keyword some expression
keyword some expression
keyword some expression
CloseMarker
I want to match all instances of "keyword" inside the markers. The marker areas are repeated and the keyword can appear in other places, but I don't want to match outside of the markers. What I don't seem to be able to work out is how to get a regex to pull out all the matches. I can get one to do the first or the last, but not to get all of them. I believe it should be possible and it's something to do with repeated capture groups -- can someone show me the light?
I'm using grepWin, which seems to support all the bells and whistles.
You could use:
(?<=OpenMarker((?!CloseMarker).)*)keyword(?=.*CloseMarker)
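This will match the keyword between OpenMarker and CloseMarker (using the option "dot matches newline"). It requires an engine that allows variable-length lookbehind.
sed -n -e '/OpenMarker/,/CloseMarker/p' /path/to/file | grep keyword should work: the sed range prints only the lines from each OpenMarker through the next CloseMarker, and grep then keeps those containing the keyword. Not sure if grep alone could do this.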
There are only a few regex engines that support separate captures of a repeated group (.NET for example). So your best bet is to do this in two steps:
First match the section you're interested in: OpenMarker(.*?)CloseMarker (using the option "dot matches newline").
Then apply another regex to the match repeatedly: keyword (.*) (this time without the option "dot matches newline").
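The two steps can be sketched in Python as follows. The sample text and the variable names are invented for the example; re.DOTALL plays the role of the "dot matches newline" option in the first step and is deliberately left off in the second.

```python
import re

# Made-up sample: the keyword also appears outside the marker sections.
text = """keyword outside one
OpenMarker
keyword first expression
keyword second expression
CloseMarker
keyword outside two
OpenMarker
keyword third expression
CloseMarker
"""

# Step 1: grab the body of every marker section ("dot matches newline" on).
SECTION = re.compile(r"OpenMarker(.*?)CloseMarker", re.DOTALL)

# Step 2: inside one section body, match each keyword line
# ("dot matches newline" off, so (.*) stops at the end of the line).
KEYWORD_LINE = re.compile(r"keyword (.*)")

expressions = [
    expression
    for body in SECTION.findall(text)
    for expression in KEYWORD_LINE.findall(body)
]
print(expressions)
```

The lazy .*? in the first step matters: a greedy .* would run from the first OpenMarker to the last CloseMarker and pull in the keyword lines between sections.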
