I am trying to fix a bit of regex I have for a chatops bot for lita. I have the following regex:
/^(?:how\s+do\s+I\s+you\s+get\s+far\s+is\s+it\s+from\s+)?(.+)\s+to\s+(.+)/i
This is supposed to capture the words before and after 'to', with optional words in front that can form questions like: How do I get from x to y, how far from x to y, how far is it from x to y.
expected output:
match 1 : "x"
match 2 : "y"
For the most part my optional words work as expected. But when I pull my response matches, I get the words leading up to the first capture group included.
So, how far is it from sfo to lax should return:
sfo and lax.
But instead returns:
how far is it from sfo and lax
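Your glitch is that the first chunk of your regex doesn't make sense. As written, the optional group only matches the literal phrase "how do I you get far is it from", so it never matches a real question and the first (.+) swallows the lead-in words.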
To choose from multiple options, use this syntax:
(a|b|c)
What I think you're trying to do is this:
/^(?:(?:how|do|I|you|get|far|is|it|from)\s+)*(.+)\s+to\s+(.+)/i
This regexp skips any run of the listed words at the start of the string, regardless of order.
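As a quick check, here is a minimal Ruby sketch that runs the corrected pattern against the questions from the original post. The constant ROUTE_RE and the helper route_endpoints are names made up for this illustration; they are not part of lita.

```ruby
# Corrected pattern from the answer: the optional lead-in words are an
# alternation, repeated zero or more times, each followed by whitespace.
ROUTE_RE = /^(?:(?:how|do|I|you|get|far|is|it|from)\s+)*(.+)\s+to\s+(.+)/i

# Hypothetical helper: returns [origin, destination], or nil if nothing matches.
def route_endpoints(message)
  match = ROUTE_RE.match(message)
  match && match.captures
end

p route_endpoints("how far is it from sfo to lax")  # => ["sfo", "lax"]
p route_endpoints("How do I get from x to y")       # => ["x", "y"]
p route_endpoints("sfo to lax")                     # => ["sfo", "lax"]
```

Because the lead-in group is non-capturing, only the two place names end up in the match data.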
If you want to preserve word order, you can use regexps such as this pseudocode:
… how (can|do|will) (I|you|we) (get|go|travel) from …
When you want to match words, \w is the most natural pattern I'd use (e.g., it is used in word count tools.)
To capture any one word before and after "to" as a single match, you can use the regex (\w+\s+to\s+\w+).
To return them as 2 different groups, you can use (\w+)\s+to\s+(\w+).
Have a look at the demo.
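For a rough idea of how the \w-based pattern behaves in Ruby, here is a small sketch. The constant name WORD_PAIR_RE is only for illustration.

```ruby
# Two-group pattern from the answer: one word on either side of "to".
WORD_PAIR_RE = /(\w+)\s+to\s+(\w+)/

single = WORD_PAIR_RE.match("how far is it from sfo to lax")
p single.captures   # => ["sfo", "lax"]

# Limitation: \w+ stops at whitespace, so multi-word places are cut down
# to the single word that touches "to".
multi = WORD_PAIR_RE.match("from san francisco to los angeles")
p multi.captures    # => ["francisco", "los"]
```

So this form needs no list of lead-in words, but it only suits one-word endpoints such as airport codes.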
I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail#mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 6th entry (alpha-beta).
My first thought would be to split the string into an array on the comma rather than use a regular expression, but since you asked for a regex: most regex flavors let you specify an exact repetition count, so something like this should work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of characters that are not a comma, followed by a comma, exactly five times: (?:[^,]*,){5} (the ?: says not to capture). It then matches and captures any number of characters that are not a comma: ([^,]*). You want to use the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. This doesn't handle things like lists of size n, but that's just a matter of making the appropriate modifications for your situation.
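Here is a minimal Ruby sketch of the "repeat n-1 times, then capture" idea. The helper name nth_entry is made up, and the \A anchor is an addition of mine so that the count always starts at the first field.

```ruby
# Hypothetical helper: returns the n-th (1-based) comma-separated entry,
# or nil when the list has fewer than n entries.
def nth_entry(list, n)
  # Skip n-1 "field plus comma" pairs without capturing, then capture a field.
  pattern = /\A(?:[^,]*,){#{n - 1}}([^,]*)/
  match = pattern.match(list)
  match && match[1]
end

list = 'abc,def,4322,mail#mailinator.com,3321,alpha-beta,43'
p nth_entry(list, 6)  # => "alpha-beta"
p nth_entry(list, 1)  # => "abc"
p nth_entry(list, 8)  # => nil
```

Using a non-capturing group for the skipped fields keeps the wanted value in capture group 1 whatever n is.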
@list = split /,/ => $string;
$it = $list[5];   # indexes are zero-based, so 5 is the entry holding alpha-beta
or just
$it = (split /,/ => $string)[5];
Beats writing a pattern with a {6} in it every time.
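For anyone following along in Ruby rather than Perl, the same split-and-index approach might look like the sketch below. The variable names are illustrative.

```ruby
list = 'abc,def,4322,mail#mailinator.com,3321,alpha-beta,43'

# Split on commas and index into the resulting array (zero-based).
entries = list.split(",")
sixth = entries[5]
p sixth         # => "alpha-beta"

# Caveat: by default split drops trailing empty fields; pass -1 to keep them.
p "a,b,,".split(",")      # => ["a", "b"]
p "a,b,,".split(",", -1)  # => ["a", "b", "", ""]
```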
I have results for FOOD, FOOD20 and FOOD30, but a REGEX filter on FOOD also returns other results that contain FOOD, such as DOGFOOD and CATFOOD.
I am trying to place an EXACT filter by using:-
FOOD|FOOD20|FOOD30
to extract just these results instead of using REGEX. Unfortunately this is returning 0 results.
Is there another work around for this?
An exact filter is a literal string match, so you're explicitly looking for something matching all of "FOOD|FOOD20|FOOD30" exactly.
If you want to ensure that the value is exactly FOOD, FOOD20 or FOOD30, use REGEX matching, but precede each value with a caret (^), which marks the beginning of the line, and follow each value with the dollar sign ($), which marks the end of the line.
So, your REGEX expression would be:
^FOOD$|^FOOD20$|^FOOD30$
If your idea is to track anything that starts with "FOOD", followed by a number, and then ends, you can simplify your expression to the following:
^FOOD[0-9]*$
(The [0-9]* part means match the numbers 0 to 9 zero or more times, so it matches when there are no numbers after FOOD, or when there are some.)
This will match FOOD, FOOD20, FOOD30, FOOD99 and FOOD100, but not CATFOOD, DOGFOOD10, etc.
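To see the effect of the anchors, here is a small sketch that uses Ruby purely as a test bench. The reporting tool's own regex engine may differ in details, and the sample values are the ones mentioned in this thread.

```ruby
values = %w[FOOD FOOD20 FOOD30 FOOD99 FOOD100 CATFOOD DOGFOOD10]

# Without anchors, every value that merely contains FOOD matches.
unanchored = values.grep(/FOOD|FOOD20|FOOD30/)

# Anchoring each alternative keeps only the three exact values.
exact = values.grep(/^FOOD$|^FOOD20$|^FOOD30$/)

# FOOD followed by zero or more digits, and nothing else.
with_digits = values.grep(/^FOOD[0-9]*$/)

p unanchored   # all seven values
p exact        # => ["FOOD", "FOOD20", "FOOD30"]
p with_digits  # => ["FOOD", "FOOD20", "FOOD30", "FOOD99", "FOOD100"]
```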
This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: the pattern inside it must match at the current position, but the text it matches does not become part of the overall match (and so is not replaced).
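Running the lookahead version over the sample inputs from the question gives a feel for what it does and does not touch. This is only a sketch over those seven strings.

```ruby
inputs = %w[123 123.45 123,45 1,234 1,234.56 12,345.67 12,345,67]

# Remove a comma only when exactly three digits follow it.
cleaned = inputs.map { |s| s.gsub(/,(?=\d{3}\b)/, "") }

inputs.zip(cleaned).each { |before, after| puts "#{before} -> #{after}" }
# 123 -> 123
# 123.45 -> 123.45
# 123,45 -> 123,45
# 1,234 -> 1234
# 1,234.56 -> 1234.56
# 12,345.67 -> 12345.67
# 12,345,67 -> 12345,67
```

The decimal commas survive because only two digits follow them.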
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
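The difference between "match" and "capture" is easy to see side by side. In this sketch the variable names are mine and the input is one of the question's samples.

```ruby
input = "1,234.56"

# gsub replaces the WHOLE match ("1,234"), not just the parenthesized comma.
whole_match_removed = input.gsub(/\d+(,)\d{3}/, "")

# Capturing the digits on both sides lets the replacement rebuild the number.
rebuilt = input.gsub(/(\d+),(\d{3})/, '\1\2')

p whole_match_removed  # => ".56"
p rebuilt              # => "1234.56"
```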
In your example, it is possible to tell from the number of digits after the last separator (either , or .) that it is a decimal point, since there are exactly 2 digits. In most cases, if the last group of digits does not have 3 digits, you can assume that the separator in front of it is a decimal point. Another sign is a separator that appears more than once in a big number: it must be the thousands separator, which lets us tell it apart from the decimal point.
However, I can give a string 123,456 or 123.456 without any sort of context. It is impossible to tell whether they are "123 thousand 456" or "123 point 456".
You need to scan the document to look for clues as to whether , is used as the thousands separator or as the decimal point, and vice versa for the period. With that context, you can safely apply the same method to remove the thousands separators.
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing and deciding not to support is better than assuming things will work.
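If you accept the asker's stated constraints (prices that are either whole numbers or go to the hundredths place, with commas as the only thousands separator), one possible normalization is sketched below. The helper name normalize_price is hypothetical, and the rule "three digits means thousands, two trailing digits means decimals" is an assumption that breaks for the ambiguous inputs described above.

```ruby
# Hypothetical helper, valid only under the assumptions stated above.
def normalize_price(text)
  # Step 1: drop commas that are followed by exactly three digits.
  without_thousands = text.gsub(/,(?=\d{3}\b)/, "")
  # Step 2: a remaining comma before two final digits is a decimal mark.
  without_thousands.sub(/,(\d{2})\z/, '.\1')
end

p normalize_price("12,345,67")  # => "12345.67"
p normalize_price("123,45")     # => "123.45"
p normalize_price("1,234.56")   # => "1234.56"
p normalize_price("123,456")    # => "123456" (assumed to mean thousands)
```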
At the moment I have a regular expression that looks like this:
^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$
It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.
Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?
^(cat|dog|bird)+$
works, but only returns the last match, because there is only one group.
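A short Ruby sketch of that behavior, with a sample string of my own choosing:

```ruby
m = /^(cat|dog|bird)+$/.match("birddogcat")

p m[0]        # => "birddogcat"  (the whole match)
p m.captures  # => ["cat"]       (the group keeps only its last repetition)
```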
OK, so I found a solution to this.
It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.
I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl, and it shed some light on things for me. It turns out that NFA-based regexp engines (like the one used in Ruby) try alternatives sequentially, in addition to being lazy or greedy. This means that you can dictate how a pattern is matched through the order in which you give it choices. This explains why scan was returning variable results: it was looking for the first word in the list that matched the criteria and then moving on to the next match. By design it was not looking for the longest match, but the first one. So to rectify this, all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order to length order (longest to shortest).
array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length }
regexp = Regexp.union(list)
Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:
word = "ascarid"
if word.scan(regexp).join.length == word.length
  return true
else
  return false
end
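To make the effect of the ordering visible, here is a small sketch that compares the two regexps on the same word. The variable names are illustrative.

```ruby
words = %w[as ascarid car id]

alphabetical  = Regexp.union(words)
longest_first = Regexp.union(words.sort_by { |w| -w.length })

# With alphabetical order, "as" wins at position 0 before "ascarid" is tried.
p "ascarid".scan(alphabetical)   # => ["as", "car", "id"]

# With the longest words first, the whole word is matched in one piece.
p "ascarid".scan(longest_first)  # => ["ascarid"]
```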
Thanks to everyone that posted in response to this question, I hope that this will help others in the future.
You could do it in two steps:
Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
Then string.scan(/cat|dog|bird/) to get the pieces.
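The two steps might be put together as in the sketch below. The names WORDS_RE and pieces are made up for the example. It also shows why \A and \z are the safer anchors in Ruby, where ^ and $ match at line boundaries rather than only at the ends of the string.

```ruby
WORDS_RE = /\A(cat|dog|bird)+\z/

# Hypothetical helper: validate first, then pull out the individual words.
def pieces(s)
  return nil unless WORDS_RE.match?(s)
  s.scan(/cat|dog|bird/)
end

p pieces("birddogcatbird")  # => ["bird", "dog", "cat", "bird"]
p pieces("catfish")         # => nil

# The line anchors accept a string whose first line alone is valid.
p(/^(cat|dog|bird)+$/.match?("cat\nfish"))  # => true
p WORDS_RE.match?("cat\nfish")              # => false
```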
You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:
require 'set'

words = Set.new(a)
re    = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if parts.any? { |w| !words.include?(w) }
  # 's' didn't match what you expected so throw a
  # hissy fit, format the hard drive, set fire to
  # the backups, or whatever is appropriate.
else
  # Everything you were looking for is in 'parts'
  # so you can check the length (if you care about
  # how many matches there were) or something useful
  # and productive.
end
When you use split with a pattern that contains groups then
the respective matches will be returned in the array as well.
In this case, the split will hand us something like ["", "cat", "", "dog"]. The empty strings only occur between the separators that we're looking for, so we can reject them and pretend they don't exist. This may be an unexpected use of split, since we're more interested in the delimiters than in what is being delimited (except to make sure that nothing is being delimited), but it gets the job done.
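Here is a small sketch of what split returns in the good and bad cases. The word list and sample strings are made up for the illustration.

```ruby
require "set"

a     = %w[cat dog bird]
words = Set.new(a)
re    = /(#{a.map { |w| Regexp.quote(w) }.join('|')})/

# Only known words: the pieces between delimiters are all empty.
clean = "catdog".split(re)
p clean   # => ["", "cat", "", "dog"]

# A stray "X" shows up as a non-empty piece that is not in the set.
dirty = "catXdog".split(re)
p dirty   # => ["", "cat", "X", "dog"]

p clean.reject(&:empty?).all? { |w| words.include?(w) }  # => true
p dirty.reject(&:empty?).all? { |w| words.include?(w) }  # => false
```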
Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:
>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan
So you could sort your words from longest to shortest when building your regex:
re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/
and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.
Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:
if(s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/)
# Bail out and complain that 's' doesn't look right
end
Then group your words by length:
by_length = a.group_by(&:length)
and scan for the groups from the longest words to the shortest words:
# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |group|
  # Build the regex from the words of this length only.
  re = /(#{by_length[group].map{|w| Regexp.quote(w)}.join('|')})/
  s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'
There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.
If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:
r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
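For illustration, here is the interpolated version in action. I have added \A and \z anchors to mirror the original pattern. That is my assumption, not part of the snippet above.

```ruby
r  = "(cat|dog|bird)"
re = /\A#{r}#{r}?#{r}?\z/

p re.match("catdog").captures      # => ["cat", "dog", nil]
p re.match("birddogcat").captures  # => ["bird", "dog", "cat"]
p re.match("catdogbirdcat")        # => nil (four words is one too many)
```

Groups that did not take part in the match come back as nil, so the number of words is the count of non-nil captures.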
You can do it with .Net regular expressions. If I write the following in PowerShell
$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}
then I get
bird
dog
cat
bird
when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.
I'm trying to add conditional logic to determine if there's one regex match for a URL in a string. Here's an example of the string:
string_to_match = "http://www.twitpic.com/23456 ran to catch the bus, http://www.twitpic.com/3456 dodged a bullet at work."
I only want to match if I determine there's one URL in the string, so the above string wouldn't be a match in the case I'm trying to solve. I thought something like this would work:
if string_to_match =~ /[http\:\/\/]?/
puts "you're matching more then once. bad man!"
end
But it doesn't! How do I determine that there's only one match in a string?
The answer from Mladen is fine (counting the return from scan), but regular expressions already include the idea of matching the same thing multiple times or a particular number of times. In your case, you want to print the warning if your text occurs 2 or more times:
/(http:\/\/.+?){2,}/
Use .+ or .*, depending on whether you want to require the URL to have some content or not. As it stands, the .+? will match 1 or more characters in a non-greedy fashion, which is what you want. A greedy quantifier would gobble up the entire string on the first try and then have to do a bunch of backtracking before ultimately finding multiple URLs.
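A quick sketch of the {2,} pattern against the question's string and a single-URL variant of it. The constant and variable names are mine.

```ruby
MULTI_URL_RE = /(http:\/\/.+?){2,}/

two_urls = "http://www.twitpic.com/23456 ran to catch the bus, " \
           "http://www.twitpic.com/3456 dodged a bullet at work."
one_url  = "http://www.twitpic.com/23456 ran to catch the bus."

p MULTI_URL_RE.match?(two_urls)  # => true
p MULTI_URL_RE.match?(one_url)   # => false
```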
Take a look at String#scan, you can use it this way:
if string_to_match.scan(/http:\/\//).count > 1
puts "you're matching more then once. bad man!"
end
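Counting with scan could be wrapped up as in the sketch below. The helper url_count is a made-up name. The last line shows why the bracketed form from the question misbehaves: square brackets create a character class, so it matches single characters rather than the sequence http://.

```ruby
# Hypothetical helper: counts occurrences of the literal prefix "http://".
def url_count(text)
  text.scan(%r{http://}).count
end

two_urls = "http://www.twitpic.com/23456 ran to catch the bus, " \
           "http://www.twitpic.com/3456 dodged a bullet at work."
one_url  = "http://www.twitpic.com/23456 ran to catch the bus."

p url_count(two_urls)  # => 2
p url_count(one_url)   # => 1

# The character class matches h, t, t, p, :, / and / one at a time.
p "http://x".scan(/[http\:\/\/]/).count  # => 7
```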
You could do it like this:
if string_to_match =~ /((http:\/\/.*?)http:\/\/)+/
This would match only if you have 2 (or more) occurrences of http://