Regex ruby syntax to select a number while excluding specific ones - ruby

I am still struggling to find the right Ruby regex syntax despite the extensive documentation online. I have an array of strings and I am looking for strings that include a number (with any number of digits) but not specific ones (for instance, years from 19XX to 201X).
I managed to get the regex for "the line contains a number":
.*\p{N}.*
I managed to get "exclude the line if this number is a year":
(?!19\d\d|20[0-1]\d)\d{4}
But I fail to combine both. I would need something that would intuitively be written like this:
(.*\p{N}.*)&&(?!19\d\d|20[0-1]\d)\d{4}
But I am not sure how an AND operator can be used.

Here it is:
^(?!.*19\d\d.*)(?!.*20[01]\d.*)(.*\p{N}.*)$
You want a string that:
(?!.*19\d\d.*) doesn't contain 19xx
(?!.*20[01]\d.*) doesn't contain 200x or 201x
(.*\p{N}.*) contains at least one digit
In a regex, && matches the literal characters &&; it is not an AND operator. Consecutive lookaheads at the same position play the role of AND instead, because every one of them has to succeed there.
If you want to capture numbers that are not in the range 1900-2019, you can replace it with:
(?!\b19\d\d\b)(?!\b20[01]\d\b)(\b\p{N}+\b)
You can test it here
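As a rough illustration of how the two patterns above behave, here is a small sketch. The sample strings and variable names are made up for the example and are not from the question.

```ruby
# Hypothetical sample data, not taken from the question.
lines = ["room 42", "born in 1987", "no digits here", "released 2014", "order 12345"]

# Both negative lookaheads start at ^, so each of them must succeed
# before the final group is allowed to match the line.
line_pattern = /^(?!.*19\d\d.*)(?!.*20[01]\d.*)(.*\p{N}.*)$/
kept = lines.grep(line_pattern)
p kept # => ["room 42", "order 12345"]

# Second variant: pull out the individual numbers that are not years.
number_pattern = /(?!\b19\d\d\b)(?!\b20[01]\d\b)(\b\p{N}+\b)/
numbers = "Here 2014 and 1878 and 21987".scan(number_pattern).flatten
p numbers # => ["1878", "21987"]
```

Note that the line-level pattern has no word boundaries, so it would also reject a line such as "order 21987", because 1987 appears inside the longer number. The second pattern uses \b and keeps 21987.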

While the solution by Thomas is probably the best one, another option would be to go without negation and just select everything that matches:
re = /\D(
[03-9]\d*|
(?:1|2|20)(?=\D)|
1[0-8]\d*|
19\d?(?=\D)|
19\d{3,}|
20[2-9]\d*|
20[01]?(?=\D)|
20[01]\d{2,}
)/x
▶ 'Here 2014 and 1945 and 1878 and 20000 and 2 and 19 and 195 and 203'.scan re
#⇒ [["1878"], ["20000"], ["2"], ["19"], ["195"], ["203"]]

Related

Ruby Regex: How to match (named) groups inside square brackets?

I'm trying to write a regex in Ruby that will parse various date/time formats. The entire regex looks like this:
/^(?<year>\d{4})\-(?<month>\d{2})\-(?<day>\d{2})(T(?<hour>\d{2})(:(?<minute>\d{2})(:(?<second>\d{2}(\.\d{1,3})?))?)?)?(?<offset>[+-]\d{2}:\d{2})?$/
I'm using named groups so that I can fetch the matching parts out of the match object just using the simple names like "year", "month", "day", etc. This regex is working fine, but let's focus on the "offset" at the end of this:
(?<offset>[+-]\d{2}:\d{2})?
The problem is that I'm trying to add the ability to interpret a "Z" on the end of the string to denote UTC time (aka Zulu Time). This "Z" should be mutually exclusive with the offset. Here are some of the ways I've tried it:
(?<offset>[Z([+-]\d{2}:\d{2})])?
(?<offset>[(Z)([+-]\d{2}:\d{2})])?
[(?<zulu>Z)(?<offset>[+-]\d{2}:\d{2})]?
None of these work. In the first two cases, it can interpret date strings ending in "Z", but it can no longer interpret date strings ending with actual offsets like "-07:00". In the third case, the named groups "zulu" and "offset" are just totally missing from the match object.
I think this issue is because I'm trying to use square brackets to denote [(ThisGroup)(OrThisGroup)]? but I don't think the regex engine appreciates having groups inside of square brackets. How do I tell the regex engine to allow and capture "group A or group B or neither, but not both"?
Square brackets are used for "exactly one of any of these characters" -- that's not what you need here. Pattern-level alternation is done via the | operator: (hello|goodbye) world will match either hello world or goodbye world.
(?<offset>Z|[+-]\d{2}:\d{2})?
Specifically to parse a datetime, though, I suggest preferring DateTime.parse (plus to_time, if you need a Time instance). And if that isn't sufficiently flexible, consider the chronic gem.
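A minimal sketch of both suggestions, assuming a shortened version of the question's pattern (seconds and fractions are left out to keep it readable). The variable names and sample timestamps are illustrative only.

```ruby
require "date"

# Shortened version of the question's pattern; the offset group uses | for
# alternation, so it accepts either "Z" or a numeric offset, or nothing.
iso = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})T(?<hour>\d{2}):(?<minute>\d{2})(?<offset>Z|[+-]\d{2}:\d{2})?$/

zulu = iso.match("2014-05-06T07:08Z")
west = iso.match("2014-05-06T07:08-07:00")
bare = iso.match("2014-05-06T07:08")

p zulu[:offset] # => "Z"
p west[:offset] # => "-07:00"
p bare[:offset] # => nil

# The standard library alternative mentioned above.
parsed = DateTime.parse("2014-05-06T07:08:09-07:00")
p parsed.hour   # => 7
p parsed.offset # => (-7/24), the offset as a fraction of a day
```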

How to implement Siri/Cortana like functionality in commandline?

I would like to implement a small subset of siri/cortana like features in command line.
For e.g.
$ What is the sum of 100 and 1000
> Response: 1100
$ What is the product of 10 and 12
> Response: 120
The questions are predefined regular expressions. It needs to call the matching function in ruby.
Pattern: What is the sum of (\d)+ and (\d)+
Ruby method to call: sum(a,b)
Any pointers/suggestion is appreciated.
That sounds exactly like cucumber, maybe take a look and see if you can just use their classes to hack something together :) ?
You could do something like the following:
question = gets.chomp
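# The space sits outside the capture group so that $1 is exactly the method name.
if /\A.*(sum|product|quotient|difference) \D+([0-9]+)\D+([0-9]+).*\z/ =~ question
  # Assumes methods named sum, product, quotient and difference are defined.
  send($1, $2.to_i, $3.to_i)
end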
Quick explanation for anyone that may be new to matching in Ruby:
This gets a line of input from the command line and scans it for a function name (i.e. sum, product, etc.) followed by a space and one or more non-digit characters. Then, it looks for a first number (followed by one or more non-digit characters) and a second number followed by nothing or anything. The parentheses determine what gets assigned to the variables preceded by a $, i.e. the substring that matches the contents of the first set of parentheses gets assigned to $1.
Next, it calls the method whose name is the value of $1 with the arguments (converted to integers) found in $2 and $3.
Obviously, this isn't generalized at all--you're putting the method names in the regex, and it's taking a fixed number of arguments--but it'll hopefully be useful for getting you on the right track.
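One possible way to make this a little more general is a lookup table from operation names to callables, so that the matched text is never passed to send directly. This is only a sketch under assumptions: the table, the pattern, and the names operations, question_pattern and answer are all hypothetical.

```ruby
# Hypothetical table of supported operations (an assumption for this sketch).
operations = {
  "sum"        => ->(a, b) { a + b },
  "product"    => ->(a, b) { a * b },
  "difference" => ->(a, b) { a - b },
  "quotient"   => ->(a, b) { a / b } # integer division for integer input
}

# One pattern covering the predefined question shape from the post.
question_pattern = /\AWhat is the (sum|product|difference|quotient) of (\d+) and (\d+)\z/i

# Returns the computed value, or nil when the question is not recognised.
answer = lambda do |question|
  match = question_pattern.match(question.strip)
  next nil unless match
  operations.fetch(match[1].downcase).call(match[2].to_i, match[3].to_i)
end

p answer.call("What is the sum of 100 and 1000")  # => 1100
p answer.call("What is the product of 10 and 12") # => 120
p answer.call("Tell me a joke")                   # => nil
```

Adding a new question type then means adding one entry to the table and one alternative to the pattern.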

ruby regex: match URL recurring pattern

I want to be able to match all the following cases below using Ruby 1.8.7.
/pages/multiedit/16801,16809,16817,16825,16833
/pages/multiedit/16801,16809,16817
/pages/multiedit/16801
/pages/multiedit/1,3,5,7,8,9,10,46
I currently have:
\/pages\/multiedit\/\d*
This matches up to the first set of numbers. So for example:
"/pages/multiedit/16801,16809,16817,16825,16833"[/\/pages\/multiedit\/\d*/]
# => "/pages/multiedit/16801"
See http://rubular.com/r/ruFPx5yIAF for example.
Thanks for the help, regex gods.
\/pages\/multiedit\/\d+(?:,\d+)*
Example: http://rubular.com/r/0nhpgki6Gy
Edit: Updated to not capture anything... Although the performance hit would be negligible. (Thanks Tin Man)
The currently accepted answer of
\/pages\/multiedit\/[\d,]+
may not be a good idea because that will also match the following strings
.../pages/multiedit/,,,
.../pages/multiedit/,1,
My answer requires there be at least one digit before the first comma, and at least one digit between commas, and it must end with a digit.
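To make the difference concrete, here is a short comparison of the two patterns. The variable names are made up for the example, and the last line (splitting the ids into integers) is an extra step that the question did not ask for.

```ruby
strict = /\/pages\/multiedit\/\d+(?:,\d+)*/ # digits separated by single commas
loose  = /\/pages\/multiedit\/[\d,]+/       # any run of digits and commas

url = "/pages/multiedit/16801,16809,16817,16825,16833"
p url[strict] # => "/pages/multiedit/16801,16809,16817,16825,16833"

p "/pages/multiedit/,,,"[loose]  # => "/pages/multiedit/,,,"
p "/pages/multiedit/,,,"[strict] # => nil
p "/pages/multiedit/,1,"[strict] # => nil

# Assumed follow-up step: capture the id list and turn it into integers.
ids = url[/\/pages\/multiedit\/(\d+(?:,\d+)*)/, 1].split(",").map { |id| id.to_i }
p ids # => [16801, 16809, 16817, 16825, 16833]
```

Neither pattern is anchored, so the strict one still matches the valid prefix of a string like "/pages/multiedit/16801," and simply leaves the trailing comma out of the match.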
I'd use:
/\/pages\/multiedit\/[\d,]+/
Here's a demonstration of the pattern at http://rubular.com/r/h7VLZS1W1q
[\d,]+ means "find one or more digits or commas"
The reason \d* doesn't work is that it means "find zero or more digits". As soon as the pattern search runs into a comma it stops. You have to tell the engine that it's OK to find digits and commas.

Ruby (on Rails) Regex: removing thousands comma from numbers

This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: its contents must match at the current position, but they do not become part of the text that is actually matched (and subsequently replaced).
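Applied to the inputs listed in the question, the lookahead version behaves as sketched below. The final step, which turns a remaining decimal comma into a period, is an assumption added for illustration and is not part of the answer.

```ruby
inputs = ["123", "123.45", "123,45", "1,234", "1,234.56", "12,345.67", "12,345,67"]

# Remove a comma only when exactly three digits and a word boundary follow it.
stripped = inputs.map { |s| s.gsub(/,(?=\d{3}\b)/, "") }
p stripped
# => ["123", "123.45", "123,45", "1234", "1234.56", "12345.67", "12345,67"]

# Assumed extra step: treat a trailing ",dd" as a decimal comma.
normalized = stripped.map { |s| s.sub(/,(\d{2})\z/, '.\1') }
p normalized
# => ["123", "123.45", "123.45", "1234", "1234.56", "12345.67", "12345.67"]
```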
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
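The difference between matching and capturing can be seen by running both patterns on the same input. This is a small sketch; the variable names are illustrative.

```ruby
price = "1,234.56"

# gsub replaces the whole match ("1,234"), not only the captured comma.
whole_match_removed = price.gsub(/\d+(,)\d{3}/, "")
p whole_match_removed # => ".56"

# Capturing the digits on both sides lets the replacement rebuild the number.
comma_removed = price.gsub(/(\d+),(\d{3})/, '\1\2')
p comma_removed # => "1234.56"

# A decimal comma is left alone, since only two digits follow it.
p "123,45".gsub(/(\d+),(\d{3})/, '\1\2') # => "123,45"
```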
In your example, the number of digits after the last separator (either , or .) shows that it is a decimal point, since exactly two digits follow it. In most cases, if the last group of digits does not have three digits, you can assume that the separator in front of it is a decimal point. Another clue is a separator that appears several times in a large number, which helps to tell thousands separators from the decimal point.
However, I can give you a string such as 123,456 or 123.456 without any context, and it is impossible to tell whether it means "123 thousand 456" or "123 point 456".
You need to scan the document for clues as to whether , is used as the thousands separator or the decimal point, and vice versa for the period. With that context, you can safely apply the same method to remove the thousands separators.
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing about them and deciding not to support them is better than assuming things will work.

how do I pattern match a string within a string and then extract it into a variable

I have come across a problem that I cannot seem to solve. I have extracted a line from a web page into a variable. Let's say for argument's sake this is:
rhyme = "three blind mice Version 6.0"
and I want to be able to first of all locate the version number within this string (6.0) and secondly extract this number into another separate variable (I want to specifically extract no more than "6.0").
I hope I have clarified this enough, if not please ask me anything you need to know and I will get back to you asap.
First you need to decide what the pattern for a version number should be. One possibility would be \d+(\.\d+)*$ (a number followed by zero or more (dot followed by a number) at the end of the string).
Then you can use String#[] to get the substring that matches the pattern:
rhyme[ /\d+(\.\d+)*$/ ] #=> "6.0"
You need to use regular expressions. I would use rhyme.scan(/(\d+\.\d+)/) since it can return an array if multiple matches occur. It can also take a block so that you can add range checks or other checks to ensure the right one is captured.
version = "0.0"
rhyme = "three blind mice Version 6.0"
rhyme.scan(/(\d+\.\d+)/){|x| version = x[0] if x[0].to_f < 99}
p version
If the input can be trusted to yield only one match or if you always are going to use the first match you can just use the solution in this answer.
Edit: So after our discussion just go with that answer.
if rhyme =~ /(\d\.\d)/
version = $1
end
The regexp matches a single digit, followed by a period, followed by another single digit. The parentheses capture their contents. Since it is the first pair of parentheses, it is mapped to $1.
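The patterns from the answers agree on "6.0" but diverge once a version has more than one digit per part. A quick sketch, where the second sample string and the "Version" prefix pattern are assumptions added for illustration:

```ruby
rhyme = "three blind mice Version 6.0"
p rhyme[/\d+(\.\d+)*$/] # => "6.0"
p rhyme[/(\d\.\d)/, 1]  # => "6.0"

# Hypothetical input with multi-digit parts.
longer = "three blind mice Version 10.12"
p longer[/\d+(\.\d+)*$/] # => "10.12"
p longer[/(\d\.\d)/, 1]  # => "0.1", only one digit on each side of the period

# Assumed variant: anchor on the word "Version" instead of the end of the string.
version = longer[/Version\s+(\d+(?:\.\d+)*)/, 1]
p version # => "10.12"
```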

Resources