Fix regex to extract specific number formats - ruby

Ideally my regex should capture/extract all the following number formats:
500 /
500.55 /
500k /
500.55k /
500 to 600 /
500k to 600k /
500 to 600k /
500.55 to 600.55 /
500.55 to 600.55 k
I have a problem with my current regex, because if numbers like "700,000" or "800,000" or "8.54" are in the text then it splits up the numbers and captures:
700,000 => "700","000"
800,000. => "800" , "000." , "8.", "54"
8.54 => "8.", "54"
Any ideas what to change? Current regex:
(\d+(?:\.?\d*)?\s*k?(?:\-|to)\s*\d+(?:\.?\d*)\s*k?|\d+(?:\.?\d*)\s*k?)

I suggest using optional groups instead of consecutive optional atoms, a [.,] character class instead of \. so that both separators are allowed, and \p{Pd} to match any dash:
/\d+(?:[.,]\d+)*(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)?/i
See the Rubular demo
If you want to make it more precise, the (?:[.,]\d+)* should be split into (?:,\d+)*(?:\.\d+)? (comma-separated digit groups followed by an optional decimal part):
/\d+(?:,\d+)*(?:\.\d+)?(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:,\d+)*(?:\.\d+)?(?:\s*k)?)?/i
Details:
\d+ - 1 or more digits
(?:[.,]\d+)* - 0+ sequences of . or , with 1 or more digits after
(?:\s*k)? - an optional sequence of 0+ whitespace + k / K
(?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)? - an optional sequence of:
\s*(?:\p{Pd}|to)\s* - any dash (\p{Pd}) or to, surrounded by 0+ whitespace chars
\d+(?:[.,]\d+)*(?:\s*k)? - see above.
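As a rough check of the looser [.,] pattern, the sketch below runs it through String#scan on a made-up sentence. The constant name NUMBER_RE and the sample text are illustrative assumptions, not part of the original post.

```ruby
# NUMBER_RE and the sample text are hypothetical, chosen only for illustration.
NUMBER_RE = /\d+(?:[.,]\d+)*(?:\s*k)?(?:\s*(?:\p{Pd}|to)\s*\d+(?:[.,]\d+)*(?:\s*k)?)?/i

text = 'Pay is 500k to 600k, was 700,000 or 800,000. Rate: 8.54'

# The pattern contains only non-capturing groups, so scan returns whole matches.
matches = text.scan(NUMBER_RE)
p matches
# => ["500k to 600k", "700,000", "800,000", "8.54"]
```

The trailing period after 800,000 is left out because [.,] must be followed by at least one digit.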

Related

How to replace all characters but for the first and last two with gsub Ruby

Given any email address I would like to leave only the first and last two characters and input 4 asterisks to the left and right of # character.
The best way to explain are examples:
lorem.ipsum#gmail.com changed to lo****#****om
foo#foo.de changed to fo****#****de
How to do it with gsub?
If you want to mask with a fixed number of * symbols, you may use
'lorem.ipsum#gmail.com'.sub(/\A(..).*#.*(..)\z/, '\1****#****\2')
# => lo****#****om
See the Ruby demo.
Here,
\A - start of string anchor
(..) - Group 1: first 2 chars
.*#.* - any 0+ chars other than line break chars as many as possible up to the last # followed with another set of 0+ chars other than line break ones
(..) - Group 2: last 2 chars
\z - end of string.
The \1 in the replacement string refers to the value kept in Group 1, and \2 references the value in Group 2.
If you want to mask existing chars while keeping their number, you might consider an approach to capture the parts of the string you need to keep or process, and manipulate the captures inside a sub block:
'lorem.ipsum#gmail.com'.sub(/\A(..)(.*)#(.*)(..)\z/) {
  $1 + "*"*$2.length + "#" + "*"*$3.length + $4
}
# => lo*********#*******om
See the Ruby demo
Details
\A - start of string
(..) - Group 1 capturing any 2 chars
(.*) - Group 2 capturing any 0+ chars, as many as possible, up to the last # char
# - a # char
(.*) - Group 3 capturing any 0+ chars, as many as possible, up to the last two chars
(..) - Group 4: last two chars
\z - end of string.
Note that inside the block, $1 contains Group 1 value, $2 holds Group 2 value, and so on.
Using gsub with look-ahead and look-behind regex patterns:
'lorem.ipsum#gmail.com'.gsub(/(?<=.{2}).*#.*(?=\S{2})/, '****#****')
=> "lo****#****om"
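Without a regex (String#first and String#last come from ActiveSupport, so this needs Rails; in plain Ruby use str[0, 2] and str[-2, 2] instead):
str.first(2) + '****#****' + str.last(2)
=> "lo****#****om"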
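For the fixed-width mask, a regex is not strictly needed. The following is a minimal sketch using only core String slicing; the helper name mask_fixed and its guard clause are assumptions added for illustration.

```ruby
# Hypothetical helper: keep the first and last two characters and put a
# fixed mask around the '#' separator. Input that is too short or has no
# '#' is returned unchanged (an assumption, not part of the question).
def mask_fixed(str)
  return str if str.length < 4 || !str.include?('#')

  str[0, 2] + '****#****' + str[-2, 2]
end

p mask_fixed('lorem.ipsum#gmail.com') # => "lo****#****om"
p mask_fixed('foo#foo.de')            # => "fo****#****de"
```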
I have a solution which doesn't fully solve your problem, but it's pretty flexible and I think it's worth sharing for anyone else looking for similar solutions.
module CoreExtensions
  module String
    module MaskChars
      def mask_chars(except_first_n: 1, except_last_n: 2, mask_with: '*')
        if except_first_n.zero? && except_last_n.zero?
          raise ArgumentError, "except_first_n and except_last_n can't both be zero"
        end
        if length < (except_first_n + except_last_n)
          raise ArgumentError, "String '#{self}' must be at least #{except_first_n}"\
                               " (except_first_n) + #{except_last_n} (except_last_n) ="\
                               " #{except_first_n + except_last_n} characters long"
        end
        sub(
          /\A(.{#{except_first_n}})(.*)(.{#{except_last_n}})\z/,
          '\1' + (mask_with * (length - (except_first_n + except_last_n))) + '\3'
        )
      end
    end
  end
end
Let me explain the regex in /\A(.{#{except_first_n}})(.*)(.{#{except_last_n}})\z/
\A - start of string
(.{#{except_first_n}}), or (.{1}) with the default - Group 1: first n chars. Default value of except_first_n is 1
(.*) - Group 2 capturing any 0+ chars, as many as possible, before the last n characters
(.{#{except_last_n}}), or (.{2}) with the default - Group 3: last n chars. Default value of except_last_n is 2
\z - end of string
Let me explain what's happening in '\1' + (mask_with * (length - (except_first_n + except_last_n))) + '\3'
We replace the string with Group 1 (\1) at the start, which holds as many characters as the except_first_n argument specifies. Group 2 is not reused: it is replaced with the mask_with character repeated length - (except_first_n + except_last_n) times, that is, the total length of the string minus the number of characters kept at both ends. This ensures that there is exactly the right number of mask_with characters between the first except_first_n and the last except_last_n characters. Group 3 (\3) is appended unchanged at the end.
Then I created an initializer file config/initializers/core_extensions.rb with this line:
String.include CoreExtensions::String::MaskChars
It will add mask_chars as an instance method to the String class available to all strings.
It should work like this:
account = "123456789101112"
=> "123456789101112"
account.mask_chars
=> "1************12"
account.mask_chars(except_first_n: 3, except_last_n: 4, mask_with: '#')
=> "123########1112"
I think this is a flexible method that can be useful in many scenarios.

Regex for selecting substrings before and after a string

I am trying to find the right regex to select the substrings on either side of another substring, which I'd like to exclude. For example, in this string:
11 - 12£ in $ + 13
I want to select 12£ and $. Basically, I want the substrings around in, extending until I hit one of a set of delimiters, in this case the arithmetic operators %w(+ - / *).
So far, the closest I got was this regex: /(.\d\p{Sc})\sin\s(\p{Sc})/
Some more examples:
10 - 12$ in £ - 13$ should return 12$ and £
12 $ in £ should return 12$ and £
100£in$ should return 100£ and $
sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip.split(/ *in */)
[^-+*\/]* matches zero or more characters that are not arithmetic operators
this therefore grabs everything between the "opening" and the "closing" operator that surround an in
#strip removes the leading and trailing whitespace
finally, split breaks the result into two strings, removing in and the spaces around it
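To see how this one-liner behaves on the examples from the question, here is a small sketch. The wrapper name around_in is an assumption introduced only for this illustration.

```ruby
# Hypothetical wrapper around the match/strip/split chain shown above.
def around_in(sentence)
  sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip.split(/ *in */)
end

p around_in('11 - 12£ in $ + 13')  # => ["12£", "$"]
p around_in('10 - 12$ in £ - 13$') # => ["12$", "£"]
p around_in('100£in$')             # => ["100£", "$"]
p around_in('12 $ in £')           # => ["12 $", "£"]
```

Note that in the last case the inner space of "12 $" is kept, so the result differs slightly from the "12$" the question asks for; removing it would need an extra step such as delete(' ').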
r = /
    \s+[+*\/-]\s+  # match 1+ whitespaces, 1 char in char class, 1+ whitespaces
    (\S+)          # match 1+ non-whitespaces in capture group 1
    \s+in\s+       # match 1+ whitespaces, 'in', 1+ whitespaces
    (\S+)          # match 1+ non-whitespaces in capture group 2
    \s+[+*\/-]\s   # match 1+ whitespaces, 1 char in char class, 1 whitespace
    /x             # free-spacing regex definition mode
str = '11 - 12£ in $ + 13 / 13F in % * 4'
str.scan(r)
#=> [["12£", "$"], ["13F", "%"]]
See the doc for String#scan to see how scan handles capture groups.
Note that '-' must be first or last in the character class [+*\/-].

recognize formatted numbers using regex

1 #valid
1,5 #valid
1,5, #invalid
,1,5 #invalid
1,,5 #invalid
#'nothing' is also invalid
The number of numbers separated by commas can be arbitrary.
I'm trying to use regex to do this. This is what I have tried so far, but none of it worked:
"1,2,," =~ /^[[\d]+[\,]?]+$/ #returned 0
"1,2,," =~ /^[\d\,]+$/ #returned 0
"1,2,," =~ /^[[\d]+[\,]{,1}]+$/ #returned 0
"1,2,," =~ /^[[\d]+\,]+$/ #returned 0
Obviously, I needed the expression to recognize that 1,2,, is invalid, but they all returned 0 :(
Your patterns do not work because:
^[[\d]+[\,]?]+$ - matches a line that consists of one or more digit, +, , or ? chars (so it matches all the strings above except the empty one)
^[\d\,]+$ - matches a line that consists of 1+ digits or , symbols
^[[\d]+[\,]{,1}]+$ - matches a line that consists of one or more digit, +, ,, { or } chars
^[[\d]+\,]+$ - matches a line that consists of one or more digit, + or , chars.
Basically, the issue is that you try to rely on a character class, while you need a grouping construct, (...).
Comma-separated whole numbers can be validated with
/\A\d+(?:,\d+)*\z/
See the Rubular demo.
Details:
\A - start of string
\d+ - 1+ digits
(?:,\d+)* - zero or more occurrences of:
, - a comma
\d+ - 1+ digits
\z - end of string.
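A quick way to check the pattern against the cases listed in the question is sketched below. The constant name CSV_INTS and the samples array are assumptions made for this example.

```ruby
# CSV_INTS is a hypothetical name for the validation pattern above.
CSV_INTS = /\A\d+(?:,\d+)*\z/

samples = ['1', '1,5', '1,5,', ',1,5', '1,,5', '']

# Map each sample to true/false depending on whether it validates.
results = samples.to_h { |s| [s, s.match?(CSV_INTS)] }
p results
# => {"1"=>true, "1,5"=>true, "1,5,"=>false, ",1,5"=>false, "1,,5"=>false, ""=>false}

# For contrast, a plain character class accepts the malformed input.
p '1,2,,'.match?(/\A[\d,]+\z/) # => true
```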

I need to grab the total number of items using regex

How can I grab the value 3 from "Page 1 of 3" given the below text:
Displaying Results Items 1 - 50 of 120, Page 1 of 3
If someone can briefly explain the regex that would be helpful.
The regular expression you need consists of a + quantifier (one or more times, greedy), a numerical character class [0-9], and a capturing group (...).
str = "Displaying Results Items 1 - 50 of 120, Page 1 of 3"
print str.match(/Page +[0-9]+ +of +([0-9]+)/)[1]
Explanation:
Page +     # Match `Page` followed by one or more spaces
[0-9]+     # Then one or more digits
 +of       # Then one or more spaces followed by `of`
 +         # Then one or more spaces
([0-9]+)   # Finally another sequence of digits, captured by the capturing group
There is a good reference here to learn more about RegExes.
You can do:
str.scan(/Page \d+ of (\d+)/) #=> [["3"]]
It tries to match the pattern "Page # of #" and grabs the capture group. If the pattern occurs several times in the string, every occurrence becomes part of the resulting array.
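When the text might not contain the "Page x of y" part at all, calling [1] on a nil match result raises NoMethodError. One possible alternative, sketched here, is String#[] with a regex and a group index, which returns nil instead. The variable names are illustrative.

```ruby
str = 'Displaying Results Items 1 - 50 of 120, Page 1 of 3'

# String#[] with a regex and a capture index returns that group, or nil
# when the pattern does not match.
total_pages = str[/Page +\d+ +of +(\d+)/, 1]
p total_pages      # => "3"
p total_pages.to_i # => 3

missing = 'No pagination here'[/Page +\d+ +of +(\d+)/, 1]
p missing          # => nil
```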

Ruby regex need to exclude pattern

I have the following strings
ALEXANDRITE OVAL 5.1x7.9 GIA# 6167482443 FINE w:1.16
ALEXANDRITE OVAL 4x6 FINE w:1.16
I want to match the 5.1 and 7.9 and the 4 and 6 and not w:1.16 or w: 1.16 or the 6167482443. So far I managed to come up with these:
Matching the w:1.16 w: 1.16
([w][:]\d\.?\d*|[w][:]\s?\d\.?\d*)
Matching the other digits:
\d+\.?\d{,3}
I kind of expected this not to return the long number sequence because of the {,3}, but it still does.
My questions are :
1. How do I combine the two patterns excluding one and returning the other?
2. How do I exclude the long sequence of numbers? Why is it not being excluded now?
Thanks!
You could simply use the below regex.
\b(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)
Explanation:
\b           the boundary between a word char (\w) and
             something that is not a word char
(            group and capture to \1:
  \d+          digits (0-9) (1 or more times)
  (?:          group, but do not capture (optional):
    \.           '.'
    \d+          digits (0-9) (1 or more times)
  )?           end of grouping
)            end of \1
x            'x'
(            group and capture to \2:
  \d+          digits (0-9) (1 or more times)
  (?:          group, but do not capture (optional):
    \.           '.'
    \d+          digits (0-9) (1 or more times)
  )?           end of grouping
)            end of \2
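Applied to the two sample strings from the question, the pattern can be exercised as in the sketch below. The constant name DIMENSIONS and the lines array are assumptions for illustration.

```ruby
# DIMENSIONS is a hypothetical name for the regex explained above.
DIMENSIONS = /\b(\d+(?:\.\d+)?)x(\d+(?:\.\d+)?)/

lines = [
  'ALEXANDRITE OVAL 5.1x7.9 GIA# 6167482443 FINE w:1.16',
  'ALEXANDRITE OVAL 4x6 FINE w:1.16'
]

# captures returns the two groups; &. guards against lines with no match.
dims = lines.map { |line| line.match(DIMENSIONS)&.captures }
p dims
# => [["5.1", "7.9"], ["4", "6"]]
```

Because the literal x must sit between the two numbers, the long GIA number and the w:1.16 value are never candidates.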
([\d.]+)x([\d.]+)
matches
5.1x7.9
4x6
(\d+(?:\.\d+)?)(?=x)|(?<=x)(\d+(?:\.\d+)?)
You can try this. See the demo:
http://regex101.com/r/wQ1oW3/6
2) To ignore the long number, you have to use \b\d{1,3}\b to specify boundaries.
http://regex101.com/r/wQ1oW3/7
Otherwise, a part of the long number will match.
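On the second question (why {,3} does not exclude the long number), the sketch below compares the original digit pattern with the bounded one. The variable names are illustrative assumptions.

```ruby
line = 'ALEXANDRITE OVAL 5.1x7.9 GIA# 6167482443 FINE w:1.16'

# \d+ is unbounded, so it swallows the whole long number before the
# optional \.?\d{,3} part is even considered.
greedy = line.scan(/\d+\.?\d{,3}/)
p greedy  # => ["5.1", "7.9", "6167482443", "1.16"]

# With word boundaries on both sides, only standalone runs of 1 to 3
# digits match, which drops the long number.
bounded = line.scan(/\b\d{1,3}\b/)
p bounded # => ["5", "9", "1", "16"]
```

The bounded pattern does skip 6167482443, but it also loses the digits next to x and still picks up the w:1.16 parts, which is why the x-anchored patterns above are a better fit for the dimensions themselves.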
