Regex for selecting substrings before and after a string - ruby

I am trying to find the right regex to select the substrings on either side of another substring, which I'd like to exclude. For example, in this string:
11 - 12£ in $ + 13
I want to select 12£ and $. Basically, I want the substrings on either side of in, each extending until it hits one of a set of values I want to use as start/end delimiters, in this case the arithmetic operators %w(+ - / *).
So far, the closest I got was using this regex: /(.\d\p{Sc})\sin\s(\p{Sc})/
Some more examples:
10 - 12$ in £ - 13$ should return 12$ and £
12 $ in £ should return 12$ and £
100£in$ should return 100£ and $

sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip.split(/ *in */)
[^-+*\/]* matches zero or more characters that are not arithmetic operators
The whole pattern therefore grabs everything between the "opening" and the "closing" operator that surround an in
#strip removes the leading and trailing whitespace
Finally, split cuts the result into two strings, removing in and the spaces around it
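As a quick check, the one-liner above can be wrapped in a small helper and run against the examples from the question. This is only a sketch: the method name around_in is made up for illustration, and the delete(' ') step is an assumption about how "12 $" is meant to become "12$".

```ruby
# Hypothetical helper (name chosen for illustration) around the one-liner above.
def around_in(sentence)
  # Operator-free stretch of text that contains "in";
  # nil.to_s is "" when nothing matches.
  stretch = sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip
  # Cut at "in" plus the spaces around it.
  # Assumption: spaces inside a part, as in "12 $", are unwanted.
  stretch.split(/ *in */).map { |part| part.delete(' ') }
end

p around_in('11 - 12£ in $ + 13')  # => ["12£", "$"]
p around_in('10 - 12$ in £ - 13$') # => ["12$", "£"]
p around_in('12 $ in £')           # => ["12$", "£"]
p around_in('100£in$')             # => ["100£", "$"]
```

Note that in is not anchored to word boundaries here, so a word that merely contains "in" would be treated as the separator too.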

r = /
\s+[+*\/-]\s+ # match 1+ whitespaces, 1 char in char class, 1+ whitespaces
(\S+) # match 1+ non-whitespaces in capture group 1
\s+in\s+ # match 1+ whitespaces, 'in', 1+ whitespaces
(\S+) # match 1+ non-whitespaces in capture group 2
\s+[+*\/-]\s # match 1+ whitespaces, 1 char in char class, 1 whitespace
/x # free-spacing regex definition mode
str = '11 - 12£ in $ + 13 / 13F in % * 4'
str.scan(r)
#=> [["12£", "$"], ["13F", "%"]]
See the doc for String#scan to see how scan handles capture groups.
Note that '-' must be first or last in the character class [+*\/-] (or be escaped); otherwise it is read as a range operator.
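To illustrate how scan treats capture groups, here is a minimal sketch. The pattern is deliberately simplified (it drops the operator checks from the answer's regex, which is an assumption made for brevity), and the constant names are only for illustration.

```ruby
STR = '11 - 12£ in $ + 13 / 13F in % * 4'

# With capture groups, scan returns one array of group values per match.
WITH_GROUPS = STR.scan(/(\S+)\s+in\s+(\S+)/)
p WITH_GROUPS    # => [["12£", "$"], ["13F", "%"]]

# Without capture groups, scan returns the full text of each match.
WITHOUT_GROUPS = STR.scan(/\S+\s+in\s+\S+/)
p WITHOUT_GROUPS # => ["12£ in $", "13F in %"]
```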

Related

ignore a specific \n character while still enabling the m flag

I want to match characters across multiple lines, so I enabled the m flag. However, at one specific position I do not want to match a \n; there I want to match a space only. But it seems that \s matches the newline too:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+\s.+,.+,.+\d+)/m
=> 0
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[ ].+,.+,.+\d+)/m
=> 3
The same happens even when I try to explicitly exclude the newline:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[^\n].+,.+,.+\d+)/m
=> 0
Why is the newline matching a space character? And what can I do to ensure that it does not and still matches characters across multiple lines everywhere else?
The /\s(\d+[^\n].+,.+,.+\d+)/m pattern matches " 41\n6332 Hardin Rd, Bensalem, PA\n 19020" because of backtracking. After \d+ matches 41, the regex engine gets to [^\n], but the next char is \n, so that step fails. The engine then tries to match the string differently: it steps back into \d+ and lets it match only 4. Now [^\n] matches 1, which is not a newline, so matching continues.
You may anchor the search at the start of the string and prevent backtracking with a possessive quantifier, also implementing the negative check with a lookahead:
/\A\s*(\d++(?!\n).+,.+,.+\d)/m
Details
\A - start of string
\s* - 0+ whitespaces
(\d++(?!\n).+,.+,.+\d) - Capturing group 1:
\d++(?!\n) - 1+ digits (matched possessively with ++ quantifier) not followed with a newline (as (?!\n) is a negative lookahead that fails the match if there is a newline immediately to the right of the current location)
.+,.+, - 2 occurrences of any 1+ chars as many as possible, followed with ,
.+\d - any 1+ chars as many as possible followed with a digit.
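The backtracking described above, and the effect of the possessive quantifier, can be observed directly. The following is a sketch run on the string from the question; the constant names are illustrative, and the second pattern (unanchored, with the question's trailing \d+) is a variation added for comparison, not the answer's exact regex.

```ruby
ADDRESS = " 41\n6332 Hardin Rd, Bensalem, PA\n 19020"

# Greedy \d+ first takes "41", then gives back "1" so that [^\n] can match it.
GREEDY = ADDRESS.match(/\s(\d+)([^\n])/)
p [GREEDY[1], GREEDY[2]] # => ["4", "1"]

# Possessive \d++ keeps "41", so the lookahead rejects position 0 and the
# first match starts at the newline in front of "6332" (unanchored variant).
POSSESSIVE_AT = (ADDRESS =~ /\s(\d++(?!\n).+,.+,.+\d+)/m)
p POSSESSIVE_AT # => 3

# Anchored with \A as in the answer: the first digit run is followed by a
# newline and there is no other place to start, so nothing matches.
ANCHORED_AT = (ADDRESS =~ /\A\s*(\d++(?!\n).+,.+,.+\d)/m)
p ANCHORED_AT # => nil
```

So on this particular input the anchored pattern rejects the string outright; whether that or the unanchored behaviour is wanted depends on what the surrounding code expects.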

Split a string by '":"' or a space after a numerical digit

I have a string like:
string = "roll:34 name:joshi ikera"
I want to split this string by the delimiting : and the space between the roll value and the name key. The output should look like this:
[roll, 34, name, joshi ikera]
I tried using:
string.split(/:|\d\s/)
but the output that I get is:
[roll, 3, name, joshi ikera]
How do I include the missing digit and just split by the space after the digit?
The \d\s matches and consumes the digit before a whitespace, and the consumed text is dropped by the String#split method. You need to use a lookaround, a lookbehind in this case, to make it a non-consuming pattern part: /:|(?<=\d)\s/ (see valtlai's comment). However, a more common approach in this scenario is to match 1 or more whitespace chars that are followed with 1+ word chars (if keys can only contain digits, letters and underscores) followed with : (see Sagar's comment).
I suggest
s.split(/\s+(?=\w+:)|:/)
# => ["roll", "34", "name", "joshi ikera"]
Here,
\s+ - consumes 1+ whitespace chars
(?=\w+:) - that are followed with 1+ word chars and :
| - or
: - match and consume :.
Or, if the keys are unique
s.scan(/(\w+):(.*?)(?=\w+:|\z)/).to_h
# => {"roll"=>"34 ", "name"=>"joshi ikera"}
Here,
(\w+) - 1 or more word chars are captured into Group 1
: - a colon is matched
(.*?) - any 0+ chars other than line break chars, as few as possible, are captured into Group 2, up to the first position that is immediately followed with
(?=\w+:|\z) - either 1+ word chars and then : (\w+:) or (|) end of string (\z).
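The variants discussed above can be compared side by side. This is a sketch only: the second input string (a value containing a digit) is a made-up example to show where the lookbehind and lookahead versions differ, and the strip step is one possible way to drop the space that scan leaves after "34".

```ruby
STRING = "roll:34 name:joshi ikera"

# Lookbehind variant: split at ":" or at whitespace preceded by a digit.
p STRING.split(/:|(?<=\d)\s/)   # => ["roll", "34", "name", "joshi ikera"]

# Lookahead variant: split at ":" or at whitespace followed by "key:".
p STRING.split(/\s+(?=\w+:)|:/) # => ["roll", "34", "name", "joshi ikera"]

# Hypothetical input where a value contains a digit followed by a space.
TRICKY = "roll:34 name:joshi 2 ikera"
p TRICKY.split(/:|(?<=\d)\s/)   # => ["roll", "34", "name", "joshi 2", "ikera"]
p TRICKY.split(/\s+(?=\w+:)|:/) # => ["roll", "34", "name", "joshi 2 ikera"]

# scan keeps the space before the next key, so strip each value.
PAIRS = STRING.scan(/(\w+):(.*?)(?=\w+:|\z)/).map { |k, v| [k, v.strip] }.to_h
p PAIRS # => {"roll"=>"34", "name"=>"joshi ikera"}
```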

recognize formatted numbers using regex

1 #valid
1,5 #valid
1,5, #invalid
,1,5 #invalid
1,,5 #invalid
#'nothing' is also invalid
The number of numbers separated by commas can be arbitrary.
I'm trying to use regex to do this. This is what I have tried so far, but none of it worked:
"1,2,," =~ /^[[\d]+[\,]?]+$/ #returned 0
"1,2,," =~ /^[\d\,]+$/ #returned 0
"1,2,," =~ /^[[\d]+[\,]{,1}]+$/ #returned 0
"1,2,," =~ /^[[\d]+\,]+$/ #returned 0
Obviously, I needed the expression to recognize that 1,2,, is invalid, but they all returned 0 :(
Your patterns are not really working because:
^[[\d]+[\,]?]+$ - matches a line that consists of one or more digit, +, ,, ? chars (and matches all the strings above but the last empty one)
^[\d\,]+$ - matches a line that consists of 1+ digits or , symbols
^[[\d]+[\,]{,1}]+$ - matches a line that consists of one or more digit, +, ,, { and } chars
^[[\d]+\,]+$ - matches a line that consists of one or more digit, +, and , chars.
Basically, the issue is that you try to rely on a character class, while you need a grouping construct, (...).
Comma-separated whole numbers can be validated with
/\A\d+(?:,\d+)*\z/
Details:
\A - start of string
\d+ - 1+ digits
(?:,\d+)* - zero or more occurrences of:
, - a comma
\d+ - 1+ digits
\z - end of string.
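Running the pattern against the strings from the question gives a quick sanity check. This sketch assumes Ruby 2.4 or later for Regexp#match?; the constant names are illustrative.

```ruby
NUMBER_LIST = /\A\d+(?:,\d+)*\z/

SAMPLES = ['1', '1,5', '1,5,', ',1,5', '1,,5', '']
RESULTS = SAMPLES.map { |s| NUMBER_LIST.match?(s) }
p RESULTS # => [true, true, false, false, false, false]

# A character class only lists allowed characters; it does not check order:
p('1,2,,' =~ /^[\d\,]+$/)      # => 0

# \A and \z are stricter than ^ and $: $ also matches just before a
# trailing newline, \z does not.
p("1,5\n" =~ /^\d+(?:,\d+)*$/) # => 0
p("1,5\n" =~ NUMBER_LIST)      # => nil
```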

Remove Certain Alphanumeric Characters from a String in Ruby

I have to validate a string based on its first alphanumeric character. Certain characters can be part of the string, but if they are at the beginning then they have to be ignored.
For example:
--- BATest- 1 --
should be:
BATest-1
How do I remove dashes from beginning and end but not from middle?
To add to my question: can the first alphanumeric character decide if following alphanumeric characters are to be removed or not?
That is, if the first character is A, then nothing would need to be removed and a validation error would be thrown; if it is B, then the string would be stripped as mentioned above.
r = /
--+ # Match at least two hyphens
| # or
\s # Match a whitespace character
/x # Free-spacing regex definition mode
'--- BATest- 1 --'.gsub r, ""
#=> "BATest-1"
You asked to remove the dashes from the beginning and the end:
"--- BATest- 1 --".gsub(/^-+|-+$|\s/, "")
# => "BATest-1"
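The two patterns above give the same result on the question's string, but they differ when a run of hyphens sits in the middle. A small sketch, using a made-up input that is not from the question:

```ruby
# Hypothetical input with a double hyphen in the middle.
SAMPLE = '--- BA--Test- 1 --'

# Removes every run of two or more hyphens, wherever it occurs.
BY_RUN = SAMPLE.gsub(/--+|\s/, '')
p BY_RUN  # => "BATest-1"

# Removes hyphens only at the start or end of a line.
BY_EDGE = SAMPLE.gsub(/^-+|-+$|\s/, '')
p BY_EDGE # => "BA--Test-1"
```

If the string can never contain line breaks, \A and \z could be used in place of ^ and $ to make the "start and end of the string" intent explicit.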

How to insert tag every 5 characters in a Ruby String?

I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
scan collects all matches of the regexp into an array. The .{5} matches any 5 characters. If fewer than 5 characters are left at the end of the string, they are matched by the .+ alternative. Finally, join the array with your separator string.
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking fewer is to accommodate the last match, where there may not be enough characters left over.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Here is a solution that is adapted from the answer to a recent question:
class String
  def in_groups_of(n, sep = ' ')
    chars.each_slice(n).map(&:join).join(sep)
  end
end
p 'HelloWorld-Hello guys'.in_groups_of(5, '<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')
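The parametric version can be wrapped in a method and checked against the expected outcome from the question. This is a sketch; the method name insert_every and its parameters are made up for illustration.

```ruby
# Hypothetical helper: insert sep after every n non-whitespace characters.
def insert_every(str, n, sep)
  # Each token is one non-whitespace char plus the whitespace before it;
  # trailing whitespace sticks to the last token, and a whitespace-only
  # string is kept as a single token.
  tokens = str.scan(/\s*\S(?:\s+\z)?|\s+\z/)
  tokens.each_slice(n).map(&:join).join(sep)
end

p insert_every('HelloWorld-Hello guys', 5, '<wbr>')
# => "Hello<wbr>World<wbr>-Hell<wbr>o guys"
p insert_every('ab ', 5, '<wbr>') # => "ab "
```

Because the pattern contains only non-capturing groups, scan returns plain strings here, which is what each_slice and join need.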

Resources