ignore a specific \n character while still enabling the m flag - ruby

I want to match characters across multiple lines, so I enabled the m flag. However, at one specific position I do not want to match \n; I want to match only a space there. But \s seems to match the newline too:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+\s.+,.+,.+\d+)/m
=> 0
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[ ].+,.+,.+\d+)/m
=> 3
The same happens even when I try to explicitly exclude the newline:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[^\n].+,.+,.+\d+)/m
=> 0
Why is the newline matching a space character? And what can I do to ensure that it does not and still matches characters across multiple lines everywhere else?

In the first pattern, \s matches the newline because \n is itself a whitespace character. The /\s(\d+[^\n].+,.+,.+\d+)/m pattern matches " 41\n6332 Hardin Rd, Bensalem, PA\n 19020" because of backtracking: \d+ first matches 41, but [^\n] then fails on the \n that follows. The regex engine steps back into \d+, which gives up the 1 and matches only 4. Since 1 is not a newline, [^\n] matches it and matching continues.
You may anchor the search at the start of the string and prevent backtracking with a possessive quantifier, also implementing the negative check with a lookahead:
/\A\s*(\d++(?!\n).+,.+,.+\d)/m
Details
\A - start of string
\s* - 0+ whitespaces
(\d++(?!\n).+,.+,.+\d) - Capturing group 1:
\d++(?!\n) - 1+ digits (matched possessively with ++ quantifier) not followed with a newline (as (?!\n) is a negative lookahead that fails the match if there is a newline immediately to the right of the current location)
.+,.+, - 2 occurrences of any 1+ chars as many as possible, followed with ,
.+\d - any 1+ chars as many as possible followed with a digit.
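To see the backtracking concretely, here is a minimal sketch that runs the question's string against the greedy pattern, an unanchored possessive variant, and the anchored pattern above. The constant names and the unanchored variant are assumptions made for illustration; they are not part of the original answer.

```ruby
ADDRESS = " 41\n6332 Hardin Rd, Bensalem, PA\n 19020"

# Greedy \d+ gives back the "1", so [^\n] can match it and the match starts at index 0.
GREEDY = /\s(\d+[^\n].+,.+,.+\d+)/m

# Possessive \d++ keeps "41", and the lookahead rejects the newline right after it.
# This unanchored variant is an illustrative assumption, not the answer's pattern.
POSSESSIVE = /\s(\d++(?!\n).+,.+,.+\d+)/m

# The answer's pattern: \A allows a match only at the very start of the string.
ANCHORED = /\A\s*(\d++(?!\n).+,.+,.+\d)/m

p(ADDRESS =~ GREEDY)      # => 0
p(ADDRESS =~ POSSESSIVE)  # => 3
p(ADDRESS[POSSESSIVE, 1]) # => "6332 Hardin Rd, Bensalem, PA\n 19020"
p(ADDRESS =~ ANCHORED)    # => nil
```

With this particular string the anchored pattern returns nil, because \A only permits a match starting at index 0. Dropping \A lets the engine move on and start the match at the newline at index 3.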

Related

Split a string by '":"' or a space after a numerical digit

I have a string like:
string = "roll:34 name:joshi ikera"
I want to split this string by the delimiting : and the space between the roll value and the name key. The output should look like this:
[roll, 34, name, joshi ikera]
I tried using:
string.split(/:|\d\s/)
but the output that I get is:
[roll, 3, name, joshi ikera]
How do I include the missing digit and just split by the space after the digit?
The \d\s pattern matches and consumes the digit before the whitespace, and String#split discards the consumed text. You need a lookaround, a lookbehind in this case, so that the digit is checked but not consumed: /:|(?<=\d)\s/ (see valtlai's comment). However, a more common approach in this scenario is to match 1 or more whitespace chars that are followed with 1+ word chars (if keys can only contain digits, letters and underscores) and then : (see Sagar's comment).
I suggest
s.split(/\s+(?=\w+:)|:/)
# => ["roll", "34", "name", "joshi ikera"]
Here,
\s+ - consumes 1+ whitespace chars
(?=\w+:) - that are followed with 1+ word chars and :
| - or
: - match and consume :.
Or, if the keys are unique
s.scan(/(\w+):(.*?)(?=\w+:|\z)/).to_h
# => {"roll"=>"34 ", "name"=>"joshi ikera"}
Here,
(\w+) - 1 or more word chars are captured into Group 1
: - a colon is matched
(.*?) - any 0+ chars other than line break chars, as few as possible, are captured into Group 2, up to the first position that is immediately followed with
(?=\w+:|\z) - either 1+ word chars and then : (\w+:) or (|) end of string (\z).
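Both approaches can be checked side by side with a short sketch. The constant names are hypothetical, and the strip call is an addition that removes the trailing space the lazy group leaves on "34 ".

```ruby
RECORD = "roll:34 name:joshi ikera"

# Split on a colon, or on whitespace that is followed by a "key:" token.
FIELDS = RECORD.split(/\s+(?=\w+:)|:/)
p FIELDS # => ["roll", "34", "name", "joshi ikera"]

# scan yields [key, value] pairs; strip drops the space left before the next key.
PAIRS = RECORD.scan(/(\w+):(.*?)(?=\w+:|\z)/)
              .map { |key, value| [key, value.strip] }
              .to_h
p PAIRS # => {"roll"=>"34", "name"=>"joshi ikera"}
```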

How do I match a regex in which the next non-space character is not a "/"?

How do I express in regex the letter "s" whose next non-space character is not a "/"?
These should match: "s", "str"
These should not: "s/m", "s /n"
I tried this
"str" =~ /s[^[[:space:]]]^\// #=> nil
but it does not even match the simple use case.
It seems you need to match any s that is not followed by zero or more whitespace characters and then a /.
Use
/s(?![[:space:]]*\/)/
Details
s - the letter s
(?![[:space:]]*\/) - a negative lookahead that fails the match if, immediately to the right of the current location, there are
[[:space:]]* - 0+ whitespaces
\/ - a /.
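As a quick check, the four sample strings from the question can be run against the lookahead pattern. This is only a sketch; the constant name is made up for illustration.

```ruby
S_NOT_BEFORE_SLASH = /s(?![[:space:]]*\/)/

# The first two samples should match, the last two should not.
["s", "str", "s/m", "s /n"].each do |sample|
  puts "#{sample.inspect}: #{S_NOT_BEFORE_SLASH.match?(sample)}"
end
```

The lookahead consumes nothing, so a successful match contains only the letter s itself.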
If you merely want to know the number of 's' characters that are not followed by zero or more spaces and then a forward slash (as opposed to their indices in the string), you don't have to use a regular expression.
"sea shells /by the sea s/hore".delete(" ").gsub("s/", "").count("s")
#=> 3
If you only want to know if there is at least one such 's' you could replace count("s") with include?("s").
I'm not arguing that this is preferable to the use of a regular expression.

Regex for selecting substrings before and after a string

I am trying to find the right regular expression to select the substrings on either side of another substring, which I'd like to exclude. For example, in this string:
11 - 12£ in $ + 13
I want to select 12£ and $. Basically, I want the substrings around in, extending until I hit one of a set of values that I want to use as start/end delimiters, in this case the arithmetic operators %w(+ - / *).
So far, the closest I have got is this regex: /(.\d\p{Sc})\sin\s(\p{Sc})/
Some more examples:
10 - 12$ in £ - 13$ should return 12$ and £
12 $ in £ should return 12$ and £
100£in$ should return 100£ and $
sentence.match(/[^-+*\/]*in[^-+*\/]*/).to_s.strip.split(/ *in */)
[^-+*\/]* matches zero or more characters that are not arithmetic operators
this will hence get everything from the "opening" to the "closing" operator that surround an in
#strip removes the leading and trailing whitespaces
finally, split into two strings, removing in and the spaces around it
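The steps above can be wrapped in a small helper and tried on the question's examples. The method and constant names here are hypothetical, chosen only for this sketch.

```ruby
# Everything between the operators that surround "in".
AROUND_IN = /[^-+*\/]*in[^-+*\/]*/

def operands_around_in(sentence)
  # match -> text between the operators; strip -> trim edges; split -> drop "in".
  sentence.match(AROUND_IN).to_s.strip.split(/ *in */)
end

p operands_around_in("11 - 12£ in $ + 13")  # => ["12£", "$"]
p operands_around_in("10 - 12$ in £ - 13$") # => ["12$", "£"]
p operands_around_in("100£in$")             # => ["100£", "$"]
p operands_around_in("12 $ in £")           # => ["12 $", "£"]
```

Note that the last call keeps the inner space ("12 $" rather than "12$"), because only the spaces directly around in are removed by the split.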
r = /
\s+[+*\/-]\s+ # match 1+ whitespaces, 1 char in char class, 1+ whitespaces
(\S+) # match 1+ non-whitespaces in capture group 1
\s+in\s+ # match 1+ whitespaces, 'in', 1+ whitespaces
(\S+) # match 1+ non-whitespaces in capture group 2
\s+[+*\/-]\s # match 1+ whitespaces, 1 char in char class, 1 whitespace
/x # free-spacing regex definition mode
str = '11 - 12£ in $ + 13 / 13F in % * 4'
str.scan(r)
#=> [["12£", "$"], ["13F", "%"]]
See the doc for String#scan to see how scan handles capture groups.
Note that '-' must be first or last in the character class [+*\/-], unless it is escaped.

recognize formatted numbers using regex

1 #valid
1,5 #valid
1,5, #invalid
,1,5 #invalid
1,,5 #invalid
#'nothing' is also invalid
The number of numbers separated by commas can be arbitrary.
I'm trying to use regex to do this. This is what I have tried so far, but none of it worked:
"1,2,," =~ /^[[\d]+[\,]?]+$/ #returned 0
"1,2,," =~ /^[\d\,]+$/ #returned 0
"1,2,," =~ /^[[\d]+[\,]{,1}]+$/ #returned 0
"1,2,," =~ /^[[\d]+\,]+$/ #returned 0
Obviously, I needed the expression to recognize that 1,2,, is invalid, but they all returned 0 :(
Your patterns do not work because:
^[[\d]+[\,]?]+$ - matches a line that consists of one or more digit, +, ,, ? chars (and matches all the strings above but the last empty one)
^[\d\,]+$ - matches a line that consists of 1+ digits or , symbols
^[[\d]+[\,]{,1}]+$ - matches a line that consists of one or more digit, +, ,, { and } chars
^[[\d]+\,]+$ - matches a line that consists of one or more digit, +, and , chars.
Basically, the issue is that you try to rely on a character class, while you need a grouping construct, (...).
Comma-separated whole numbers can be validated with
/\A\d+(?:,\d+)*\z/
Details:
\A - start of string
\d+ - 1+ digits
(?:,\d+)* - zero or more occurrences of:
, - a comma
\d+ - 1+ digits
\z - end of string.
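A short sketch that runs the pattern against every case listed in the question. The constant and method names are assumptions made for illustration, and Regexp#match? needs Ruby 2.4 or later.

```ruby
COMMA_SEPARATED_INTEGERS = /\A\d+(?:,\d+)*\z/

def valid_number_list?(input)
  COMMA_SEPARATED_INTEGERS.match?(input)
end

# Only the first two inputs should be reported as valid.
["1", "1,5", "1,5,", ",1,5", "1,,5", ""].each do |input|
  puts "#{input.inspect} => #{valid_number_list?(input)}"
end
```

The \A and \z anchors are used instead of ^ and $ so that a multi-line string cannot pass by having just one valid line.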

Confusion with the output of /\S\W/ and /\W\S/ in Ruby 1.9.3

I have just become familiar with \W and \S. I was experimenting to see how they behave and tried the following:
> s="abd12 de 5t6"
=> "abd12 de 5t6" #understood
> /\W/ =~ s
=> 5 #understood
> /\W\S/ =~ s
=> 5 #Confusion(A)
> /\S\W/ =~ s
=> 4 #Confusion(B)
> /\S/ =~ s
=> 0 #understood
>
What logic produces 5 in Part A and 4 in Part B? I just want to clear up my understanding. In Part A, 5 is the index of a non-word character, but that character is a space, so it is not a non-whitespace character.
How does IRB evaluate the expressions in Confusion A and B?
Thanks
When you have \W\S in your regular expression, you are essentially saying: "Find a match in the string where a character is a non-word character, followed by a non-space character."
In Confusion A the first non-word character is the first space (at index 5). The next character right after it is the d which is a non-space character. That's a match and therefore returns 5 since that's the index where the match began.
Similarly, for the \S\W the first non-space character is a, but it's followed by b which is a word character, so the match doesn't work yet. Once it gets to the 2 (position 4), that matches a non-space character and it is followed by the space which is a non-word character.
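Printing the matched text next to the index makes this easier to follow. The sketch below reuses the string from the question; the constant name is hypothetical.

```ruby
SAMPLE = "abd12 de 5t6"

# Show where each pattern first matches and which characters it consumed.
[/\W/, /\W\S/, /\S\W/, /\S/].each do |pattern|
  match = pattern.match(SAMPLE)
  puts "#{pattern.inspect}: index #{match.begin(0)}, matched #{match[0].inspect}"
end
```

The two-character patterns report the index of the first character of the pair: " d" starts at 5 and "2 " starts at 4.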
