I want to match character pairs in a string. Let's say the string is:
"zttabcgqztwdegqf". Both "zt" and "gq" are matching pairs of characters in the string.
The following code finds the "zt" matching pair, but not the "gq" pair:
#!/usr/bin/env ruby
string = "zttabcgqztwdegqf"
puts string.scan(/.{1,2}/).detect{ |c| string.count(c) > 1 }
The code provides matching pairs where the indices of the pairs are 0&1,2&3,4&5... but not 1&2,3&4,5&6, etc:
zt
ta
bc
gq
zt
wd
eg
qf
I'm not sure regex in Ruby is the best way to go. But I want to use Ruby for the solution.
You can do your search with a single regex:
puts string.scan(/(?=(.{2}).*\1)/)
regex101 demo
Output
zt
gq
Regex Breakout
(?= # Start a lookahead
(.{2}) # Search any couple of char and group it in \1
.*\1 # Search ahead in the string for another \1 to validate
) # Close lookahead
Note
Putting all the checks inside lookahead assure the regex engine does not consume the couple when validates it.
So it also works with overlapping couples like in the string abcabc: the output will correctly be ab,bc.
Oddity
If the regex engine does not consume the chars how it can reach the end of the string?
Internally after the check Onigmo (the ruby regex engine) makes one step further automatically. Most regex flavours behaves in this way but e.g. the javascript engine needs the programmer to increment the last match index manually.
str = "ztcabcgqzttwtcdegqf"
r = /
(.) # match any character in capture group 1
(?= # begin a positive lookahead
(.) # match any character in capture group 2
.+ # match >= 1 characters
\1 # match capture group 1
\2 # match capture group 2
) # close positive lookahead
/x # extended/free-spacing regex definition mode
str.scan(r).map(&:join)
#=> ["zt", "tc", "gq"]
Here is one way to do this without using regex:
string = "zttabcgqztwdegqf"
p string.split('').each_cons(2).map(&:join).select {|i| string.scan(i).size > 1 }.uniq
#=> ["zt", "gq"]
Related
I am busy working through some problems I have found on the net and I feel like this should be simple but I am really struggling.
Say you have the string 'AbcDeFg' and the next string of 'HijKgLMnn', I want to be able to find the same characters in the string so in this case it would be 'g'.
Perhaps I wasn't giving enough information - I am doing Advent of Code and I am on day 3. I just need help with the first bit which is where you are given a string of characters - you have to split the characters in half and then compare the 2 strings. You basically have to get the common character between the two. This is what I currently have:
file_data = File.read('Day_3_task1.txt')
arr = file_data.split("\n")
finals = []
arr.each do |x|
len = x.length
divided_by_two = len / 2
second = x.slice!(divided_by_two..len).split('')
first = x.split('')
count = 0
(0..len).each do |z|
first.each do |y|
if y == second[count]
finals.push(y)
end
end
count += 1
end
end
finals = finals.uniq
Hope that helps in terms of clarity :)
Did you try to convert both strings to arrays with the String#char method and find the intersection of those arrays?
Like this:
string_one = 'AbcDeFg'.chars
string_two = 'HijKgLMnn'.chars
string_one & string_two # => ["g"]
One way to do that is to use the method String#scan with the regular expression
rgx = /(.)(?!.*\1.*_)(?=.*_.*\1)/
I'm not advocating this approach. I merely thought some readers might find it interesting.
Suppose
str1 = 'AbcDgeFg'
str2 = 'HijKgLMnbn'
Now form the string
str = "#{str1}_#{str2}"
#=> "AbcDeFg_HijKgLMnbn"
I've assumed the strings contain letters only, in which case they are separated in str with any character other than a letter. I've used an underscore. Naturally, if the strings could contain underscores a different separator would have to be used.
We then compute
str.scan(rgx).flatten
#=> ["b", "g"]
Array#flatten is needed because
str.scan(rgx)
#=>[["b"], ["g"]]
The regular expression can be written in free-spacing mode to make it self-documenting:
rgx =
/
(.) # match any character, same to capture group 1
(?! # begin a negative lookahead
.* # match zero or more characters
\1 # match the contents of capture group 1
.* # match zero or more characters
_ # match an underscore
) # end the negative lookahead
(?= # begin a positive lookahead
.* # match zero or more characters
_ # match an underscore
.* # match zero or more characters
\1 # match the contents of capture group 1
) # end the positive lookahead
/x # invoke free-spacing regex definition mode
Note that if a character appears more than once in str1 and at least once in str2 the negative lookahead ensures that only the last one in str1 is matched, to avoid returning duplicates.
Alternatively, one could write
str.gsub(rgx).to_a
The uses the (fourth) form of String#gsub which takes a single argument and no block and returns an enumerator.
I am writing some code that needs to convert a string to camel case. However, I want to allow any _ or - at the beginning of the code.
I have had success matching up an _ character using the regex here:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method on the first match if it looks like nerd_godess_of.
I am having troubled adding - matches to the same, I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here
The fact that
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
does not match the second hyphen in "eslint-global-path" is because of the anchor ^ which limits the match to be on the first hyphen only. This regex reads, "Match the beginning of the line, not followed by a hyphen or underscore, then match one or more words characters (including underscores), a hyphen or underscore, and then one or more word characters in a capture group. Lastly, do not match a hyphen or underscore at the end of the line."
The fact that an underscore (but not a hyphen) is a word (\w) character completely messes up the regex. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
Try this.
r = /
(?<= # begin a positive lookbehind
[^_-] # match a character other than an underscore or hyphen
) # end positive lookbehind
( # begin capture group 1
(?: # begin a non-capture group
-+ # match one or more hyphens
| # or
_+ # match one or more underscores
) # end non-capture group
[^_-] # match any character other than an underscore or hyphen
) # end capture group 1
/x # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
(?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen
you can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$
see the demo here.
Switch _- to -_ so that - is not treated as a range op, as in a-z.
I do not have access to the code, this is via an interface that only allows me to edit the regex that parses user responses. I need to extract the weight after users text, where they text things like:
wt 172.5
172.5 lbs
180
wt. 173.22
172,5
I need to capture the weight as a float field, but I want to restrict it to at most 1 decimal place. I tried using /(?<val>[\d+((\.|,)\d\d?)?]/ but it is only saving the first digit "1" in the field
Sometimes what seems most simple is not. I suggest using this regex:
r = /(?<=\A|\s)\d+(?:[.,]\d)?(?=\d|\s|\z)/
We can alternatively define the regex using extended or free-spacing mode (by adding the modifier x after the final /), which allows us to include documentation:
r = /
(?<=\A|\s) # match beginning of string or space in a positive lookbehind
\d+ # match one or more digits
(?:[.,]\d)? # optionally (? after non-capture group) match a . or , then a digit
(?=\d|\s|\z) # match a digit, space or the end of the string in a positive lookahead
/x
"wt 172.5"[r] #=> "172.5"
"172.5 lbs"[r] #=> "172.5"
"180"[r] #=> "180"
"wt. 173.22"[r] #=> "173.2"
"172,5"[r] #=> "172,5"
"A1 143.66"[r] #=> "143.6"
"A1 1.3.4 43.6"[r] #=> "43.6"
\d+(?:[,.]\d{1,2})?
Guess you wanted this .[] is character class,not what you think.Your character class captures just one out of all characters you have defined.
See demo.
https://regex101.com/r/eB8xU8/12
I'm trying to grab id number from the string, say
id/number/2000GXZ2/ref=sr
using
(?:id\/number\/)([a-zA-Z0-9]{8})
for some reason non capture group is not worked, giving me:
id/number/2000GXZ2
As mentioned by others, non-capturing groups still count towards the overall match. If you don't want that part in your match use a lookbehind.
Rubular example
(?<=id\/number\/)([a-zA-Z0-9]{8})
(?<=pat) - Positive lookbehind assertion: ensures that the preceding characters match pat, but doesn't include those characters in the matched text
Ruby Doc Regexp
Also, the capture group around the id number is unnecessary in this case.
You have:
str = "id/number/2000GXZ2/ref=sr"
r = /
(?:id\/number\/) # match string in a non-capture group
([a-zA-Z0-9]{8}) # match character in character class 8 times, in capture group 1
/x # extended/free-spacing regex definition mode
Then (using String#[]):
str[r]
#=> "id/number/2000GXZ2"
returns the entire match, as it should, not just the contents of capture group 1. There are a few ways to remedy this. Consider first ones that do not use a capture group.
#jacob.m suggested putting the first part in a positive lookbehind (modified slightly from his code):
r = /
(?<=id\/number\/) # match string in positive lookbehind
[[:alnum:]]{8} # match >= 1 alphameric characters
/x
str[r]
#=> "2000GXZ2"
An alternative is:
r = /
id\/number\/ # match string
\K # forget everything matched so far
[[:alnum:]]{8} # match 8 alphanumeric characters
/x
str[r]
#=> "2000GXZ2"
\K is especially useful when the match to forget is variable-length, as (in Ruby) positive lookbehinds do not work with variable-length matches.
With both of these approaches, if the part to be matched contains only numbers and capital letters, you may want to use [A-Z0-9]+ instead of [[:alnum:]] (though the latter includes Unicode letters, not just those from the English alphabet). In fact, if all the entries have the form of your example, you might be able to use:
r = /
\d # match a digit
[A-Z0-9]{7} # match >= 0 capital letters or digits
/x
str[r]
#=> "2000GXZ2"
The other line of approach is to keep your capture group. One simple way is:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match >= 1 alphameric characters in capture group 1
/x
str =~ r
str[r, 1] #=> "2000GXZ2"
Alternatively, you could use String#sub to replace the entire string with the contents of the capture group:
r = /
id\/number\/ # match string
([[:alnum:]]{8}) # match >= 1 alphameric characters in capture group 1
.* # match the remainder of the string
/x
str.sub(r, '\1') #=> "2000GXZ2"
str.sub(r, "\\1") #=> "2000GXZ2"
str.sub(r) { $1 } #=> "2000GXZ2"
This is Ruby Regexp expected match consistency evilness. Some Regexp-style methods will return the global-match while others will return specified matches.
In this case, one method we can use to get the behavior you're looking for is scan.
I don't think anyone here actually mentions how to get your Regexp working as you originally intended, which was to get the capture-only match. To do that, you would use the scan method like so with your original pattern:
test_me.rb
test_string="id/number/2000GXZ2/ref=sr"
result = test_string.scan(/(?:id\/number\/)([a-zA-Z0-9]{8})/)
puts result
2000GXZ2
That said, replacing (?:) with (?<=) for non-capture groups for look-behinds will benefit you both when you use scan as well as other parts of ruby that use Regexps.
I have a specific RegEx question, simply because I'm wracking my brain trying to isolate one item, rather than multiple.
Here is my text I am searching:
"{\"1430156203913\"=>\"ABC\", \"1430156218698\"=>\"DEF\", \"1430156219763\"=>\"GHI\", \"1430156553620\"=>\"JKL", \"1430156793764\"=>\"MNO\", \"1430156799454\"=>\"PQR\"}"
What I would like to do is capture the key associated with ABC as well as the key associated with GHI
The first is easy enough, and I'm capturing it with this RegEx:
/\d.*ABC/ maps to: "1430156203913\"=>\"ABC.
Then, I'd just use /\d/ to pull out the 1430156203913 key that I'm looking for.
The second is what I'm having difficulty with.
This does not work:
/\d.*GHI/ maps to the start of the first digit to my final string (GHI) ->
1430156203913\"=>\"ABC\", \"1430156218698\"=>\"DEF\", \"1430156219763\"=>\"GHI
Question:
How can I edit this second Regex to just capture 1430156219763?
To capture all the keys and values, use String#scan.
pairs = s.scan /"(\d+)"=>"([[:upper:]]+)"/
# [["1430156203913", "ABC"],
# ["1430156218698", "DEF"],
# ["1430156219763", "GHI"],
# ["1430156553620", "JKL"],
# ["1430156793764", "MNO"],
# ["1430156799454", "PQR"]]
Then to get any k/v pair you want, hashify and find them.
Hash[pairs].invert.values_at('ABC', 'DEF')
# ["1430156203913", "1430156218698"]
\d[^,]*GHI
You can simply use this.See demo.
https://regex101.com/r/yW3oJ9/3
Try this one to capture only the key
\b\d+\b(?=\\"=>\\"GHI)
check this Demo
Notice I've added \b around d+ just for a better perfromance
This is one way you could do that:
def extract(str, key)
r = /
(?<=\") # match \" in a positive lookbehind
\d+ # match one or more digits
(?= # begin a positive lookahead
\"=>\" # match \"=>\"
#{key} # match value of key
\" # match \"
) # end positive lookahead
/x
str[r]
end
extract str, "ABC" #=> "1430156203913"
extract str, "GHI" #=> "1430156219763"