Regex matching except when pattern is after another pattern - ruby

I am looking to find method names for python functions. I only want to find method names if they aren't after "def ". E.g.:
"def method_name(a, b):" # (should not match)
"y = method_name(1,2)" # (should find `method_name`)
My current regex is /\W(.*?)\(/.

str = "def no_match(a, b):\ny = match(1,2)"
str.scan(/(?<!def)\s+\w+(?=\()/).map(&:strip)
#⇒ ["match"]
The regex comments:
negative lookbehind for def,
followed by spaces (will be stripped later),
followed by one or more word symbols \w,
followed by positive lookahead for parenthesis.
Sidenote: one should never use regexps to parse long strings for any purpose.
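As a quick check of the pattern above on a slightly longer snippet, here is a minimal sketch. The constant name `CALL_AFTER_SPACE`, the helper `called_names` and the sample Python source are assumptions made for illustration; they are not part of the question.

```ruby
# Pattern from the answer above: whitespace that is not preceded by "def",
# then one or more word characters, then a lookahead for "(".
CALL_AFTER_SPACE = /(?<!def)\s+\w+(?=\()/

# Hypothetical helper: returns the names that look like calls in `source`.
def called_names(source)
  source.scan(CALL_AFTER_SPACE).map(&:strip)
end

python_source = "def area(w, h):\n    return w * h\ntotal = area(2, 3)\nprint(total)\n"
called_names(python_source)  #=> ["area", "print"]
called_names("x = f(g(1))")  #=> ["f"]
```

Because the pattern requires whitespace before the name, a nested call such as `g` in `f(g(1))` is not reported, and neither is a call at the very start of the string.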

I have assumed that lines that do not contain "def" are of the form "[something]=[zero or more spaces][method name]".
R1 = /
\bdef\b # match 'def' surrounded by word breaks
/x # free-spacing regex definition mode
R2 = /
[^=]+ # match any characters other than '='
= # match '='
\s* # match >= 0 whitespace chars
\K # forget everything matched so far
[a-z_] # match a lowercase letter or underscore
[a-z0-9_]* # match >= 0 lowercase letters, digits or underscores
[!?]? # possibly match '!' or '?'
/x
def match?(str)
(str !~ R1) && str[R2]
end
match?("def method_name1(a, b):") #=> false
match?("y = method_name2(1,2)") #=> "method_name2"
match?("y = method_name") #=> "method_name"
match?("y = method_name?") #=> "method_name?"
match?("y = def method_name") #=> false
match?("y << method_name") #=> nil
I chose to use two regexes to be able to deal with both my first and penultimate examples. Note that the method returns either a method name or a falsy value, but the latter may be either false or nil.
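To apply the same two-regex test to a multi-line string, one option is to check each line and keep only the truthy results. This is a sketch under the same assumption about line format as above; the names `DEF_KEYWORD`, `ASSIGNED_NAME`, `assigned_method_name` and `assigned_method_names` are hypothetical.

```ruby
# Same patterns as R1 and R2 above, written on one line each.
DEF_KEYWORD   = /\bdef\b/
ASSIGNED_NAME = /[^=]+=\s*\K[a-z_][a-z0-9_]*[!?]?/

# Returns a method name, or a falsy value (false or nil).
def assigned_method_name(line)
  (line !~ DEF_KEYWORD) && line[ASSIGNED_NAME]
end

# Checks every line and drops both false and nil results.
def assigned_method_names(source)
  source.each_line.map { |line| assigned_method_name(line.chomp) }.select(&:itself)
end

assigned_method_names("def no_match(a, b):\ny = match(1,2)\nz << other")
#=> ["match"]
```

`select(&:itself)` is used instead of `compact` because `compact` removes only `nil`, not `false`.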

Related

How to find same characters in two random strings? (Ruby)

I am busy working through some problems I have found on the net and I feel like this should be simple but I am really struggling.
Say you have the string 'AbcDeFg' and the next string of 'HijKgLMnn', I want to be able to find the same characters in the string so in this case it would be 'g'.
Perhaps I wasn't giving enough information - I am doing Advent of Code and I am on day 3. I just need help with the first bit which is where you are given a string of characters - you have to split the characters in half and then compare the 2 strings. You basically have to get the common character between the two. This is what I currently have:
file_data = File.read('Day_3_task1.txt')
arr = file_data.split("\n")
finals = []
arr.each do |x|
len = x.length
divided_by_two = len / 2
second = x.slice!(divided_by_two..len).split('')
first = x.split('')
count = 0
(0..len).each do |z|
first.each do |y|
if y == second[count]
finals.push(y)
end
end
count += 1
end
end
finals = finals.uniq
Hope that helps in terms of clarity :)
Did you try to convert both strings to arrays with the String#chars method and find the intersection of those arrays?
Like this:
string_one = 'AbcDeFg'.chars
string_two = 'HijKgLMnn'.chars
string_one & string_two # => ["g"]
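Applied to the task described in the question (split one line in half, then compare the halves), the intersection idea might look like the sketch below. The helper name `common_in_halves` is hypothetical, and the sample line is simply the two strings from the question, trimmed to equal length and joined.

```ruby
# Hypothetical helper: characters that occur in both halves of `line`.
def common_in_halves(line)
  half   = line.length / 2
  first  = line[0, half]     # first half
  second = line[half..-1]    # second half (one char longer for odd lengths)
  first.chars & second.chars # Array#& also removes duplicates
end

common_in_halves("AbcDeFgHijKgLM") #=> ["g"]
common_in_halves("abcdxyaz")       #=> ["a"]
```

Since `Array#&` already returns unique elements, no counters or nested loops are needed.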
One way to do that is to use the method String#scan with the regular expression
rgx = /(.)(?!.*\1.*_)(?=.*_.*\1)/
I'm not advocating this approach. I merely thought some readers might find it interesting.
Suppose
str1 = 'AbcDgeFg'
str2 = 'HijKgLMnbn'
Now form the string
str = "#{str1}_#{str2}"
#=> "AbcDgeFg_HijKgLMnbn"
I've assumed the strings contain letters only, in which case they are separated in str with any character other than a letter. I've used an underscore. Naturally, if the strings could contain underscores a different separator would have to be used.
We then compute
str.scan(rgx).flatten
#=> ["b", "g"]
Array#flatten is needed because
str.scan(rgx)
#=>[["b"], ["g"]]
The regular expression can be written in free-spacing mode to make it self-documenting:
rgx =
/
(.) # match any character, save it to capture group 1
(?! # begin a negative lookahead
.* # match zero or more characters
\1 # match the contents of capture group 1
.* # match zero or more characters
_ # match an underscore
) # end the negative lookahead
(?= # begin a positive lookahead
.* # match zero or more characters
_ # match an underscore
.* # match zero or more characters
\1 # match the contents of capture group 1
) # end the positive lookahead
/x # invoke free-spacing regex definition mode
Note that if a character appears more than once in str1 and at least once in str2 the negative lookahead ensures that only the last one in str1 is matched, to avoid returning duplicates.
Alternatively, one could write
str.gsub(rgx).to_a
This uses the (fourth) form of String#gsub, which takes a single argument and no block and returns an enumerator.
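For comparison, the sketch below runs the regex approach next to a plain array intersection on the same input. The helper names are hypothetical, and the underscore is assumed not to occur in either string.

```ruby
COMMON_CHAR = /(.)(?!.*\1.*_)(?=.*_.*\1)/

# Regex approach: join with "_" and collect the matches.
def common_via_regex(str1, str2)
  "#{str1}_#{str2}".gsub(COMMON_CHAR).to_a
end

# Array approach: intersect the character arrays.
def common_via_intersection(str1, str2)
  str1.chars & str2.chars
end

common_via_regex('AbcDgeFg', 'HijKgLMnbn')        #=> ["b", "g"]
common_via_intersection('AbcDgeFg', 'HijKgLMnbn') #=> ["b", "g"]
```

The two can differ in order when a character repeats in the first string: the regex keeps the last occurrence, the intersection keeps the first. For `'gbg'` and `'bg'` the regex version returns `["b", "g"]` while the intersection returns `["g", "b"]`.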

How to use Regexp.union to match a character at the beginning of my string

I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more digits, but my regexes are failing to match any of these:
2.4.0 :017 > MY_TOKENS = ["a", "b"]
=> ["a", "b"]
2.4.0 :018 > str = "40"
=> "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
=> nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
=> nil
I'm stumped as to what I'm doing wrong.
If they are single characters, just use MY_TOKENS.join inside the character class:
MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ first_regex
# 0
You can also interpolate the Regexp.union result directly, but this might lead to unexpected bugs, because the flags of the outer regexp won't apply to the inner one:
second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0
The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.
You could use Regexp#source to avoid this behaviour :
third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0
or simply build your regex union :
fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0
The 3 last examples should work fine if MY_TOKENS has words instead of just characters.
first_regex, third_regex and fourth_regex should all work fine with /i flag.
As an example :
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0
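The point about outer flags not reaching an interpolated Regexp can be seen in a small sketch. The constant names are assumptions made for this example; `TOKENS` holds the same values as `MY_TOKENS`.

```ruby
TOKENS = %w[a b].freeze
UNION  = Regexp.union(TOKENS)

# Interpolating the Regexp itself embeds its own flag group, (?-mix:...).
EMBEDDED   = /^#{UNION}?[[:space:]]*\d+$/i
# Interpolating only the source lets the outer /i apply.
VIA_SOURCE = /^(?:#{UNION.source})?[[:space:]]*\d+$/i

EMBEDDED.source           #=> "^(?-mix:a|b)?[[:space:]]*\\d+$"
EMBEDDED.match?("a 40")   #=> true
EMBEDDED.match?("A 40")   #=> false, /i is switched off inside (?-mix:...)
VIA_SOURCE.match?("A 40") #=> true
```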
I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.
Then you need to use
Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)
or
/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)
When you use Regexp.new with a double-quoted string, remember to double the backslashes to define a literal backslash (e.g. "\\d" is a digit matching pattern). In regex literal notation, a single backslash is enough (/\d/).
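The effect of the missing backslash is easy to inspect with Regexp#source. A minimal sketch, with constant names chosen only for this example:

```ruby
# In a double-quoted string, "\d" is not a known escape and collapses to "d".
LOST_BACKSLASH = Regexp.new("^\d+$")
# Doubling the backslash, or using single quotes, keeps it.
DOUBLE_ESCAPED = Regexp.new("^\\d+$")
SINGLE_QUOTED  = Regexp.new('^\d+$')

LOST_BACKSLASH.source        #=> "^d+$"
LOST_BACKSLASH.match?("40")  #=> false
LOST_BACKSLASH.match?("ddd") #=> true
DOUBLE_ESCAPED.match?("40")  #=> true
SINGLE_QUOTED.match?("40")   #=> true
```

This is why all three attempts in the question returned nil for "40": each pattern was looking for one or more literal "d" characters.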
Do not forget to match the start of a string with \A and end of string with \z anchors.
Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.
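To illustrate the difference between a character class and an alternation when the tokens are longer than one character, here is a sketch with made-up tokens. `WORD_TOKENS`, `AS_CLASS` and `AS_GROUP` are hypothetical names, and the group is wrapped in `(?:...)` so the `?` applies to the whole alternation.

```ruby
WORD_TOKENS = ["ab", "c."].freeze # hypothetical multi-character tokens

# Character class: the tokens dissolve into single characters, [abc.]
AS_CLASS = /\A[#{WORD_TOKENS.join}]?[[:space:]]*\d+\z/
# Alternation: each token is matched as a whole; Regexp.union escapes the "."
AS_GROUP = /\A(?:#{Regexp.union(WORD_TOKENS).source})?[[:space:]]*\d+\z/

AS_CLASS.match?("ab 40") #=> false, the class consumes only one character
AS_CLASS.match?("a 40")  #=> true
AS_GROUP.match?("ab 40") #=> true
AS_GROUP.match?("c. 40") #=> true
AS_GROUP.match?("cx 40") #=> false, "." is literal here
AS_GROUP.match?("40")    #=> true, the token is optional
```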
To make the regex case insensitive, pass a case insensitive modifier to the pattern and make sure you use the .source property of the regex created by Regexp.union to remove its flags (thanks, Eric). Since the source is a bare a|b, wrap it in a non-capturing group so that the ? and the anchors apply to the whole alternation:
Regexp.new("(?i)\\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\\d+\\z")
or
/\A(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+\z/i
The string form of the latter is (?i-mx:\A(?:a|b)?[[:space:]]*\d+\z), where (?i-mx:...) means the case insensitive mode is on and the multiline (dot matches line breaks) and verbose modes are off.

Matching strings that contain a letter with the first character not being a number

How do I write a regular expression that has at least one letter, but the first character must not be a number? I tried this
str = "a"
str =~ /^[^\d][[:space:]]*[a-z]*/i
# => 0
str = "="
str =~ /^[^\d][[:space:]]*[a-z]*/i
# => 0
The "=" is matched even though it contains no letters. I expect "a" to match, whereas a string like "3abcde" should not match.
The [a-z]* and [[:space:]]* patterns can match an empty string, so they do not really make any difference when validating is necessary. Also, = is not a digit, it is matched with [^\d] negated character class that is a consuming type of pattern. It means it requires a character other than a digit in the string.
You may rely on a lookahead that will restrict the start of string position:
/\A(?!\d).*[a-z]/im
Or even a bit faster and Unicode-friendly version:
/\A(?!\d)\P{L}*\p{L}/
Details:
\A - start of a string
(?!\d) - the first char cannot be a digit
\P{L}* - 0 or more (*) chars other than letters
or
.* - any 0+ chars (including line breaks if the /m modifier is used)
\p{L} - a letter
The m modifier enables the . to match line break chars in a Ruby regex.
Use [a-z] when you need to restrict the letters to those in ASCII table only. Also, \p{L} may be replaced with [[:alpha:]] and \P{L} with [^[:alpha:]].
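A short check of both patterns against the strings from the question, plus two extra inputs that are assumptions added for illustration (a multi-line string and one with a non-ASCII letter). The constant names are hypothetical.

```ruby
ASCII_LETTER_RULE   = /\A(?!\d).*[a-z]/im
UNICODE_LETTER_RULE = /\A(?!\d)\P{L}*\p{L}/

ASCII_LETTER_RULE.match?("a")           #=> true
ASCII_LETTER_RULE.match?("=")           #=> false, no letter at all
ASCII_LETTER_RULE.match?("3abcde")      #=> false, starts with a digit
ASCII_LETTER_RULE.match?("=\n1b")       #=> true, /m lets . cross the line break

# "\u00e9" is the letter e with an acute accent
ASCII_LETTER_RULE.match?("=\u00e9")     #=> false, not in a-z
UNICODE_LETTER_RULE.match?("=\u00e9")   #=> true
UNICODE_LETTER_RULE.match?("3abcde")    #=> false
```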
If two regular expressions were permitted you could write:
def pass_de_test?(str)
str[0] !~ /\d/ && str =~ /[[:alpha:]]/
end
pass_de_test?("*!\n?a>") #=> 4 (truthy)
pass_de_test?("3!\n?a>") #=> false
If you want true or false returned, change the operative line to:
(str[0] !~ /\d/ && str =~ /[[:alpha:]]/) ? true : false
or
!!(str[0] !~ /\d/ && str =~ /[[:alpha:]]/)

Renaming files by string from array?

I have an array of string pairs.
For example: [["vendors", "users"], ["jobs", "venues"]]
I have a list of files within a directory:
folder/
-478_accounts
-214_vendors
-389_jobs
I need somehow to rename files with the second value from subarrays so it would look like this:
folder/
-478_accounts
-214_users
-389_venues
How do I resolve the problem?
folder = %w| -478_accounts -214_vendors -389_jobs |
#=> ["-478_accounts", "-214_vendors", "-389_jobs"]
h = [["vendors", "users"], ["jobs", "venues"]].to_h
#=> {"vendors"=>"users", "jobs"=>"venues"}
r = Regexp.union(h.keys)
folder.each { |f| File.rename(f, f.sub(r,h)) if f =~ r }
I've used the form of String#sub that employs a hash to make the substitution.
You might want to refine the regex to require the string to be replaced to follow an underscore and be at the end of the string.
r = /
(?<=_) # match an underscore in a positive lookbehind
#{Regexp.union(h.keys)} # match one of the keys of `h`
\z # match end of string
/x # free-spacing regex definition mode
#=> /
# (?<=_) # match an underscore in a positive lookbehind
# (?-mix:vendors|jobs) # match one of the keys of `h`
# \z # match end of string
# /x
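Before calling File.rename, it may be worth computing the planned renames as plain data so they can be reviewed first. The sketch below does only that and never touches the filesystem; `RENAMES`, `SUFFIX` and `rename_plan` are hypothetical names, and the file names are written without the leading dash.

```ruby
RENAMES = { "vendors" => "users", "jobs" => "venues" }.freeze
# A key must follow an underscore and sit at the end of the name.
SUFFIX = /(?<=_)#{Regexp.union(RENAMES.keys)}\z/

# Hypothetical helper: maps each affected old name to its new name.
def rename_plan(names)
  names.each_with_object({}) do |name, plan|
    plan[name] = name.sub(SUFFIX, RENAMES) if name =~ SUFFIX
  end
end

rename_plan(%w[478_accounts 214_vendors 389_jobs jobs_389])
#=> {"214_vendors"=>"214_users", "389_jobs"=>"389_venues"}
```

Once the plan looks right, each pair can be passed to File.rename(old_name, new_name).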
You don't have to use a regex.
keys = h.keys
folder.each do |f|
prefix, sep, suffix = f.partition('_')
File.rename(f, prefix+sep+h[suffix]) if sep == '_' && keys.include?(suffix)
end

Capitalize the first character after a dash

So I've got a string that's an improperly formatted name. Let's say, "Jean-paul Bertaud-alain".
I want to use a regex in Ruby to find the first character after every dash and make it uppercase. So, in this case, I want to apply a method that would yield: "Jean-Paul Bertaud-Alain".
Any help?
String#gsub can take a block argument, so this is as simple as:
str = "Jean-paul Bertaud-alain"
str.gsub(/-[a-z]/) {|s| s.upcase }
# => "Jean-Paul Bertaud-Alain"
Or, more succinctly:
str.gsub(/-[a-z]/, &:upcase)
Note that the regular expression /-[a-z]/ only matches letters in the a-z range, so it won't match e.g. à. Before Ruby 2.4, String#upcase did not attempt to capitalize characters with diacritics anyway; later versions do, but capitalization remains language-dependent (e.g. i is capitalized differently in Turkish than in English). Read this answer for more information: https://stackoverflow.com/a/4418681
"Jean-paul Bertaud-alain".gsub(/(?<=-)\w/, &:upcase)
# => "Jean-Paul Bertaud-Alain"
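If the names may contain accented letters, a variant using the Unicode property \p{Ll} (any lowercase letter) is one option. This sketch assumes Ruby 2.4 or later, where String#upcase handles non-ASCII letters; the helper name `capitalize_after_dash` and the second sample name are made up for illustration.

```ruby
# Assumes Ruby 2.4+, where String#upcase applies Unicode case mapping.
def capitalize_after_dash(name)
  # (?<=-) requires a preceding dash; \p{Ll} is any lowercase letter
  name.gsub(/(?<=-)\p{Ll}/, &:upcase)
end

capitalize_after_dash("Jean-paul Bertaud-alain") #=> "Jean-Paul Bertaud-Alain"
capitalize_after_dash("Marie-\u00e8ve")          #=> "Marie-Ève"
```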
I suggest you make the test more demanding by requiring that the letter to be upcased 1) be preceded by a capitalized word followed by a hyphen and 2) be followed by lowercase letters followed by a word break.
r = /
\b # Match a word break
[A-Z] # Match an upper-case letter
[a-z]+ # Match >= 1 lower-case letters
\- # Match a hyphen
\K # Forget everything matched so far
[a-z] # Match a lower-case letter
(?= # Begin a positive lookahead
[a-z]+ # Match >= 1 lower-case letters
\b # Match a word break
) # End positive lookahead
/x # Free-spacing regex definition mode
"Jean-paul Bertaud-alain".gsub(r) { |s| s.upcase }
#=> "Jean-Paul Bertaud-Alain"
"Jean de-paul Bertaud-alainM".gsub(r) { |s| s.upcase }
#=> "Jean de-paul Bertaud-alainM"
