Ruby regular expressions and bracket. What do the brackets do? - ruby

I am going through the Peter Cooper book "Beginning Ruby" and I have some questions regarding some of the string methods and regular expression usage. I think I'm clear on what a regular expression is: "a string that describes a pattern for matching elements in other strings."
So:
"This is a test".scan(/\w\w/) {|x| puts x}
Output:
Th
is
is
te
st
=> "This is a test"
So it prints two characters at a time. I didn't realize it also returns the original string. Why is this?
Also,
"This is a test".scan(/[aeiou]/) { |x| puts x }
What do the brackets do? I think they are called character classes, but I am not sure exactly what they do. The explanation in Cooper's book isn't totally verbose and clear.
Explanation of character classes:
"The last important aspect of regular expressions you need to understand at this stage is
character classes. These allow you to match against a specific set of characters. For example, you can scan through all the vowels in a string:"

Yes, it is called a character class.
A character class defines a set of characters and says, "match one character from this set". There are two forms: the positive class [ ] and the negated class [^ ]. A positive class lists characters, any one of which may appear at that position for a match to occur; a negated class lists characters that must NOT appear at that position, so it matches any single character outside the list.
Explanation of your character class:
[aeiou] # any character of: 'a', 'e', 'i', 'o', 'u'
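As a small illustration of the positive and negated forms (the sample word is just an assumption for the demo), each match below is a single character:

```ruby
word = "regular"

# Positive class: each match is one character from the listed set.
vowels = word.scan(/[aeiou]/)

# Negated class: each match is one character that is NOT in the listed set.
others = word.scan(/[^aeiou]/)

p vowels  # => ["e", "u", "a"]
p others  # => ["r", "g", "l", "r"]
```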

The scan method usually returns an array of the matches, but it optionally accepts a block, in which case each match is passed to the block (much like calling each on the resulting array) and the original string is returned.
Here is the documentation: http://www.ruby-doc.org/core-2.1.3/String.html#method-i-scan
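A minimal sketch of the two calling styles, using the string from the question (the variable names are only illustrative):

```ruby
text = "This is a test"

# Without a block, scan returns an array of every match.
pairs = text.scan(/\w\w/)
p pairs  # => ["Th", "is", "is", "te", "st"]

# With a block, each match is yielded and the original string comes back.
collected = []
result = text.scan(/\w\w/) { |pair| collected << pair }
p result     # => "This is a test"
p collected  # => ["Th", "is", "is", "te", "st"]
```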
To the second question, @hwnd already gave you a clear answer. The best way to learn this is to experiment; regex101.com is the online tool I usually use. It lists explanations for all your matching elements, so it's a wonderful learning resource too.
Some things you might like to try:
123abab12ab1234 with pattern [123]
123abab12ab1234 with pattern [ab]+
123abab12ab1234 with pattern b[1|a]

One thing to remember is that a character class matches ONE character, for example:
str = 'XXXaeiouXXX'
puts str
str.sub!(/[aeiou]/, '.')
puts str
--output:--
XXXaeiouXXX
XXX.eiouXXX
A character class says, "Match this character OR this character OR this character...ONE TIME ".
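To see the "one character" rule next to its alternatives, here is a short sketch that compares a bare class, a class with a + quantifier, and gsub (the non-destructive sub is used so the original string stays intact):

```ruby
str = "XXXaeiouXXX"

one_char = str.sub(/[aeiou]/, ".")    # replaces the first single vowel only
one_run  = str.sub(/[aeiou]+/, ".")   # replaces one run of consecutive vowels
every    = str.gsub(/[aeiou]/, ".")   # replaces every vowel, one at a time

puts one_char  # XXX.eiouXXX
puts one_run   # XXX.XXX
puts every     # XXX.....XXX
```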
Also check out rubular:
http://rubular.com/
I didn't realize it also returns the original string. Why is this?
So that you can chain methods together:
my_str.scan(...).downcase.capitalize.each_char {|char| puts char}.upcase.chomp

Related

I want to match all punctuation in my regexp except apostrophes. How do i do that in Ruby?

This is my code so far:
def alternate_words(string)
string.gsub(/[\p{P}]/, "")
end
I am looking for a way to add exceptions to my regular expressions. Is it possible or do I have to list them all out?
string = "jack. o'reilly? mike??!?"
puts string.gsub(/[\p{P}&&[^']]/, '')
# => jack o'reilly mike
Docs:
A character class may contain another character class. By itself this isn’t useful because [a-z[0-9]] describes the same set as [a-z0-9]. However, character classes also support the && operator which performs set intersection on its arguments.
So, [\p{P}&&[^']] is "any character that is punctuation and also not an apostrophe".
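The same intersection idea works for any pair of sets. A small sketch, with sample strings chosen only for illustration:

```ruby
# Lowercase letters that are also not vowels, i.e. consonants.
consonants = "hello world".scan(/[a-z&&[^aeiou]]/)
p consonants  # => ["h", "l", "l", "w", "r", "l", "d"]

# Punctuation except apostrophes and hyphens: add more exceptions
# to the inner negated class instead of listing every character to remove.
cleaned = "well-known, isn't it?".gsub(/[\p{P}&&[^'\-]]/, "")
puts cleaned  # well-known isn't it
```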

opposite of sub in ruby

I want to replace the content (or delete it) that does not match with my filter.
I think the perfect description would be an opposite sub. I cannot find anything similar in the docs, and I'm not sure how to invert the regex, but I think a method would probably be the more convenient.
An example of how it would work (I've just changed the words to make it more clear)
"bird.cats.dogs".opposite_sub(/(dogs|cats)\.(dogs|cats)/, '')
#"cats.dogs"
I hope it's easy enough to understand.
Thanks in advance.
String#[] can take a regular expression as its parameter:
▶ "bird.cats.dogs"[/(dogs|cats)\.(dogs|cats)/]
#⇒ "cats.dogs"
For multiple matches one can use String#scan:
▶ "bird.cats.dogs.bird.cats.dogs".scan /(?:dogs|cats)\.(?:dogs|cats)/
#⇒ ["cats.dogs", "cats.dogs"]
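The non-capturing groups (?:...) matter here: when the pattern contains capturing groups, scan returns the groups rather than the whole match. A quick comparison:

```ruby
s = "bird.cats.dogs.bird.cats.dogs"

# Capturing groups: each result is an array of the captured groups.
with_groups = s.scan(/(dogs|cats)\.(dogs|cats)/)

# Non-capturing groups: each result is the whole matched substring.
without_groups = s.scan(/(?:dogs|cats)\.(?:dogs|cats)/)

p with_groups     # => [["cats", "dogs"], ["cats", "dogs"]]
p without_groups  # => ["cats.dogs", "cats.dogs"]
```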
So you want to extract the part that matches your regex?
You can use String#slice, for example:
"bird.cats.dogs".slice(/(dogs|cats)\.(dogs|cats)/)
#=> "cats.dogs"
And String#[] does the same.
"bird.cats.dogs"[/(dogs|cats)\.(dogs|cats)/]
#=> "cats.dogs"
You cannot have a single replacement string because the part of the string that matches the regex might not be at the beginning or end of the string, in which case it's not clear whether the replacement string should precede or follow the matching string. I've therefore written the following with two replacement strings, one for pre-match, the other for post_match. I've made this a method of the String class as that's what you've asked for (though I've given the method a less-perfect name :-) )
class String
def replace_non_matching(regex, replace_before, replace_after)
first, match, last = partition(regex)
replace_before + match + replace_after
end
end
r = /(dogs|cats)\.(dogs|cats)/
"birds.cats.dogs.pigs".replace_non_matching(r, "", "")
#=> "cats.dogs"
"birds.cats.dogs".replace_non_matching(r, "snakes.", ".hens")
#=> "snakes.cats.dogs.hens"
"birds.cats.dogs.mice.cats.dogs.bats".replace_non_matching(r, "snakes.", ".hens")
#=> "snakes.cats.dogs.hens"
Regarding the last example, the method could be modified to replace "birds.", ".mice." and ".bats", but in that case three replacement strings would be needed. In general, determining in advance the number of replacement strings needed could be problematic.
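If a single replacement string for every non-matching stretch is acceptable, one possible sketch is to walk the matches and emit the replacement for each gap between them. The method name replace_each_non_match is hypothetical and not part of the answer above:

```ruby
# Hypothetical helper: keeps every match of regex and replaces each
# non-matching stretch (before, between and after matches) with replacement.
def replace_each_non_match(str, regex, replacement)
  out = +""
  pos = 0
  str.scan(regex) do
    m = Regexp.last_match
    out << replacement if m.begin(0) > pos  # gap before this match
    out << m[0]
    pos = m.end(0)
  end
  out << replacement if pos < str.length    # trailing gap
  out
end

r = /(dogs|cats)\.(dogs|cats)/
puts replace_each_non_match("birds.cats.dogs.mice.cats.dogs.bats", r, "-")
# -cats.dogs-cats.dogs-
puts replace_each_non_match("bird.cats.dogs", r, "")
# cats.dogs
```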

Use regular expression to fetch 3 groups from string

This is my expected result.
Given an input string, I want to get three strings back.
I have no idea how to do this with a regex in Ruby.
This is my rough idea:
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2241, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates the start of a line, and the $ indicates the end of a line (in Ruby these anchors are line-based; use \A and \z to anchor the whole string). \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
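A minimal sketch of the globals approach, using the sample input from the question (the variable names are illustrative):

```ruby
input = "R2241_OOP2003"

# =~ sets $1, $2, $3 to the text captured by the 1st, 2nd and 3rd group.
if input =~ /^(R\d+)_(\S+)(\d{4})$/
  parts = [$1, $2, $3]
end

p parts  # => ["R2241", "OOP", "2003"]

# The same values are available from the MatchData object without globals.
captures = input.match(/^(R\d+)_(\S+)(\d{4})$/).captures
p captures  # => ["R2241", "OOP", "2003"]
```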
To use the named captures, include ?<name> right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "OO"
puts my_match[4] #=> "2003"
A MatchData object exposes each capture group by index, starting at [1]. As you can see, index [0] returns the entire match. If you don't want to capture the "_", you can leave its parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
Here .* means zero or more of any character (not one or more), and the trailing ? makes the quantifier lazy, so it takes as few characters as possible and can even match an empty string.
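A quick way to check what .*? actually does is to compare it with the greedy .* on the same input. This is only a sketch with simplified patterns, not the pattern from the question:

```ruby
s = "R224_OO2003"

lazy   = s.match(/(.*?)(\d+)/)   # .*? grabs as little as possible
greedy = s.match(/(.*)(\d+)/)    # .* grabs as much as possible

p lazy.captures    # => ["R", "224"]
p greedy.captures  # => ["R224_OO200", "3"]

# .*? is allowed to match nothing at all.
p "".match(/\A(.*?)\z/)[1]  # => ""
```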

How do I write a regular expression that will match characters in any order?

I'm trying to write a regular expressions that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
See it here on Regexr
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
$
The only special thing is the lookahead assertion, which ensures the character is not repeated.
^ and $ are anchors to match the start and the end of the string (in Ruby, of each line; \A and \z anchor the whole string).
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by @georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
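Both lookahead-based patterns can be checked against the words from the question. The sketch below adds two extra inputs with repeated letters (aac and ccc) as an assumption, to exercise the no-repeat rule:

```ruby
per_char_check = /^(?:([act])(?!.*\1)){3}$/
two_lookaheads = /^(?=[act]{3}$)(?!.*(.).*\1).*$/

words = %w[cat act tca atc tac cta ca ac cata aac ccc]

accepted_a = words.select { |w| w =~ per_char_check }
accepted_b = words.select { |w| w =~ two_lookaheads }

p accepted_a  # => ["cat", "act", "tca", "atc", "tac", "cta"]
p accepted_b  # => ["cat", "act", "tca", "atc", "tac", "cta"]
```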
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /\b(?:act|atc|cat|cta|tac|tca)\b/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; it works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
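Combining the two ideas, a pattern can be built from a hash's keys and the same hash passed to gsub. The hash contents here are made up for illustration:

```ruby
# Hypothetical substitution table: keys are the text to find.
subs = { "cat" => "dog", "red" => "blue" }

# Regexp.union turns the keys into one alternation: /cat|red/
pattern = Regexp.union(subs.keys)

# gsub looks each matched string up in the hash to get its replacement.
result = "the red cat".gsub(pattern, subs)

p pattern  # => /cat|red/
puts result  # the blue dog
```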
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
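A short sketch of how the embedded flags can bite: interpolating a Regexp into another regex literal uses to_s, so the inner pattern keeps its own flags and ignores the outer ones.

```ruby
inner = /cat/

via_to_s   = /#{inner}/i          # interpolation calls to_s on the Regexp
via_source = /#{inner.source}/i   # only the bare pattern text is embedded

p via_to_s.source    # => "(?-mix:cat)"
p via_source.source  # => "cat"

# The outer /i flag is switched off again inside (?-mix:...).
p via_to_s.match?("CAT")    # => false
p via_source.match?("CAT")  # => true
```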
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my @set = $perm->next) {
$regex_assembler->add(join('', @set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like @scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
/\b(#{Regexp.union("act".split('').permutation.map(&:join)).source})\b/
=> /\b(act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on @theTinMan's feedback.

Understanding how pattern matching works in Ruby 2

I don't know how pattern matching works in Ruby 2.
I have the following value, targetfilename = /mnt/usb/mpeg4Encoded.mpeg4
My pattern matching if-else is thus:
if (targetfilename.match(/^\//))
puts "amit"
else
puts "ramit"
The output is ramit.
I don't understand how this pattern matching works though.
if targetfilename.match(/^V/)
puts "amit"
else
puts "ramit"
end
# result:
# "amit"
Why is this? Because targetfilename.match(/^V/) returns a MatchData object, which contains all of the information about the match. If there is no match, no MatchData object is returned, because there's nothing to return. Instead, you get nil.
When if is given nil, it treats it the same way as false.
Basically, any "actual" value (besides false) is treated the same way as true. Basically, it's asking
if (there's anything here)
do_this
else
do_something_else
end
Again, let me reiterate:
If the thing after if is either false or nil, the if statement resolves to the "else".
If it's anything else, it resolves as if it had gotten a "true" statement.
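The rule can be tried out directly. In this sketch the path is the value from the question written as a string, and describe is a hypothetical helper lambda:

```ruby
path = "/mnt/usb/mpeg4Encoded.mpeg4"

hit  = path.match(/^\//)  # leading slash is present: MatchData
miss = path.match(/^V/)   # no leading "V": nil

# Hypothetical helper: reports how `if` would treat a value.
describe = ->(value) { value ? "truthy" : "falsy" }

puts describe.call(hit)    # truthy
puts describe.call(miss)   # falsy
puts describe.call(false)  # falsy
puts describe.call(0)      # truthy (only nil and false are falsy)
puts describe.call("")     # truthy
```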
Regular Expressions
/^V/ is what is called a "Regular Expression"; the // is a Regexp literal the same way that the "" is a String literal, and Regexps are represented by the Regexp class the same way that strings are represented by the String class.
The actual "regular expression" is what's between the slashes -- ^V. This is saying:
^: the start of a line (in Ruby; use \A for the start of the whole string)
V: a capital letter V
So, /^V/ will match any cases of the capital letter "V" at the beginning of a string.
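One caveat worth knowing: in Ruby, ^ anchors at the start of each line, so on multi-line text it can match in the middle of the string, while \A only matches at the very beginning. A small sketch with a made-up two-line string:

```ruby
text = "first line\nVery second line"

line_start   = text =~ /^V/   # "V" begins the second line
string_start = text =~ /\AV/  # "V" is not at the start of the string

p line_start    # => 11
p string_start  # => nil
```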
What else can you put in a regular expression? What are the special characters? Try this regexp cheat sheet
Also, some great tools:
Rubular -- enter your regular expression and some sample text, and see what matches.
Strfriend -- enter in a regular expression and see it "visually" represented.
