Split by multiple delimiters - ruby

I'm receiving a string that contains two numbers in a handful of different formats:
"344, 345", "334,433", "345x532" and "432 345"
I need to split them into two separate numbers in an array using split, and then convert them using Integer(num).
What I've tried so far:
nums.split(/[\s+,x]/) # split on one or more spaces, a comma or x
However, it doesn't seem to match multiple spaces when testing. Also, it doesn't allow a space in the comma version shown above ("344, 345").
How can I match multiple delimiters?

You are using a character class in your pattern, and it matches only one character. [\s+,x] matches 1 whitespace, or a +, , or x. You meant to use (?:\s+|x).
However, perhaps, a mere \D+ (1 or more non-digit characters) should suffice:
"345, 456".split(/\D+/).map(&:to_i)

R1 = Regexp.union([", ", ",", "x", " "])
#=> /,\ |,|x|\ /
R2 = /\A\d+#{R1}\d+\z/
#=> /\A\d+(?-mix:,\ |,|x|\ )\d+\z/
def split_it(s)
return nil unless s =~ R2
s.split(R1).map(&:to_i)
end
split_it("344, 345") #=> [344, 345]
split_it("334,433") #=> [334, 433]
split_it("345x532") #=> [345, 532]
split_it("432 345") #=> [432, 345]
split_it("432&345") #=> nil
split_it("x32 345") #=> nil

Your original regex would work with a minor adjustment to move the '+' symbol outside the character class:
"344 ,x 345".split(/[\s,x]+/).map(&:to_i) #==> [344,345]
If the examples are actually the only formats that you'll encounter, this will work well. However, if you have to be more flexible and accommodate unknown separators between the numbers, you're better off with the answer given by Wiktor:
"344 ,x 345".split(/\D+/).map(&:to_i) #==> [344,345]
Both cases will return an array of Integers from the inputs given, however the second example is both more robust and easier to understand at a glance.

it doesn't seem to match multiple spaces when testing
Yeah, character class (square brackets) doesn't work like this. You apply quantifiers on the class itself, not on its characters. You could use | operator instead. Something like this:
.split(%r[\s+|,\s*|x])

Related

Use regular expression to fetch 3 groups from string

This is my expected result.
Input a string and get three returned string.
I have no idea how to finish it with Regex in Ruby.
this is my roughly idea.
match(/(.*?)(_)(.*?)(\d+)/)
Input and expected output
# "R224_OO2003" => R224, OO, 2003
# "R2241_OOP2003" => R2244, OOP, 2003
If the example description I gave in my comment on the question is correct, you need a very straightforward regex:
r = /(.+)_(.+)(\d{4})/
Then:
"R224_OO2003".scan(r).flatten #=> ["R224", "OO", "2003"]
"R2241_OOP2003".scan(r).flatten #=> ["R2241", "OOP", "2003"]
Assuming that your three parts consist of (R and one or more digits), then an underbar, then (one or more non-whitespace characters), before finally (a 4-digit numeric date), then your regex could be something like this:
^(R\d+)_(\S+)(\d{4})$
The ^ indicates start of string, and the $ indicates end of string. \d+ indicates one or more digits, while \S+ says one or more non-whitespace characters. The \d{4} says exactly four digits.
To recover data from the matches, you could either use the pre-defined globals that line up with your groups, or you could could use named captures.
To use the match globals just use $1, $2, and $3. In general, you can figure out the number to use by counting the left parentheses of the specific group.
To use the named captures, include ? right after the left paren of a particular group. For example:
x = "R2241_OOP2003"
match_data = /^(?<first>R\d+)_(?<second>\S+)(?<third>\d{4})$/.match(x)
puts match_data['first'], match_data['second'], match_data['third']
yields
R2241
OOP
2003
as expected.
As long as your pattern covers all possibilities, then you just need to use the match object to return the 3 strings:
my_match = "R224_OO2003".match(/(.*?)(_)(.*?)(\d+)/)
#=> #<MatchData "R224_OO2003" 1:"R224" 2:"_" 3:"OO" 4:"2003">
puts my_match[0] #=> "R224_OO2003"
puts my_match[1] #=> "R224"
puts my_match[2] #=> "_"
puts my_match[3] #=> "00"
puts my_match[4] #=> "2003"
A MatchData object contains an array of each match group starting at index [1]. As you can see, index [0] returns the entire string. If you don't want the capture the "_" you can leave it's parentheses out.
Also, I'm not sure you are getting what you want with the part:
(.*?)
this basically says one or more of any single character followed by zero or one of any single character.

How does this regexp split on the first vowel?

This code splits a word into two strings at the first vowel. Why?
word = "banana"
parts = word.split(/([aeiou].*)/)
The key here is the regular expression (or regex) that is being used between the two /'s
[aeiou] says to look for the first instance of one of those characters.
. matches any single character
* modifies the previous thing to mean match 0 or more of it
(...) means capture everything enclosed between the parentheses
Translated to english this regular expression might read something like "Given a string, find the first vowel that is followed by zero or more characters. Collect that vowel and its following characters and set them aside."
The slightly more confusing part is the regex's interaction with the split method. The value the regex returns is 'anana'. And we can see that calling split with 'anana' doesn't have the same result:
'banana'.split('anana') #=> ["b"]
But when split is called with a regular expression that uses a capture group - or parentheses (...), then anything in that capture group will also be returned in the result of the split. Which is why:
'banana'.split /([aeiou].*)/ #=> ["b", "anana"]
If you want to learn more about how regular expressions work (particularly in ruby), Rubular is a great resource to fiddle with - http://www.rubular.com/r/XEUgPhOdlH
This is actually a bit tricky. This regexp
/[aeiou].*/
matches the string from the first vowel to the end of the string i.e. "anana". But if you were to split on that, you would only get the first letter since split doesn't include the splitting pattern:
"banana".split /[aeiou].*/
# ["b"]
But according to the String#split docs, if the splitting pattern is a regexp with a capture group, the capture groups are included in the result as well. Since the whole pattern is wrapped in a capture group, the result is that the string splits before the first vowel.
For example, if you change the regexp to have two capture groups, it splits further:
"banana".split /([aeiou])(.*)/
# ["b", "a", "nana"]
ANSWER FOR OLD TITLE
It's not really a Ruby's syntax, it's a standard Regular Expression's syntax that also implemented by Ruby.
* means zero or more of previous item
. means any character
[aeiou] means any character inside the brace
() means capture it
So that regex means: capture anything that starts with a, e, i, o, or u.
the word.split(/([aeiou].*)/) means, split the word variable based on anything that starts with letter a, e, i, o, or u.
See here fore more information.
ANSWER FOR NEW TITLE
Why does it split on the first vowel? It's not really like that.. What it does is, split by anything that start with vowels and capture it (the string that starts with vowels) also, see more example here:
word = 'banana'
word.split /[aeiou]/ # split by vowels
#=> ["b", "n", "n"]
word.split /([aeiou])/ # split by vowels and capture the vowels
#=> ["b", "a", "n", "a", "n", "a"]
word.split /[aeiou].*/ # split by anything that start with vowels
#=> ["b"]
word.split /([aeiou].*)/ # split by anything that start with vowels and capture the thing that start with vowels also
#=> ["b", "anana"]
ANSWER FOR OLD TITLE
If the * symbol not inside the regular expression // (Ruby's syntax), there are some possibilities:
multiplication 2 * 3 == 6, 'na' * 3 == 'nanana' # batman!
splat operation [*(1..4)] == [1,2,3,4], see more info here

How do I write a regular expression that will match characters in any order?

I'm trying to write a regular expressions that will match a set of characters without regard to order. For example:
str = "act"
str.scan(/Insert expression here/)
would match:
cat
act
tca
atc
tac
cta
but would not match ca, ac or cata.
I read through a lot of similar questions and answers here on StackOverflow, but have not found one that matches my objectives exactly.
To clarify a bit, I'm using ruby and do not want to allow repeat characters.
Here is your solution
^(?:([act])(?!.*\1)){3}$
See it here on Regexr
^ # matches the start of the string
(?: # open a non capturing group
([act]) # The characters that are allowed and a capturing group
(?!.*\1) # That character is matched only if it does not occur once more, Lookahead assertion
){3} # Defines the amount of characters
$
The only special think is the lookahead assertion, to ensure the character is not repeated.
^ and $ are anchors to match the start and the end of the string.
[act]{3} or ^[act]{3}$ will do it in most regular expression dialects. If you can narrow down the system you're using, that will help you get a more specific answer.
Edit: as mentioned by #georgydyer in the comments below, it's unclear from your question whether or not repeated characters are allowed. If not, you can adapt the answer from this question and get:
^(?=[act]{3}$)(?!.*(.).*\1).*$
That is, a positive lookahead to check a match, and then a negative lookahead with a backreference to exclude repeated characters.
Here's how I'd go about it:
regex = /\b(?:#{ Regexp.union(str.split('').permutation.map{ |a| a.join }).source })\b/
# => /(?:act|atc|cat|cta|tac|tca)/
%w[
cat act tca atc tac cta
ca ac cata
].each do |w|
puts '"%s" %s' % [w, w[regex] ? 'matches' : "doesn't match"]
end
That outputs:
"cat" matches
"act" matches
"tca" matches
"atc" matches
"tac" matches
"cta" matches
"ca" doesn't match
"ac" doesn't match
"cata" doesn't match
I use the technique of passing an array into Regexp.union for a lot of things; I works especially well with the keys of a hash, and passing the hash into gsub for rapid search/replace on text templates. This is the example from the gsub documentation:
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
Regexp.union creates a regex, and it's important to use source instead of to_s when extracting the actual pattern being generated:
puts regex.to_s
=> (?-mix:\b(?:act|atc|cat|cta|tac|tca)\b)
puts regex.source
=> \b(?:act|atc|cat|cta|tac|tca)\b
Notice how to_s embeds the pattern's flags inside the string. If you don't expect them you can accidentally embed that pattern into another, which won't behave as you expect. Been there, done that and have the dented helmet as proof.
If you really want to have fun, look into the Perl Regexp::Assemble module available on CPAN. Using that, plus List::Permutor, lets us generate more complex patterns. On a simple string like this it won't save much space, but on long strings or large arrays of desired hits it can make a huge difference. Unfortunately, Ruby has nothing like this, but it is possible to write a simple Perl script with the word or array of words, and have it generate the regex and pass it back:
use List::Permutor;
use Regexp::Assemble;
my $regex_assembler = Regexp::Assemble->new;
my $perm = new List::Permutor split('', 'act');
while (my #set = $perm->next) {
$regex_assembler->add(join('', #set));
}
print $regex_assembler->re, "\n";
(?-xism:(?:a(?:ct|tc)|c(?:at|ta)|t(?:ac|ca)))
See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for more information about using Regexp::Assemble with Ruby.
I will assume several things here:
- You are looking for permutations of given characters
- You are using ruby
str = "act"
permutations = str.split(//).permutation.map{|p| p.join("")}
# and for the actual test
permutations.include?("cat")
It is no regex though.
No doubt - the regex that uses positive/negative lookaheads and backreferences is slick, but if you're only dealing with three characters, I'd err on the side of verbosity by explicitly enumerating the character permutations like #scones suggested.
"act".split('').permutation.map(&:join)
=> ["act", "atc", "cat", "cta", "tac", "tca"]
And if you really need a regex out of it for scanning a larger string, you can always:
Regexp.union "act".split('').permutation.map(&:join)
=> /\b(act|atc|cat|cta|tac|tca)\b/
Obviously, this strategy doesn't scale if your search string grows, but it's much easier to observe the intent of code like this in my opinion.
EDIT: Added word boundaries for false positive on cata based on #theTinMan's feedback.

Remove Duplicate Numbers and Operators from String

I'm trying to take a string which is a simple math expression, remove all spaces, remove all duplicate operators, convert to single digit numbers, and then evaluate.
For example, an string like "2 7+*3*95" should be converted to "2+3*9" and then evaluated as 29.
Here's what I have so far:
expression.slice!(/ /) # Remove whitespace
expression.slice!(/\A([\+\-\*\/]+)/) # Remove operators from the beginning
expression.squeeze!("0123456789") # Single digit numbers (doesn't work)
expression.squeeze!("+-*/") # Removes duplicate operators (doesn't work)
expression.slice!(/([\+\-\*\/]+)\Z/) # Removes operators from the end
puts eval expression
Unfortunately this doesn't make single digit numbers nor remove duplicate operators quite like I expected. Any ideas?
"2 7+*3*95".gsub(/([0-9])[0-9 ]*/, '\1').gsub(/([\+\*\/\-])[ +\*\/\-]+/, '\1')
The first regex handles the single-digit thing and the second handles repeat operators. You could probably condense it into a single regex if you really wanted to.
This works for a quick-and-dirty solution, but you might be better served by a proper parser.
DATA.each { |expr|
expr.gsub!(%r'\s+', '')
expr.gsub!(%r'([*/+-])[*/+-]+', '\1')
expr.gsub!(%r'(\d)\d+', '\1')
expr.sub!(%r'\A[*/+-]+', '')
expr.sub!(%r'[*/+-]+\Z', '')
puts expr + ' = ' + eval(expr).to_s
}
__END__
2 7+*3*95
+-2 888+*3*95+8*-2/+435+-

What does %w(array) mean?

I'm looking at the documentation for FileUtils.
I'm confused by the following line:
FileUtils.cp %w(cgi.rb complex.rb date.rb), '/usr/lib/ruby/1.6'
What does the %w mean? Can you point me to the documentation?
%w(foo bar) is a shortcut for ["foo", "bar"]. Meaning it's a notation to write an array of strings separated by spaces instead of commas and without quotes around them. You can find a list of ways of writing literals in zenspider's quickref.
I think of %w() as a "word array" - the elements are delimited by spaces and it returns an array of strings.
Here are all % literals:
%w() array of strings
%r() regular expression.
%q() string
%x() a shell command (returning the output string)
%i() array of symbols (Ruby >= 2.0.0)
%s() symbol
%() (without letter) shortcut for %Q()
The delimiters ( and ) can be replaced with a lot of variations, like [ and ], |, !, etc.
When using a capital letter %W() you can use string interpolation #{variable}, similar to the " and ' string delimiters. This rule works for all the other % literals as well.
abc = 'a b c'
%w[1 2#{abc} d] #=> ["1", "2\#{abc}", "d"]
%W[1 2#{abc} d] #=> ["1", "2a b c", "d"]
There is also %s that allows you to create any symbols, for example:
%s|some words| #Same as :'some words'
%s[other words] #Same as :'other words'
%s_last example_ #Same as :'last example'
Since Ruby 2.0.0 you also have:
%i( a b c ) # => [ :a, :b, :c ]
%i[ a b c ] # => [ :a, :b, :c ]
%i_ a b c _ # => [ :a, :b, :c ]
# etc...
%W and %w allow you to create an Array of strings without using quotes and commas.
Though it's an old post, the question keep coming up and the answers don't always seem clear to me, so, here's my thoughts:
%w and %W are examples of General Delimited Input types, that relate to Arrays. There are other types that include %q, %Q, %r, %x and %i.
The difference between the upper and lower case version is that it gives us access to the features of single and double quotes. With single quotes and (lowercase) %w, we have no code interpolation (#{someCode}) and a limited range of escape characters that work (\\, \n). With double quotes and (uppercase) %W we do have access to these features.
The delimiter used can be any character, not just the open parenthesis. Play with the examples above to see that in effect.
For a full write up with examples of %w and the full list, escape characters and delimiters, have a look at "Ruby - %w vs %W – secrets revealed!"
Instead of %w() we should use %w[]
According to Ruby style guide:
Prefer %w to the literal array syntax when you need an array of words (non-empty strings without spaces and special characters in them). Apply this rule only to arrays with two or more elements.
# bad
STATES = ['draft', 'open', 'closed']
# good
STATES = %w[draft open closed]
Use the braces that are the most appropriate for the various kinds of percent literals.
[] for array literals(%w, %i, %W, %I) as it is aligned with the standard array literals.
# bad
%w(one two three)
%i(one two three)
# good
%w[one two three]
%i[one two three]
For more read here.
Excerpted from the documentation for Percent Strings at http://ruby-doc.org/core/doc/syntax/literals_rdoc.html#label-Percent+Strings:
Besides %(...) which creates a String, the % may create other types of object. As with strings, an uppercase letter allows interpolation and escaped characters while a lowercase letter disables them.
These are the types of percent strings in ruby:
...
%w: Array of Strings
I was given a bunch of columns from a CSV spreadsheet of full names of users and I needed to keep the formatting, with spaces. The easiest way I found to get them in while using ruby was to do:
names = %(Porter Smith
Jimmy Jones
Ronald Jackson).split("\n")
This highlights that %() creates a string like "Porter Smith\nJimmyJones\nRonald Jackson" and to get the array you split the string on the "\n" ["Porter Smith", "Jimmy Jones", "Ronald Jackson"]
So to answer the OP's original question too, they could have wrote %(cgi\ spaeinfilename.rb;complex.rb;date.rb).split(';') if there happened to be space when you want the space to exist in the final array output.

Resources