I'm trying to parse iCalendar (RFC2445) input using a regex.
Here's a [simplified] example of what the input looks like:
BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT
I'd like to get an array of matches: the "outer" match is each VEVENT block and the inner matches are each of the field:value pairs.
I've tried variants of this:
BEGIN:VEVENT\n((?<field>(?<name>\S+):\s*(?<value>\S+)\n)+?)END:VEVENT
But given the input above, the result seems to have only ONE field for each matching VEVENT, despite the +? on the capture group:
**Match 1**
field def:456
name def
value 456
**Match 2**
field ghi:789
name ghi
value 789
In the first match, I would have expected TWO fields: the abc:123 and the def:456 matches...
I'm sure this is a newbie mistake (since I seem to perpetually be a newbie when it comes to regex's...) - but maybe you can point me in the right direction?
Thanks!
You need to split your regex into two: one matching a VEVENT block and one matching the name/value pairs. You can then use a nested scan to find all occurrences, e.g.:
str.scan(/BEGIN:VEVENT(?<vevent>.+?)END:VEVENT/m) do
  # The value group must be greedy (\S+), otherwise it stops after one character.
  $~[:vevent].scan(/(?<field>(?<name>\S+?):\s*(?<value>\S+))/) do
    p $~[:field], $~[:name], $~[:value]
  end
end
where str is your input. This outputs:
"abc:123"
"abc"
"123"
"def:456"
"def"
"456"
"ghi:789"
"ghi"
"789"
If you want to make the code more readable, I suggest you require 'English' and replace $~ with $LAST_MATCH_INFO.
Use the icalendar gem.
See the Parsing iCalendars section for more info.
You need a nested scan.
string.scan(/^BEGIN:VEVENT\n(.*?)\nEND:VEVENT$/m).each.with_index do |item, i|
puts
puts "**Match #{i+1}**"
item.first.scan(/^(.*?):(.*)$/) do |k, v|
puts "field".ljust(7)+"#{k}:#{v}"
puts "name".ljust(7)+"#{k}"
puts "value".ljust(7)+"#{v}"
end
end
will give:
**Match 1**
field abc:123
name abc
value 123
field def:456
name def
value 456
**Match 2**
field ghi:789
name ghi
value 789
I think the problem is that Ruby's MatchData object, which is what the regexp returns its results in, has no provision for more than one value per group name. So within a single match, each repetition of the group overwrites the previous one, and only the last field survives.
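A minimal sketch of that behaviour, using the first block of the sample input from the question. The variable names are only for illustration, and the pattern is a slightly simplified version of the one in the question.

```ruby
input = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\n"

# The name/value groups sit inside a repeated (non-capturing) group,
# so every repetition overwrites what the previous one captured.
pattern = /BEGIN:VEVENT\n(?:(?<name>\S+):\s*(?<value>\S+)\n)+?END:VEVENT/

md = pattern.match(input)
last_name  = md[:name]   # only the final repetition is kept
last_value = md[:value]

puts "#{last_name}:#{last_value}"
```

Both fields are consumed by the match, but only def:456 can be read back from the MatchData afterwards.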
Ruby has a seldom-used method called slice_before that fits this need well:
'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).to_a
Results in:
[["BEGIN:VEVENT", "abc:123", "def:456", "END:VEVENT"],
["BEGIN:VEVENT", "ghi:789", "END:VEVENT"]]
From there it's simple to grab just the inner array elements:
'BEGIN:VEVENT
abc:123
def:456
END:VEVENT
BEGIN:VEVENT
ghi:789
END:VEVENT'.split("\n").slice_before(/^BEGIN:VEVENT/).map{ |a| a[1 .. -2] }
Which is:
[["abc:123", "def:456"], ["ghi:789"]]
And, from there it's trivial to break up each resulting string using map and split(':').
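That last step could look something like the sketch below. Turning each block into a Hash and passing a limit of 2 to split are my own assumptions, not part of the answer; the limit keeps the value intact in case it contains a colon itself.

```ruby
input = "BEGIN:VEVENT\nabc:123\ndef:456\nEND:VEVENT\nBEGIN:VEVENT\nghi:789\nEND:VEVENT"

# Group the lines into one array per VEVENT block.
blocks = input.split("\n").slice_before(/^BEGIN:VEVENT/)

events = blocks.map do |lines|
  # Drop the BEGIN/END lines, then split each remaining line once on ':'.
  lines[1..-2].map { |line| line.split(':', 2) }.to_h
end

events.each { |event| p event }
```

Each element of events is then a Hash of field names to values, for example events[0]["abc"] is "123".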
Don't be seduced by the siren call of regular expressions trying to do everything. They're very powerful and convenient in their particular place, but often there are simpler and easier to maintain solutions.
I'm not sure where I went wrong, but I created a class called PhoneNumber that takes a parameter ph, which gets converted to a string. So when 1234567890 is passed in, as in a = PhoneNumber.new(1234567890), it should output (123) 456-7890, and when a.area_code is called it should give #123, etc.
Why do none of my scan methods work correctly?
Can I use the split method to pull out only #123 from 1234567890?
class PhoneNumber
  def initialize (ph)
    @ph = ph
    @ph.insert(0, '#')
    @ph.scan(/.{0,1}/).join('( ')
    @ph.scan(/.{3,4}/).join(')')
    @ph.scan(/.{3,5}/).join('- ')
  end
  def to_s
    @ph
  end
  def area_code
    @ph.split(0,2)
  end
end
print "Please enter the number: "
puts a = PhoneNumber.new(gets.strip)
puts a.area_code
The scan calls don't work because the regexes you are using don't do what I think you expect them to do.
.{0,1} matches zero or one character, so scan just returns the string chopped into pieces. Also, neither scan nor join modifies @ph, so the results are thrown away:
@ph.scan(/.{0,1}/) #=> ["#", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", ""]
@ph.scan(/.{3,4}/) #=> ["#123", "4567", "890"]
@ph.scan(/.{3,5}/) #=> ["#1234", "56789"]
One possible fix is to use indexes to split @ph into three parts. Alternatively, you can use something like this to split the number into groups:
@ph.scan(/(\d{3})(\d{3})(\d{4})/) #=> [["123", "456", "7890"]]
In @ph.split(0,2), the first argument to split must be a String or Regexp.
If the first character of @ph is #, you can define area_code something like this:
def area_code
  @ph[1, 3]
end
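Putting those pieces together, a working version of the class might look like the sketch below. It is only one possible arrangement: the @digits variable and the parts helper are names I made up, and stripping non-digits up front is an assumption about the input.

```ruby
class PhoneNumber
  def initialize(ph)
    # Keep only the digits, so both 1234567890 and "123-456-7890" are accepted.
    @digits = ph.to_s.gsub(/\D/, '')
  end

  # Area code, middle three digits and last four digits, or nil if the
  # input is not exactly ten digits long.
  def parts
    @digits.match(/\A(\d{3})(\d{3})(\d{4})\z/)&.captures
  end

  def to_s
    area, middle, last = parts
    return @digits unless area # fall back to the raw digits

    "(#{area}) #{middle}-#{last}"
  end

  def area_code
    '#' + @digits[0, 3]
  end
end

a = PhoneNumber.new(1234567890)
puts a            # (123) 456-7890
puts a.area_code  # #123
```

Nothing is mutated in place here: the formatted string is built in to_s each time it is needed, which avoids the discarded scan/join results from the original code.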
I'm working through the Test First Ruby Master problems. My code for 08/book_titles is this:
class Book
  attr_accessor :title
  def title
    if @title.include?(' ')
      correct = @title.split.each_with_index.map {|x, index| ((x =~ /^a|an|of|the|or|in|and$/) && index != 0) ? x : x.capitalize}
      correct.join(' ')
      # this is throwing a weird error, the code looks right but isn't capitalizing last word (returns 'To Kill a mockingbird')
    else
      @title.capitalize
    end
  end
end
I tested the map portion separately, and it works fine. But in the entirety of the problem, it does not capitalize as it should be. It throws an rspec error:
1) Book title should capitalize every word except... articles a
Failure/Error: expect(@book.title).to eq("To Kill a Mockingbird")
expected: "To Kill a Mockingbird"
got: "To Kill a mockingbird"
Anyone know why?
I originally didn't include ^/$ in the regex. I got the same error with a different title, and adding those anchors fixed it for that case. But then the error showed up again with this title.
Because mockingbird contains "in":
('mockingbird' =~ /^a|an|of|the|or|in|and$/) #=> 4
I think you want this regex:
/^a$|^an$|^of$|^the$|^or$|^in$|^and$/
It is not necessary to break the string into words, modify the words and join them back into a string. In fact, doing that has the disadvantage that spacing between words may be altered. Here's one way of operating on the string directly.
wee_words = ["a", "an", "of", "the", "or", "in", "and"]
str = "a dAy in the life of waltEr mITTY"
str.capitalize.gsub(/\w+/) { |s| wee_words.include?(s) ? s : s.capitalize }
#=> "A Day in the Life of Walter Mitty"
str.capitalize upcases the first letter of the string and downcases all subsequent letters. As a result, the first word will never be treated as a wee_word, since it is capitalized (e.g., wee_words.include?("The") #=> false).
The regex is slightly incorrect. As written, it reads like this:
Match any string that
starts with 'a'
or contains 'an'
or contains 'of'
or contains 'the'
or contains 'or'
or contains 'in'
or ends in 'and'
What you really seem to want is something that reads like this:
Match any string that
only contains any of 'a', 'an', 'of', 'the', 'or', 'in', 'and'
To get this, you want your regex to be written like this:
/^(a|an|of|the|or|in|and)$/
Note the parentheses around the alternation. (Alternation is the formal term for multiple choices in a regex, where the choices are separated by '|'.)
If you're comparing against book or movie titles, this is much closer to the type of match you'd expect. It will match correctly for titles such as "Chariots of Fire" and "Benny and Joon", but will no longer falsely match the 'in' inside "To Kill a Mockingbird", which is a significant improvement.
However, it still won't quite work yet on something like "Benny AND Joon", because 'AND' is uppercase in this title (assuming that incoming titles may be arbitrarily mixed case). One last change will do it:
/^(a|an|of|the|or|in|and)$/i
That last letter 'i' at the end of the regex says to 'ignore case', so that matches can occur regardless of whether the 'AND' is uppercase, lowercase, or mixed case.
This should get you close to what you're trying to achieve and handle a few bumpy use cases in the process.
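Here is a rough sketch of how the grouped, case-insensitive regex could be used for the title problem. The titleize helper and the SMALL_WORDS constant are hypothetical names, and I have swapped ^ and $ for \A and \z, which anchor to the whole string rather than to a line; for single words without newlines the two behave the same.

```ruby
SMALL_WORDS = /\A(a|an|of|the|or|in|and)\z/i

def titleize(title)
  title.split.each_with_index.map do |word, index|
    # The first word is always capitalized; later small words are lowercased.
    if index > 0 && word =~ SMALL_WORDS
      word.downcase
    else
      word.capitalize
    end
  end.join(' ')
end

puts titleize("to kill a mockingbird")  # To Kill a Mockingbird
puts titleize("benny AND joon")         # Benny and Joon
```

Because the alternation is wrapped in a group, the anchors apply to every alternative, so "mockingbird" is no longer treated as a small word just because it contains "in".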
I am trying to figure out how to replace multiple characters in an array of strings by using multiple wildcards (or some other method, if someone knows a better one). Each element in the array is a telephone number and a date (e.g. 8675309,2015-01-20). I am trying to remove the comma and date so that each element in the array is the telephone number only.
When iterating over each element in the array, I obtained the expected results by calling .gsub! to replace a single character in each element.
file_data = ["8675309,2015-01-20"]
puts file_data[0] #=> 8675309,2015-01-20
file_data.each do |s|
s.gsub!(/0/, "X")
end
puts file_data[0] #=> 86753X9,2X15-X1-2X
To eliminate the comma and date, I tried simply using wildcards, calling s.gsub!(",****/**/**", ""). Then, this shows unexpected results:
file_data = ["8675309,2015-01-20"]
file_data.each do |s|
s.gsub!(/,****-**-**/, "")
end
puts file_data[0] #=> 8675309,2015-01-20
I also tried several other wildcard characters that have been suggested in other threads ('.' and '^'), but the results have not changed.
I am lost on how to eliminate the comma and date in each element while leaving the primary number intact. I thought .gsub! would be the proper method, but am open to any alternatives as well. Any help is appreciated.
At first glance, I might use String#split to get the phone number:
file_data = ["8675309,2015-01-20"]
phone_numbers = file_data.map {|s| s.split(',').first }
phone_numbers[0] #=> "8675309"
Or, if the phone number is always 7 characters, I might get a string subset with []:
file_data.map {|s| s[0,7] }
Or, if you really want to stick with a regular expression:
file_data.each do |s|
s.gsub!(/,.*\z/, '')
end
Which reads as: part of a string starting from the first comma to the end of the string, replace with nothing.
The way you are handling wildcards is excessive. Why are you using wildcards when you know what you want to sub? Removing commas and the date (as long as the date is always the same format) should be simple:
name = "8675309,2015-01-20"
name.gsub!(/,\d{4}-\d{2}-\d{2}/,"")
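For comparison, here is a small sketch that applies these approaches to a whole array without mutating the original strings. The second array element is made-up sample data, and the variable names are only illustrative; String#partition is included as one more option.

```ruby
file_data = ["8675309,2015-01-20", "5551234,2016-12-31"]

# In a regex, * is a quantifier ("zero or more of the previous item"),
# not a wildcard; \d matches one digit and {4} repeats it four times.
by_regex     = file_data.map { |s| s.sub(/,\d{4}-\d{2}-\d{2}\z/, '') }
by_split     = file_data.map { |s| s.split(',').first }
by_partition = file_data.map { |s| s.partition(',')[0] }

p by_regex  # ["8675309", "5551234"]
```

All three produce the same array here. The regex version is the strictest, since it only removes a suffix that really looks like a comma followed by a date.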
Use String#partition:
name.partition(',')[0]
#=> "8675309"
The regex in question is
/(<iframe.*?><\/iframe>)/
I am using this ruby regex to match sections of a string then creating an array of the results.
The string is
"<p><iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe></p>\n<p>#1<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=cabe5d3ba31da\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#2<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=b03d31e4b5663\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#3<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=f63895add1aac\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n"
I am calling the regex with .match() like so
/(<iframe.*?><\/iframe>)/.match(entry.content).to_a
The result is a duplicate of the first match
["<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>", "<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>"]
I used Rubular and I was able to get the Regex to work there http://rubular.com/r/CYF0vgQtrX
The result is a duplicate of the first match
Even though the docs for Regexp#match() do a horrible job of describing what match() does, it actually finds only the first match:
str = "abc"
md = /./.match(str)
p md.to_a
--output:--
["a"]
Regexp#match() returns a MatchData object when there is a match. A MatchData object contains matches for the whole match and for each group. If you call to_a() on a MatchData object, the return value is an Array containing the whole match and whatever matched each group in the regex:
str = "abc"
md = /(.)(.)(.)/.match(str)
p md.to_a
--output:--
["abc", "a", "b", "c"]
Because you specified a group in your regex, one result is the whole match, and the other result is what matched your group.
[A regex] was the first approach I thought of. If this wasn't going to
work, then I was going to use nokogiri
From now on, nokogiri should be your first thought...because:
If you have a programming problem, and you think, "I'll use a regex",
now you have two problems.
You should use scan instead of match here.
entry.content.scan(/<iframe.*?><\/iframe>/)
Using /(<iframe.*?><\/iframe>)/ will get a 2d array. The documentation says:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
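A short sketch of the three cases side by side, run against a shortened, made-up HTML string rather than the one from the question (the variable names are illustrative only):

```ruby
html = '<p><iframe src="a"></iframe></p><p><iframe src="b"></iframe></p>'

# match stops at the first hit; to_a is [whole match, group 1].
first_only = /(<iframe.*?><\/iframe>)/.match(html).to_a

# scan without a group returns every match as a flat array of strings.
all_flat = html.scan(/<iframe.*?><\/iframe>/)

# scan with a group returns one sub-array per match.
all_nested = html.scan(/(<iframe.*?><\/iframe>)/)

p first_only
p all_flat
p all_nested
```

first_only holds the first iframe twice, which is the "duplicate" from the question, while all_flat holds both iframes once each.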
In the book I'm reading to learn Rails (RailsSpace) , the author creates two functions (below) to turn all caps city names like LOS ANGELES into Los Angeles. There's something I don't get about the first function, below, however.
Namely, where does "word" come from? I understand that "word" is a local/block variable that disappears after the function has been completed, but what is being passed into/assigned to "word"? In other words, what is being split?
I would have expected there to have been some kind of argument taking an array or hash passed into this function...and then the "each" function run over that..
def capitalize_each
  space = " "
  split(space).each{ |word| word.capitalize! }.join(space)
end
# Capitalize each word in place.
def capitalize_each!
  replace capitalize_each
end
Let's break this up.
split(space)
turns the string into a list of would-be words. (Because the separator is a single space, split treats runs of whitespace as one separator, so two spaces in a row do not produce an empty string.) I assume this is an instance method in String; otherwise, split wouldn't be defined.
.each { |word| word.capitalize! }
.each takes each thing in the list (returned by split) and runs the following block on it, passing the thing as an arg to the block. The |word| says that this block is going to call the arg "word". So effectively, what this does is capitalize each word in the string (and each lonely bit of punctuation too, but that's not important -- capitalization doesn't change characters that have no concept of case).
.join(space)
glues the words back together, reinserting the space that was used to separate them before. The string it returns is the return value of the function as well.
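To make the three steps visible, here they are run one at a time on a plain string outside of any method. This is just an illustration; the city and words variables are mine.

```ruby
city  = "LOS ANGELES"
space = " "

words = city.split(space)               # the array that each will walk over
words.each { |word| word.capitalize! }  # each element is yielded to the block as word
result = words.join(space)

p words   # ["Los", "Angeles"]
p result  # "Los Angeles"
```

Inside the method there is no city variable because split is called on self, the string that received capitalize_each.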
At first I thought the method was incomplete because there is no self at the beginning, but even without it split is called on the receiving string; space simply acts as a default separator. This is how the method could look with an explicit self:
class String
def capitalize_each(separator = ' ')
self.split(separator).each{|word| word.capitalize!}.join(separator)
end
end
puts "LOS ANGELES".capitalize_each #=> Los Angeles
puts "LOS_ANGELES".capitalize_each('_') #=> Los_Angeles
The string is being split by spaces, i.e. into words.
So the 'each' iterator goes through all the words, one by one, each time the word is in the 'word' object. So then for that object (word) it uses the capitalize function for it. Finally it all gets joined back together With Spaces. So The End Result is Capitalized.
These methods are meant to be defined in the String class, so what is being split is whatever string you are calling the capitalize_each method on.
Some example usage (and a slightly better implementation):
class String
def capitalize_each
split(/\s+/).each{ |word| word.capitalize! }.join " "
end
def capitalize_each!
replace capitalize_each
end
end
puts "hi, i'm a sentence".capitalize_each #=> Hi, I'm A Sentence
Think of |word| word.capitalize! as a function which you're passing into the each method. The function has one argument (word) and simply evaluates .capitalize! on it.
Now what the each method is doing is taking each item in split(space) and evaluating your function on it. So:
["a", "b", "c"].each{|x| print x}
will evaluate, in order, print "a", print "b", print "c".
http://www.ruby-doc.org/core/classes/Array.html#M000231
To demystify this behavior a bit, it helps to understand exactly what it means to "take each item in __". Basically, any object which is enumerable can be .eached in this way.
If you're referring to how it gets into your block in the first place, it's yielded into the block. #split returns an Array, and its #each method is doing something along the lines of:
for object in stored_objects
yield object
end
This works, but if you want to turn one array into another array, it's idiomatically better to use map instead of each, like this:
words.map{|word|word.capitalize}
(Without the trailing !, capitalize makes a new string instead of modifying the old string, and map collects those new strings into a new array. In contrast, each returns the old array.)
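The difference in return values is easy to see in a tiny example (the variable names are only for illustration):

```ruby
words = ["los", "angeles"]

from_each = words.each { |word| word.capitalize }  # block results are discarded
from_map  = words.map  { |word| word.capitalize }  # block results are collected

p from_each  # ["los", "angeles"]
p from_map   # ["Los", "Angeles"]
```

This is why the each version in the book needs capitalize! with the bang: it relies on mutating the strings that each hands back unchanged.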
Or, following gunn's lead:
class String
def capitalize_each
self.split(/\s/).map{|word|word.capitalize}.join(' ')
end
end
"foo bar baz".capitalize_each #=> "Foo Bar Baz"
By default, split splits on runs of whitespace, but when given the regular expression /\s/ it splits on each individual whitespace character, even when several occur in a row.
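A quick way to compare the variants is to run them on a string with a doubled space; the sample string is my own:

```ruby
str = "foo  bar baz"  # note the two spaces after "foo"

p str.split         # ["foo", "bar", "baz"]
p str.split(" ")    # ["foo", "bar", "baz"]
p str.split(/\s/)   # ["foo", "", "bar", "baz"]
p str.split(/\s+/)  # ["foo", "bar", "baz"]
```

With /\s/ the empty string survives, so joining with ' ' afterwards keeps the original double space, whereas the other forms collapse it.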