I'm writing a client for a third-party API, and they provide data in a weird format. At first it might look like JSON, but it's not, and I'm a bit confused about how I should handle it.
It's a key-value based format (much like JSON).
Keys are separated by '=' from their values.
Keys and values are wrapped within double-quotes.
Dictionaries start with '{' and end with '}'.
Arrays start with '('
and end with ')'
Lines end with ';' (except for array contents) followed by an end-of-line character (\r, I think).
Sometimes, there seem to be unicode (Stuff like \U2623 for the BioHazard sign) in strings.
What could this format possibly be? Should I use an existing gem to parse it, or should I build my own parser?
{ "anArray" = (
"100",
"200",
"300"
);
"aDictionary" = {
"aString" = "Something";
};
}
EDIT This format seems to be an Apple property list, but it's neither XML nor binary... This makes sense, as the API comes from a WebObjects web service. I will try to use the CFPropertyList gem to parse it; if there is a better solution, please let me know.
EDIT 2 This is a NeXTSTEP property list.
Here's a robust answer using a custom StringScanner-based parser. It allows whitespace to be optional, allows a trailing comma after the last item in a list, and allows omitting the semicolon after the last dictionary key/value pair. It allows the outermost item to be a dictionary, array, or string. And it allows really any sort of legal string content, including parens, curly braces, and escaped text like \n.
Seen in action:
p parse('{ "array" = ( "1", "2", ( "3", "4" ) ); "hash"={ "key"={ "more"="oh}]yes;!"; }; }; }')
#=> {"array"=>["1", "2", ["3", "4"]], "hash"=>{"key"=>{"more"=>"oh}]yes;!"}}}
puts parse('("Escaped \"Quotes\" Allowed", "And Unicode \u2623 OK")')
#=> Escaped "Quotes" Allowed
#=> And Unicode ☣ OK
The code:
require 'strscan'
def parse(str)
ss, getstr, getary, getdct = StringScanner.new(str)
getvalue = ->{
if ss.scan /\s*\{\s*/ then getdct[]
elsif ss.scan /\s*\(\s*/ then getary[]
elsif str = getstr[] then str
elsif ss.scan /\s*[)}]\s*/ then nil end
}
getstr = ->{
if str=ss.scan(/\s*"(?:[^"\\]|\\u\d+|\\.)*"\s*/i)
eval str.gsub(/([^\\](?:\\\\)*)#(?=[{@$])/,'\1\#')
end
}
getary = ->{
[].tap do |a|
while v=getvalue[]
a << v
ss.scan /\s*,\s*/
end
end
}
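getdct = ->{
{}.tap do |h|
while key = getstr[]
ss.scan /\s*=\s*/
if value=getvalue[] then h[key]=value; ss.scan(/\s*;\s*/) end
end
ss.scan(/\s*\}\s*/) # consume this dictionary's closing brace so later keys of the parent are still read
end
}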
getvalue[]
end
As an alternative to rolling your own parser from scratch in the future, you might also want to look into the Treetop Ruby library.
Edit: I've replaced the implementation of getstr above with one that should prevent running arbitrary Ruby code inside the eval. For more details, see "Eval a string without interpolation". Seen in action:
@secret = "OH NO!"
$secret = "OH NO!"
@@secret = "OH NO!"
puts parse('"\"#{:NOT&&:very}\" bad. \u262E\n@@secret \\#$secret \\\\#@@secret"')
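If calling eval on data from a third-party API is a concern at all, the escapes can also be decoded by hand. The following is only a sketch: unescape_plist_string and SIMPLE_ESCAPES are hypothetical names, and the set of escapes handled (\n, \t, \r, \", \\ and four-digit \u or \U code points) is an assumption based on the examples in the question, not on a format specification.

```ruby
# Hypothetical table of single-character escapes (an assumption, not from a spec).
SIMPLE_ESCAPES = { 'n' => "\n", 't' => "\t", 'r' => "\r", '"' => '"', '\\' => '\\' }.freeze

# Hypothetical helper: decodes a double-quoted token without calling eval,
# so "#{...}" in the input is never interpolated.
def unescape_plist_string(quoted)
  body = quoted.strip[1...-1]               # drop the surrounding double quotes
  body.gsub(/\\(?:[uU](\h{4})|(.))/m) do
    if $1
      [$1.hex].pack('U')                    # \u2623 or \U2623 -> that code point as UTF-8
    else
      SIMPLE_ESCAPES.fetch($2, $2)          # unknown escapes keep the escaped character
    end
  end
end

puts unescape_plist_string('"Escaped \"Quotes\" and \U2623"')
```

With something like this in place, getstr could return unescape_plist_string(str) instead of evaluating the token.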
Here's a very quick-and-dirty hack that transforms the syntax into valid Ruby and then evals it. Note that this could be dangerous. More importantly, this will convert all parentheses inside keys and values into square brackets.
def parse(str)
eval(
str
.gsub( /" = (?=[({"])/, '" => ' ) # Dictionary separators become =>
.gsub( /(?<=[)}"]); (?=[)}"])/, ', ' ) # Dictionary semicolons become ,
.tr( '()', '[]' ) # ALL parens become square brackets
)
end
p parse('{ "anArray" = ( "100", "200", "300" ); "aDictionary" = { "aString" = "Something"; }; }')
#=> {"anArray"=>["100", "200", "300"], "aDictionary"=>{"aString"=>"Something"}}
I was recently asked this in an interview and was figuring out a way to do this without using regex in Ruby as I was told it would be a bonus if you can solve it without using regex.
Question: Assume that the hash has 1 million key-value pairs and we have to substitute the variables in the string that are enclosed in % signs (the %name% pattern). How would I be able to do this without regex?
We have a string str = "%greet%! Hi there, %var_1% that can be any other %var_2% injected to the %var_3%. Nice!, goodbye)"
we have a hash called dict = { greet: 'Hi there', var_1: 'FIRST VARIABLE', var_2: 'values', var_3: 'string', }
This was my solution:
def template(str, dict)
vars = str.scan(/%(.*?)%/).flatten
vars.each do |var|
str = str.gsub("%#{var}%", dict[var.to_sym])
end
str
end
There are many ways to solve this, but you will probably need some kind of parsing and / or lexical analysis if you don't want to use built-in pattern matching.
Let's keep it very simple and say that your string's content falls into two categories, text and variable, which are separated by % (you could also think of the variables as being enclosed by %, but that's harder to implement), e.g.
str = "Hello %name%, hope to see you %when%!"
# TTTTTT VVVV TTTTTTTTTTTTTTTTTT VVVV T
As you can see, the categories are alternating. We can utilize this and write a little helper method that turns a string into a list of [type, value] pairs, something like this:
def each_part(str)
return enum_for(__method__, str) unless block_given?
type = [:text, :var].cycle
buf = ''
str.each_char do |char|
if char != '%'
buf << char
else
yield type.next, buf
buf = ''
end
end
yield type.next, buf
end
It starts by defining an enumerator that will cycle between the two types and an empty buffer. It will then read each_char from the string. If the char is not %, it will just append it to the buffer and keep reading. Once it encounters a %, it will yield the current buffer along with the type and start a new buffer (next will also switch the type). After the loop ends, it will yield once more to output the remaining characters.
It outputs this kind of data:
each_part(str).to_a
#=> [[:text, "Hello "],
# [:var, "name"],
# [:text, ", hope to see you "],
# [:var, "when"],
# [:text, "!"]]
We can use this to convert the string:
dict = { name: 'Tom', when: 'soon' }
output = ''
each_part(str) do |type, value|
case type
when :text
output << value
when :var
output << dict[value.to_sym]
end
end
p output
#=> "Hello Tom, hope to see you soon!"
You could of course combine parsing and evaluation, but I like the separation. A full-fledged parser might involve even more steps.
A very simple approach:
First, split the string on '%':
str = "%greet%! Hi there, %var_1% that can be any other %var_2% injected to the %var_3%. Nice!, goodbye)"
chunks = str.split('%')
Now we can assume, given the way the problem has been specified, that every other "chunk" will be a key to replace. Iterating with the index makes that easier to figure out.
chunks.each_with_index { |c, i| chunks[i] = (i.even? ? c : dict[c.to_sym]) }.join
Result:
"Hi there! Hi there, FIRST VARIABLE that can be any other values injected to the string. Nice!, goodbye)"
Note: this does not handle malformed input well at all.
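For input that may be malformed, a slightly more defensive regex-free variant is possible. This is a minimal sketch, assuming that an unmatched % and unknown keys should simply be copied through unchanged; fill_template is a hypothetical name. It walks the string once with String#index, so the cost depends on the length of the string rather than on the size of the hash, which is only used for lookups.

```ruby
# Hypothetical helper: substitutes %name% placeholders without using a regex.
def fill_template(str, dict)
  output = +''
  pos = 0
  while (open = str.index('%', pos))
    close = str.index('%', open + 1)
    break unless close                      # unmatched '%': the rest is copied verbatim below
    output << str[pos...open]               # plain text before the placeholder
    key = str[(open + 1)...close].to_sym
    # Assumption: unknown keys are left in place rather than raising.
    output << (dict.key?(key) ? dict[key].to_s : str[open..close])
    pos = close + 1
  end
  output << str[pos..-1]                    # whatever follows the last placeholder
end

puts fill_template("Hello %name%, hope to see you %when%!", { name: 'Tom', when: 'soon' })
```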
I have:
event = {"first_type_a" => 100, "first_type_b" => false, "second_type_a" => "abc", "second_type_b" => false}
I am trying to change the keys of event based on the original keys. If a key contains a given substring, it should be renamed: the new key-value pair should be added and the old pair removed. I expect to get:
event = {"important.a_1" => 100, "important.b_1" => false, "second_type_a" => "abc", "second_type_b" => false}
What would be the most efficient way to update event?
I expected this to work:
event.each_pair { |k, v|
if !!(/first_type_*/ =~ k) do
key = "#{["important", k.split("_", 3)[2]].join(".")}";
event.merge!(key: v);
event.delete(k)
end
}
but it raises an error:
simpleLoop.rb:5: syntax error, unexpected keyword_do, expecting keyword_then or ';' or '\n'
... if !!(/first_type_*/ =~ k) do
... ^~
simpleLoop.rb:9: syntax error, unexpected keyword_end, expecting '}'
end
^~~
simpleLoop.rb:21: embedded document meets end of file
I thought to approach it differently:
if result = event.keys.find { |k| k.include? "first_type_" } [
key = "#{["important", k.split("_", 3)[2]].join(".")}"
event.merge!(key: v)
event.delete(k)
]
but still no luck. I am counting all brackets, and as the error indicates, it is something there, but I can't find it. Does the order matter?
I will just show how this can be done quite economically in Ruby. Until recently, when wanting to modify keys of a hash one usually would do one of two things:
create a new empty hash and then add key/values pairs to the hash; or
convert the hash to an array a of two-element arrays (key-value pairs), modify the first element of each element of a (the key) and then convert a to the desired hash.
Recently (with MRI v2.5) the Ruby monks bestowed on us the handy methods Hash#transform_keys and Hash#transform_keys!. We can use the first of these profitably here. First we need a regular expression to match keys.
r = /
\A # match beginning of string
first_type_ # match string
(\p{Lower}+) # match 1+ lowercase letters in capture group 1
\z # match the end of the string
/x # free-spacing regex definition mode
Conventionally, this is written
r = /\Afirst_type_(\p{Lower}+)\z/
The use of free-spacing mode makes the regex self-documenting. We now apply the transform_keys method, together with the method String#sub and the regex just defined.
event = {"first_type_a"=>100, "first_type_b"=>false,
"second_type_a"=>"abc", "second_type_b"=>false}
event.transform_keys { |k| k.sub(r, "important.#{'\1'}_1") }
#=> {"important.a_1"=>100, "important.b_1"=>false,
# "second_type_a"=>"abc", "second_type_b"=>false}
In the regex, the \p{} expression \p{Lower} could be replaced with \p{Ll}, the POSIX bracket expression [[:lower:]] (both match Unicode lowercase letters) or [a-z], but the last has the disadvantage that it will not match letters with diacritical marks. That includes letters of words borrowed from other languages that are used in English text (such as rosé, the wine). Search Regexp for documentation of POSIX and \p{} expressions.
If "first_type_" could be followed by lowercase or uppercase letters use \p{Alpha}; if it could be followed by alphanumeric characters, use \p{Alnum}, and so on.
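The difference between these character classes is easy to check. A minimal sketch, in which the constant names, the rename_key helper and the sample keys are all made up for illustration:

```ruby
ASCII_LOWER   = /\Afirst_type_([a-z]+)\z/
UNICODE_LOWER = /\Afirst_type_(\p{Lower}+)\z/
ANY_LETTER    = /\Afirst_type_(\p{Alpha}+)\z/

# Hypothetical helper: renames a key if it matches the pattern, otherwise
# returns it unchanged (String#sub leaves non-matching strings alone).
def rename_key(key, pattern)
  key.sub(pattern, 'important.\1_1')
end

p rename_key("first_type_rosé", ASCII_LOWER)   # [a-z] does not match "é", key is unchanged
p rename_key("first_type_rosé", UNICODE_LOWER) # \p{Lower} matches "é"
p rename_key("first_type_Ab", ANY_LETTER)      # \p{Alpha} also accepts uppercase letters
```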
An if already opens a block that must be closed with end, just like other structures such as if ... else ... end, do ... end and begin ... rescue ... end.
Therefore, in your first example, remove the do after the if; the block is already open. I also changed the block after each_pair to use do ... end rather than braces, to help avoid confusing a hash with a block.
event = { 'first_type_a' => 100, 'first_type_b' => false, 'second_type_a' => 'abc', 'second_type_b' => false }
new_event = {}
event.each_pair do |k, v|
if !!(/first_type_*/ =~ k)
important_key = ['important', k.split('_', 3)[2]].join('.')
new_event[important_key] = v
else
new_event[k] = v
end
end
You could define a method to be used inside the transform key call:
def transform(str)
return ['important.', str.split('_').last, '_1'].join() if str[0..4] == 'first' # I checked just for "first"
str
end
event.transform_keys! { |k| transform(k) } # Ruby >= 2.5
event.map { |k, v| [transform(k), v] }.to_h # Ruby < 2.5
Using Hash#each_pair with Enumerator#with_object:
event.each_pair.with_object({}) { |(k, v), h| h[transform(k)] = v }
Or use as one liner:
event.transform_keys { |str| str[0..4] == 'first' ? ['important.', str.split('_').last, '_1'].join() : str }
I'm attempting to write a function that takes a string and returns it with all vowels removed. Below is my code.
def vowel(str)
result = ""
new = str.split(" ")
i = 0
while i < new.length
if new[i] == "a"
i = i + 1
elsif new[i] != "a"
result = new[i] + result
end
i = i + 1
end
return result
end
When I run the code, it returns the exact string that I entered for (str). For example, if I enter "apple", it returns "apple".
This was my original code. It had the same result.
def vowel(str)
result = ""
new = str.split(" ")
i = 0
while i < new.length
if new[i] != "a"
result = new[i] + result
end
i = i + 1
end
return result
end
I need to know what I am doing wrong with this approach.
Finding the bug
Let's see what's wrong with your original code by executing your method's code in IRB:
$ irb
irb(main):001:0> str = "apple"
#=> "apple"
irb(main):002:0> new = str.split(" ")
#=> ["apple"]
Bingo! ["apple"] is not the expected result. What does the documentation for String#split say?
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array of these substrings.
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
Our pattern is a single space, so split returns an array of words. This is definitely not what we want. To get the desired result, i.e. an array of characters, we could pass an empty string as the pattern:
irb(main):003:0> new = str.split("")
#=> ["a", "p", "p", "l", "e"]
"split on empty string" feels a bit hacky and indeed there's another method that does exactly what we want: String#chars
chars → an_array
Returns an array of characters in str. This is a shorthand for str.each_char.to_a.
Let's give it a try:
irb(main):004:0> new = str.chars
#=> ["a", "p", "p", "l", "e"]
Perfect, just as advertised.
Another bug
With the new method in place, your code still doesn't return the expected result (I'm going to omit the IRB prompt from now on):
vowel("apple") #=> "elpp"
This is because
result = new[i] + result
prepends the character to the result string. To append it, we have to write
result = result + new[i]
Or even better, use the append method String#<<:
result << new[i]
Let's try it:
def vowel(str)
result = ""
new = str.chars
i = 0
while i < new.length
if new[i] != "a"
result << new[i]
end
i = i + 1
end
return result
end
vowel("apple") #=> "pple"
That looks good, "a" has been removed ("e" is still there, because you only check for "a").
Now for some refactoring.
Removing the explicit loop counter
Instead of a while loop with an explicit loop counter, it's more idiomatic to use something like Integer#times:
new.length.times do |i|
# ...
end
or Range#each:
(0...new.length).each do |i|
# ...
end
or Array#each_index:
new.each_index do |i|
# ...
end
Let's apply the last one:
def vowel(str)
result = ""
new = str.chars
new.each_index do |i|
if new[i] != "a"
result << new[i]
end
end
return result
end
Much better. We don't have to worry about initializing the loop counter (i = 0) or incrementing it (i = i + 1) any more.
Avoiding character indices
Instead of iterating over the character indices via each_index:
new.each_index do |i|
if new[i] != "a"
result << new[i]
end
end
we can iterate over the characters themselves using Array#each:
new.each do |char|
if char != "a"
result << char
end
end
Removing the character array
We don't even have to create the new character array. Remember the documentation for chars?
This is a shorthand for str.each_char.to_a.
String#each_char passes each character to the given block:
def vowel(str)
result = ""
str.each_char do |char|
if char != "a"
result << char
end
end
return result
end
The return keyword is optional. We could just write result instead of return result, because a method's return value is the last expression that was evaluated.
Removing the explicit string
Ruby even allows you to pass an object into the loop using Enumerator#with_object, thus eliminating the explicit result string:
def vowel(str)
str.each_char.with_object("") do |char, result|
if char != "a"
result << char
end
end
end
with_object passes "" into the block as result and returns it (after the characters have been appended within the block). It is also the last expression in the method, i.e. its return value.
You could also use if as a modifier, i.e.:
result << char if char != "a"
Alternatives
There are many different ways to remove characters from a string.
Another approach is to filter out the vowel characters using Enumerable#reject (it returns a new array containing the remaining characters) and then join the characters (see Nathan's answer for a version to remove all vowels):
def vowel(str)
str.each_char.reject { |char| char == "a" }.join
end
For basic operations like string manipulation however, Ruby usually already provides a method. Check out the other answers for built-in alternatives:
str.delete('aeiouAEIOU') as shown in Gagan Gami's answer
str.tr('aeiouAEIOU', '') as shown in Cary Swoveland's answer
str.gsub(/[aeiou]/i, '') as shown in Avinash Raj's answer
Naming things
Cary Swoveland pointed out that vowel is not the best name for your method. Choose the names for your methods, variables and classes carefully. It's desirable to have a short and succinct method name, but it should also communicate its intent.
vowel(str) obviously has something to do with vowels, but it's not clear what it is. Does it return a vowel or all vowels from str? Does it check whether str is a vowel or contains a vowel?
remove_vowels or delete_vowels would probably be a better choice.
Same for variables: new is an array of characters. Why not call it characters (or chars if space is an issue)?
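Putting the naming advice together with the each_char version from above, a sketch that removes all vowels rather than just "a" could look like this. The constant VOWELS and the choice to treat only the ASCII letters a, e, i, o, u (in both cases) as vowels are assumptions made for the example.

```ruby
VOWELS = 'aeiouAEIOU'

# Returns a copy of str without ASCII vowels, built character by character.
def remove_vowels(str)
  str.each_char.with_object(+'') do |char, result|
    result << char unless VOWELS.include?(char)
  end
end

puts remove_vowels("apple")
```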
Bottom line: read the fine manual and get to know your tools. Most of the time, an IRB session is all you need to debug your code.
I would use a regex.
str.gsub(/[aeiou]/i, "")
> string= "This Is my sAmple tExt to removE vowels"
#=> "This Is my sAmple tExt to removE vowels"
> string.delete 'aeiouAEIOU'
#=> "Ths s my smpl txt t rmv vwls"
You can create a method like this:
def remove_vowel(str)
result = str.delete 'aeiouAEIOU'
return result
end
remove_vowel("Hello World, This is my sample text")
# output : "Hll Wrld, Ths s my smpl txt"
Assuming you're trying to learn about the basics of programming, rather than finding the quickest one-liner to do this (which would be to use a regular expression as Avinash has said), you have a number of problems with your code you need to change.
new = str.split(" ")
This line is likely the culprit, because it splits the string based on spaces. So your input string would have to be "a p p l e" to have the effect you're looking for.
new = str.split("")
You should also remove the duplicate i = i+1 once you've changed that.
As others have already identified the problems with the OP's code, I will merely suggest an alternative; namely, you could use String#tr:
"Now is the time for all good people...".tr('aeiouAEIOU', '')
#=> "Nw s th tm fr ll gd ppl..."
If regex is not allowed, you can do it this way:
def remove_vowels(string)
string.split("").delete_if { |letter| %w[a e i o u].include? letter.downcase }.join
end
Is it possible to do some ASCII operations in Ruby, like what we did in C++?
char *s = "test string";
for(int i = 0 ; i < strlen(s) ; i++) printf("%c", s[i] + 2);
// expected output: vguv"uvtkpi
How do I achieve a similar goal in Ruby? From some research I think String#each_byte might help here, but I'd like to use higher-order functions (something like Array#map) to translate the string directly, without an explicit for loop.
The task I'm trying to solve: Referring to this page, I'm trying to solve it using Ruby, and it seems a character-by-character translation is needed to apply to the string.
Pay close attention to the hint given by the question in the Challenge, then use String's tr method:
"test string".tr('a-z', 'c-zab')
# => "vguv uvtkpi"
An additional hint to solve the problem is, you should only be processing characters. Punctuation and spaces should be left alone.
Use the above tr on the string in the Python Challenge, and you'll see what I mean.
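To illustrate the point about punctuation and spaces, here is a small sketch. The method name shift_two is made up, and handling uppercase letters with a second pair of ranges is an assumption that goes beyond the hint itself.

```ruby
# Shifts ASCII letters two places forward, wrapping around at the end of the
# alphabet; every other character is passed through unchanged by String#tr.
def shift_two(text)
  text.tr('a-zA-Z', 'c-zabC-ZAB')
end

puts shift_two("test string, xyz!")
```

Characters that are not listed in the first argument, such as the space, the comma and the exclamation mark, are simply copied, which is exactly what the arithmetic-based approaches get wrong.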
Use String#each_char and String#ord and Integer#chr:
s = "test string"
s.each_char.map { |ch| (ch.ord + 2).chr }.join
# => "vguv\"uvtkpi"
or String#each_byte:
s.each_byte.map { |b| (b + 2).chr }.join
# => "vguv\"uvtkpi"
or String#next:
s.each_char.map { |ch| ch.next.next }.join
# => "vguv\"uvtkpi"
You can use codepoints or each_codepoint methods, for example:
old_string = 'test something'
new_string = ''
old_string.each_codepoint {|x| new_string << (x+2).chr}
p new_string #=> "vguv\"uqogvjkpi"
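One caveat with the chr-based versions: called without an argument, Integer#chr is meant for single-byte values, and as far as I know it raises a RangeError for larger code points under the default encoding settings. A hedged sketch that passes the encoding explicitly (shift_codepoints is a hypothetical name):

```ruby
# Shifts every code point by offset and re-encodes each one as UTF-8, so
# characters outside the ASCII range are handled as well.
def shift_codepoints(text, offset = 2)
  text.each_codepoint.map { |cp| (cp + offset).chr(Encoding::UTF_8) }.join
end

p shift_codepoints("test string")
```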
I have a string:
s="123--abc,123--abc,123--abc"
I tried using Ruby 1.9's new feature "named groups" to fetch all named group info:
/(?<number>\d*)--(?<chars>\s*)/
Is there an API like Python's findall which returns a collection of MatchData objects? In this case I need three matches, because 123 and abc each appear three times. Each MatchData should contain the details of every named capture, so I can use m['number'] to get the matched value.
Named captures are suitable only for one matching result.
Ruby's analogue of findall is String#scan. You can either use scan result as an array, or pass a block to it:
irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"
irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb* p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"
Chiming in super-late, but here's a simple way of replicating String#scan but getting the matchdata instead:
matches = []
foo.scan(regex){ matches << $~ }
matches now contains the MatchData objects that correspond to scanning the string.
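Applied to the string from the question, that might look like the following sketch. The wrapper name all_matches is made up, and Regexp.last_match is used as the spelled-out equivalent of $~.

```ruby
# Collects one MatchData per match by reading the last match inside the
# block that String#scan calls for every match.
def all_matches(str, regex)
  matches = []
  str.scan(regex) { matches << Regexp.last_match }
  matches
end

s = "123--abc,123--abc,123--abc"
all_matches(s, /(?<number>\d+)--(?<chars>[a-z]+)/).each do |m|
  puts "#{m[:number]} / #{m[:chars]}"
end
```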
You can extract the used variables from the regexp using names method. So what I did is, I used regular scan method to get the matches, then zipped names and every match to create a Hash.
class String
def scan2(regexp)
names = regexp.names
scan(regexp).collect do |match|
Hash[names.zip(match)]
end
end
end
Usage:
>> "aaa http://www.google.com.tr aaa https://www.yahoo.com.tr ddd".scan2 /(?<url>(?<protocol>https?):\/\/[\S]+)/
=> [{"url"=>"http://www.google.com.tr", "protocol"=>"http"}, {"url"=>"https://www.yahoo.com.tr", "protocol"=>"https"}]
@Nakilon is correct in showing scan with a regex; however, you don't even need to venture into regex land if you don't want to:
s = "123--abc,123--abc,123--abc"
s.split(',')
#=> ["123--abc", "123--abc", "123--abc"]
s.split(',').inject([]) { |a,s| a << s.split('--'); a }
#=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
This returns an array of arrays, which is convenient if you have multiple occurrences and need to see/process them all.
s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
#=> {"123"=>"abc"}
This returns a hash, which, because the elements have the same key, has only the unique key value. This is good when you have a bunch of duplicate keys but want the unique ones. Its downside occurs if you need the unique values associated with the keys, but that appears to be a different question.
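If all values per key are needed instead, one option is to collect them into arrays. This is a sketch only; group_pairs is a hypothetical name and the sample input is made up.

```ruby
# Groups the part after "--" under the part before it, keeping duplicates.
def group_pairs(str)
  str.split(',').each_with_object(Hash.new { |h, k| h[k] = [] }) do |pair, grouped|
    number, chars = pair.split('--')
    grouped[number] << chars
  end
end

p group_pairs("123--abc,123--def,456--ghi")
```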
If using ruby >=1.9 and the named captures, you could:
class String
def scan2(regexp2_str, placeholders = {})
return regexp2_str.to_re(placeholders).match(self)
end
def to_re(placeholders = {})
re2 = self.dup
separator = placeholders.delete(:SEPARATOR) || '' #Returns and removes separator if :SEPARATOR is set.
#Search for the pattern placeholders and replace them with the regex
placeholders.each do |placeholder, regex|
re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
end
return Regexp.new(re2, Regexp::MULTILINE) #Returns regex using named captures.
end
end
Usage (ruby >=1.9):
> "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
or
> re="num4:name".to_re(num4:'\d{4}', name:'\w+')
=> /(?<num4>\d{4}):(?<name>\w+)/m
> m=re.match("1234:Kalle")
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
> m[:num4]
=> "1234"
> m[:name]
=> "Kalle"
Using the separator option:
> "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
I needed something similar recently. This should work like String#scan, but return an array of MatchData objects instead.
class String
# This method will return an array of MatchData's rather than the
# array of strings returned by the vanilla `scan`.
def match_all(regex)
match_str = self
match_datas = []
while match_str.length > 0 do
md = match_str.match(regex)
break unless md
match_datas << md
match_str = md.post_match
end
return match_datas
end
end
Running your sample data in the REPL results in the following:
> "123--abc,123--abc,123--abc".match_all(/(?<number>\d*)--(?<chars>[a-z]*)/)
=> [#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">]
You may also find my test code useful:
describe String do
describe :match_all do
it "it works like scan, but uses MatchData objects instead of arrays and strings" do
mds = "ABC-123, DEF-456, GHI-098".match_all(/(?<word>[A-Z]+)-(?<number>[0-9]+)/)
mds[0][:word].should == "ABC"
mds[0][:number].should == "123"
mds[1][:word].should == "DEF"
mds[1][:number].should == "456"
mds[2][:word].should == "GHI"
mds[2][:number].should == "098"
end
end
end
I really liked @Umut-Utkan's solution, but it didn't quite do what I wanted, so I rewrote it a bit (note: the code below might not be beautiful, but it seems to work)
class String
def scan2(regexp)
names = regexp.names
captures = Hash.new { |hash, key| hash[key] = [] } # each capture name collects an array of values
scan(regexp).each do |match|
names.zip(match).each do |name, value|
captures[name.to_sym] << value
end
end
return captures
end
end
Now, if you do
p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
You get
{:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}
(ie. all the alpha characters found in one array, and all the digits found in another array). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!
A year ago I wanted regular expressions that were easier to read and that named the captures, so I made the following addition to String (it should maybe not live there, but it was convenient at the time):
scan2.rb:
class String
#Works as scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parentheses.
#Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
#the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
#Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurrences of #STRING# will work as #{STRING}
#but is needed for the method to see the names to be used as indices.
def scan2(regexp2_str, mark='#')
regexp = regexp2_str.to_re(mark) #Evaluates the strings. Note: Must be reachable from here!
hash_indices_array = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten #Look for string variable names within (#VAR#) or # replaced by <mark>
match_array = self.scan(regexp)
#Save matches in hash indexed by string variable names:
match_hash = Hash.new
match_array.flatten.each_with_index do |m, i|
match_hash[hash_indices_array[i].to_sym] = m
end
return match_hash
end
def to_re(mark='#')
re = /#{mark}(.*?)#{mark}/
return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE) #Evaluates the strings, creates RE. Note: Variables must be reachable from here!
end
end
Example usage (irb1.9):
> load 'scan2.rb'
> AREA = '\d+'
> PHONE = '\d+'
> NAME = '\w+'
> "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}
Notes:
Of course it would have been more elegant to put the patterns (e.g. AREA, PHONE...) in a hash and add this hash with patterns to the arguments of scan2.
Piggybacking off of Mark Hubbart's answer, I added the following monkey-patch:
class ::Regexp
def match_all(str)
matches = []
str.scan(self) { matches << $~ }
matches
end
end
which can be used as /(?<letter>\w)/.match_all('word'), and returns:
[#<MatchData "w" letter:"w">, #<MatchData "o" letter:"o">, #<MatchData "r" letter:"r">, #<MatchData "d" letter:"d">]
This relies on, as others have said, the use of $~ in the scan block for the match data.
I like the match_all given by John, but I think it has an error.
The line:
match_datas << md
works if there are no captures () in the regex.
This code stores the whole MatchData, whose [0] element is the entire substring matched by the regex. If the regex has captures (), then this result is probably not what the user (me) wants in the eventual output.
I think in the case where there are captures () in regex, the correct code should be:
match_datas << md[1]
The eventual output of match_datas will then be an array of captured strings, starting from match_datas[0]. This is not what would be expected if normal MatchData objects are wanted, where md[0] is the whole matched substring and md[1], md[2], ... are the captures (if any) in the regex pattern.
Things are complex - which may be why match_all was not included in native MatchData.
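For reference, a single MatchData already exposes both views, so keeping the whole object loses nothing. A small sketch; match_parts is a hypothetical helper and the sample string is made up.

```ruby
# Returns the different views a MatchData offers on one match.
def match_parts(md)
  {
    whole: md[0],                 # the entire matched substring
    first_capture: md[1],         # the first capture group only
    captures: md.captures,        # all capture groups, without the whole match
    named: md.named_captures      # capture names mapped to their values
  }
end

md = /(?<word>[A-Z]+)-(?<number>[0-9]+)/.match("see ABC-123 here")
p match_parts(md)
```

A caller who only wants the captures can therefore map over the array of MatchData objects, for example with match_datas.map(&:captures), instead of changing match_all itself.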