I have a string:
s="123--abc,123--abc,123--abc"
I tried using Ruby 1.9's new feature "named groups" to fetch all named group info:
/(?<number>\d*)--(?<chars>\s*)/
Is there an API like Python's findall which returns a matchdata collection? In this case I need to return two matches, because 123 and abc repeat twice. Each match data contains of detail of each named capture info so I can use m['number'] to get the match value.
Named captures are suitable only for one matching result.
Ruby's analogue of findall is String#scan. You can either use scan result as an array, or pass a block to it:
irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"
irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb* p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"
Chiming in super-late, but here's a simple way of replicating String#scan but getting the matchdata instead:
matches = []
foo.scan(regex){ matches << $~ }
matches now contains the MatchData objects that correspond to scanning the string.
You can extract the used variables from the regexp using names method. So what I did is, I used regular scan method to get the matches, then zipped names and every match to create a Hash.
class String
def scan2(regexp)
names = regexp.names
scan(regexp).collect do |match|
Hash[names.zip(match)]
end
end
end
Usage:
>> "aaa http://www.google.com.tr aaa https://www.yahoo.com.tr ddd".scan2 /(?<url>(?<protocol>https?):\/\/[\S]+)/
=> [{"url"=>"http://www.google.com.tr", "protocol"=>"http"}, {"url"=>"https://www.yahoo.com.tr", "protocol"=>"https"}]
#Nakilon is correct showing scan with a regex, however you don't even need to venture into regex land if you don't want to:
s = "123--abc,123--abc,123--abc"
s.split(',')
#=> ["123--abc", "123--abc", "123--abc"]
s.split(',').inject([]) { |a,s| a << s.split('--'); a }
#=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
This returns an array of arrays, which is convenient if you have multiple occurrences and need to see/process them all.
s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
#=> {"123"=>"abc"}
This returns a hash, which, because the elements have the same key, has only the unique key value. This is good when you have a bunch of duplicate keys but want the unique ones. Its downside occurs if you need the unique values associated with the keys, but that appears to be a different question.
If using ruby >=1.9 and the named captures, you could:
class String
def scan2(regexp2_str, placeholders = {})
return regexp2_str.to_re(placeholders).match(self)
end
def to_re(placeholders = {})
re2 = self.dup
separator = placeholders.delete(:SEPARATOR) || '' #Returns and removes separator if :SEPARATOR is set.
#Search for the pattern placeholders and replace them with the regex
placeholders.each do |placeholder, regex|
re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
end
return Regexp.new(re2, Regexp::MULTILINE) #Returns regex using named captures.
end
end
Usage (ruby >=1.9):
> "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
or
> re="num4:name".to_re(num4:'\d{4}', name:'\w+')
=> /(?<num4>\d{4}):(?<name>\w+)/m
> m=re.match("1234:Kalle")
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
> m[:num4]
=> "1234"
> m[:name]
=> "Kalle"
Using the separator option:
> "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
I needed something similar recently. This should work like String#scan, but return an array of MatchData objects instead.
class String
# This method will return an array of MatchData's rather than the
# array of strings returned by the vanilla `scan`.
def match_all(regex)
match_str = self
match_datas = []
while match_str.length > 0 do
md = match_str.match(regex)
break unless md
match_datas << md
match_str = md.post_match
end
return match_datas
end
end
Running your sample data in the REPL results in the following:
> "123--abc,123--abc,123--abc".match_all(/(?<number>\d*)--(?<chars>[a-z]*)/)
=> [#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">]
You may also find my test code useful:
describe String do
describe :match_all do
it "it works like scan, but uses MatchData objects instead of arrays and strings" do
mds = "ABC-123, DEF-456, GHI-098".match_all(/(?<word>[A-Z]+)-(?<number>[0-9]+)/)
mds[0][:word].should == "ABC"
mds[0][:number].should == "123"
mds[1][:word].should == "DEF"
mds[1][:number].should == "456"
mds[2][:word].should == "GHI"
mds[2][:number].should == "098"
end
end
end
I really liked #Umut-Utkan's solution, but it didn't quite do what I wanted so I rewrote it a bit (note, the below might not be beautiful code, but it seems to work)
class String
def scan2(regexp)
names = regexp.names
captures = Hash.new
scan(regexp).collect do |match|
nzip = names.zip(match)
nzip.each do |m|
captgrp = m[0].to_sym
captures.add(captgrp, m[1])
end
end
return captures
end
end
Now, if you do
p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
You get
{:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}
(ie. all the alpha characters found in one array, and all the digits found in another array). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!
A year ago I wanted regular expressions that were more easy to read and named the captures, so I made the following addition to String (should maybe not be there, but it was convenient at the time):
scan2.rb:
class String
#Works as scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parantheses.
#Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
#the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
#Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurences of #STRING# will work as #{STRING}
#but is needed for the method to see the names to be used as indices.
def scan2(regexp2_str, mark='#')
regexp = regexp2_str.to_re(mark) #Evaluates the strings. Note: Must be reachable from here!
hash_indices_array = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten #Look for string variable names within (#VAR#) or # replaced by <mark>
match_array = self.scan(regexp)
#Save matches in hash indexed by string variable names:
match_hash = Hash.new
match_array.flatten.each_with_index do |m, i|
match_hash[hash_indices_array[i].to_sym] = m
end
return match_hash
end
def to_re(mark='#')
re = /#{mark}(.*?)#{mark}/
return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE) #Evaluates the strings, creates RE. Note: Variables must be reachable from here!
end
end
Example usage (irb1.9):
> load 'scan2.rb'
> AREA = '\d+'
> PHONE = '\d+'
> NAME = '\w+'
> "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}
Notes:
Of course it would have been more elegant to put the patterns (e.g. AREA, PHONE...) in a hash and add this hash with patterns to the arguments of scan2.
Piggybacking off of Mark Hubbart's answer, I added the following monkey-patch:
class ::Regexp
def match_all(str)
matches = []
str.scan(self) { matches << $~ }
matches
end
end
which can be used as /(?<letter>\w)/.match_all('word'), and returns:
[#<MatchData "w" letter:"w">, #<MatchData "o" letter:"o">, #<MatchData "r" letter:"r">, #<MatchData "d" letter:"d">]
This relies on, as others have said, the use of $~ in the scan block for the match data.
I like the match_all given by John, but I think it has an error.
The line:
match_datas << md
works if there are no captures () in the regex.
This code gives the whole line up to and including the pattern matched/captured by the regex. (The [0] part of MatchData) If the regex has capture (), then this result is probably not what the user (me) wants in the eventual output.
I think in the case where there are captures () in regex, the correct code should be:
match_datas << md[1]
The eventual output of match_datas will be an array of pattern capture matches starting from match_datas[0]. This is not quite what may be expected if a normal MatchData is wanted which includes a match_datas[0] value which is the whole matched substring followed by match_datas[1], match_datas[[2],.. which are the captures (if any) in the regex pattern.
Things are complex - which may be why match_all was not included in native MatchData.
Related
Is there a quick way to find every match of a regular expression in Ruby? I've looked through the Regex object in the Ruby STL and searched on Google to no avail.
Using scan should do the trick:
string.scan(/regex/)
To find all the matching strings, use String's scan method.
str = "A 54mpl3 string w1th 7 numb3rs scatter36 ar0und"
str.scan(/\d+/)
#=> ["54", "3", "1", "7", "3", "36", "0"]
If you want, MatchData, which is the type of the object returned by the Regexp match method, use:
str.to_enum(:scan, /\d+/).map { Regexp.last_match }
#=> [#<MatchData "54">, #<MatchData "3">, #<MatchData "1">, #<MatchData "7">, #<MatchData "3">, #<MatchData "36">, #<MatchData "0">]
The benefit of using MatchData is that you can use methods like offset:
match_datas = str.to_enum(:scan, /\d+/).map { Regexp.last_match }
match_datas[0].offset(0)
#=> [2, 4]
match_datas[1].offset(0)
#=> [7, 8]
See these questions if you'd like to know more:
"How do I get the match data for all occurrences of a Ruby regular expression in a string?"
"Ruby regular expression matching enumerator with named capture support"
"How to find out the starting point for each match in ruby"
Reading about special variables $&, $', $1, $2 in Ruby will be helpful too.
if you have a regexp with groups:
str="A 54mpl3 string w1th 7 numbers scatter3r ar0und"
re=/(\d+)[m-t]/
you can use String's scan method to find matching groups:
str.scan re
#> [["54"], ["1"], ["3"]]
To find the matching pattern:
str.to_enum(:scan,re).map {$&}
#> ["54m", "1t", "3r"]
You can use string.scan(your_regex).flatten. If your regex contains groups, it will return in a single plain array.
string = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
your_regex = /(\d+)[m-t]/
string.scan(your_regex).flatten
=> ["54", "1", "3"]
Regex can be a named group as well.
string = 'group_photo.jpg'
regex = /\A(?<name>.*)\.(?<ext>.*)\z/
string.scan(regex).flatten
You can also use gsub, it's just one more way if you want MatchData.
str.gsub(/\d/).map{ Regexp.last_match }
If you have capture groups () inside the regex for other purposes, the proposed solutions with String#scan and String#match are problematic:
String#scan only get what is inside the capture groups;
String#match only get the first match, rejecting all the others;
String#matches (proposed function) get all the matches.
On this case, we need a solution to match the regex without considering the capture groups.
String#matches
With the Refinements you can monkey patch the String class, implement the String#matches and this method will be available inside the scope of the class that is using the refinement. It is an incredible way to Monkey Patch classes on Ruby.
Setup
/lib/refinements/string_matches.rb
# This module add a String refinement to enable multiple String#match()s
# 1. `String#scan` only get what is inside the capture groups (inside the parens)
# 2. `String#match` only get the first match
# 3. `String#matches` (proposed function) get all the matches
module StringMatches
refine String do
def matches(regex)
scan(/(?<matching>#{regex})/).flatten
end
end
end
Used: named capture groups
Usage
rails c
> require 'refinements/string_matches'
> using StringMatches
> 'function(1, 2, 3) + function(4, 5, 6)'.matches(/function\((\d), (\d), (\d)\)/)
=> ["function(1, 2, 3)", "function(4, 5, 6)"]
> 'function(1, 2, 3) + function(4, 5, 6)'.scan(/function\((\d), (\d), (\d)\)/)
=> [["1", "2", "3"], ["4", "5", "6"]]
> 'function(1, 2, 3) + function(4, 5, 6)'.match(/function\((\d), (\d), (\d)\)/)[0]
=> "function(1, 2, 3)"
Return an array of MatchData objects
#scan is very limited--only returns a simple array of strings!
Far more powerful/flexible for us to get an array of MatchData objects.
I'll provide two approaches (using same logic), one using a PORO and one using a monkey patch:
PORO:
class MatchAll
def initialize(string, pattern)
raise ArgumentError, 'must pass a String' unless string.is_a?(String)
raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)
#string = string
#pattern = pattern
#matches = []
end
def match_all
recursive_match
end
private
def recursive_match(prev_match = nil)
index = prev_match.nil? ? 0 : prev_match.offset(0)[1]
matching_item = #string.match(#pattern, index)
return #matches unless matching_item.present?
#matches << matching_item
recursive_match(matching_item)
end
end
USAGE:
test_string = 'a green frog jumped on a green lilypad'
MatchAll.new(test_string, /green/).match_all
=> [#<MatchData "green", #<MatchData "green"]
Monkey patch
I don't typically condone monkey-patching, but in this case:
we're doing it the right way by "quarantining" our patch into its own module
I prefer this approach because 'string'.match_all(/pattern/) is more intuitive (and looks a lot nicer) than MatchAll.new('string', /pattern/).match_all
module RubyCoreExtensions
module String
module MatchAll
def match_all(pattern)
raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)
recursive_match(pattern)
end
private
def recursive_match(pattern, matches = [], prev_match = nil)
index = prev_match.nil? ? 0 : prev_match.offset(0)[1]
matching_item = self.match(pattern, index)
return matches unless matching_item.present?
matches << matching_item
recursive_match(pattern, matches, matching_item)
end
end
end
end
I recommend creating a new file and putting the patch (assuming you're using Rails) there /lib/ruby_core_extensions/string/match_all.rb
To use our patch we need to make it available:
# within application.rb
require './lib/ruby_core_extensions/string/match_all.rb'
Then be sure to include it in the String class (you could put this wherever you want; but for example, right under the require statement we just wrote above. After you include it once, it will be available everywhere, even outside the class where you included it).
String.include RubyCoreExtensions::String::MatchAll
USAGE: And now when you use #match_all you get results like:
test_string = 'hello foo, what foo are you going to foo today?'
test_string.match_all /foo/
=> [#<MatchData "foo", #<MatchData "foo", #<MatchData "foo"]
test_string.match_all /hello/
=> [#<MatchData "hello"]
test_string.match_all /none/
=> []
I find this particularly useful when I want to match multiple occurrences, and then get useful information about each occurrence, such as which index the occurrence starts and ends (e.g. match.offset(0) => [first_index, last_index])
I'm attempting to write a function that takes a string and returns it with all vowels removed. Below is my code.
def vowel(str)
result = ""
new = str.split(" ")
i = 0
while i < new.length
if new[i] == "a"
i = i + 1
elsif new[i] != "a"
result = new[i] + result
end
i = i + 1
end
return result
end
When I run the code, it returns the exact string that I entered for (str). For example, if I enter "apple", it returns "apple".
This was my original code. It had the same result.
def vowel(str)
result = ""
new = str.split(" ")
i = 0
while i < new.length
if new[i] != "a"
result = new[i] + result
end
i = i + 1
end
return result
end
I need to know what I am doing wrong using this methodology. What am I doing wrong?
Finding the bug
Let's see what's wrong with your original code by executing your method's code in IRB:
$ irb
irb(main):001:0> str = "apple"
#=> "apple"
irb(main):002:0> new = str.split(" ")
#=> ["apple"]
Bingo! ["apple"] is not the expected result. What does the documentation for String#split say?
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array of these substrings.
If pattern is a String, then its contents are used as the delimiter when splitting str. If pattern is a single space, str is split on whitespace, with leading whitespace and runs of contiguous whitespace characters ignored.
Our pattern is a single space, so split returns an array of words. This is definitely not what we want. To get the desired result, i.e. an array of characters, we could pass an empty string as the pattern:
irb(main):003:0> new = str.split("")
#=> ["a", "p", "p", "l", "e"]
"split on empty string" feels a bit hacky and indeed there's another method that does exactly what we want: String#chars
chars → an_array
Returns an array of characters in str. This is a shorthand for str.each_char.to_a.
Let's give it a try:
irb(main):004:0> new = str.chars
#=> ["a", "p", "p", "l", "e"]
Perfect, just as advertised.
Another bug
With the new method in place, your code still doesn't return the expected result (I'm going to omit the IRB prompt from now on):
vowel("apple") #=> "elpp"
This is because
result = new[i] + result
prepends the character to the result string. To append it, we have to write
result = result + new[i]
Or even better, use the append method String#<<:
result << new[i]
Let's try it:
def vowel(str)
result = ""
new = str.chars
i = 0
while i < new.length
if new[i] != "a"
result << new[i]
end
i = i + 1
end
return result
end
vowel("apple") #=> "pple"
That looks good, "a" has been removed ("e" is still there, because you only check for "a").
Now for some refactoring.
Removing the explicit loop counter
Instead of a while loop with an explicit loop counter, it's more idiomatic to use something like Integer#times:
new.length.times do |i|
# ...
end
or Range#each:
(0...new.length).each do |i|
# ...
end
or Array#each_index:
new.each_index do |i|
# ...
end
Let's apply the latter:
def vowel(str)
result = ""
new = str.chars
new.each_index do |i|
if new[i] != "a"
result << new[i]
end
end
return result
end
Much better. We don't have to worry about initializing the loop counter (i = 0) or incrementing it (i = i + 1) any more.
Avoiding character indices
Instead of iterating over the character indices via each_index:
new.each_index do |i|
if new[i] != "a"
result << new[i]
end
end
we can iterate over the characters themselves using Array#each:
new.each do |char|
if char != "a"
result << char
end
end
Removing the character array
We don't even have to create the new character array. Remember the documentation for chars?
This is a shorthand for str.each_char.to_a.
String#each_char passes each character to the given block:
def vowel(str)
result = ""
str.each_char do |char|
if char != "a"
result << char
end
end
return result
end
The return keyword is optional. We could just write result instead of return result, because a method's return value is the last expression that was evaluated.
Removing the explicit string
Ruby even allows you to pass an object into the loop using Enumerator#with_object, thus eliminating the explicit result string:
def vowel(str)
str.each_char.with_object("") do |char, result|
if char != "a"
result << char
end
end
end
with_object passes "" into the block as result and returns it (after the characters have been appended within the block). It is also the last expression in the method, i.e. its return value.
You could also use if as a modifier, i.e.:
result << char if char != "a"
Alternatives
There are many different ways to remove characters from a string.
Another approach is to filter out the vowel characters using Enumerable#reject (it returns a new array containing the remaining characters) and then join the characters (see Nathan's answer for a version to remove all vowels):
def vowel(str)
str.each_char.reject { |char| char == "a" }.join
end
For basic operations like string manipulation however, Ruby usually already provides a method. Check out the other answers for built-in alternatives:
str.delete('aeiouAEIOU') as shown in Gagan Gami's answer
str.tr('aeiouAEIOU', '') as shown in Cary Swoveland's answer
str.gsub(/[aeiou]/i, '') as shown in Avinash Raj's answer
Naming things
Cary Swoveland pointed out that vowel is not the best name for your method. Choose the names for your methods, variables and classes carefully. It's desirable to have a short and succinct method name, but it should also communicate its intent.
vowel(str) obviously has something to do with vowels, but it's not clear what it is. Does it return a vowel or all vowels from str? Does it check whether str is a vowel or contains a vowel?
remove_vowels or delete_vowels would probably be a better choice.
Same for variables: new is an array of characters. Why not call it characters (or chars if space is an issue)?
Bottom line: read the fine manual and get to know your tools. Most of the time, an IRB session is all you need to debug your code.
I should use regex.
str.gsub(/[aeiou]/i, "")
> string= "This Is my sAmple tExt to removE vowels"
#=> "This Is my sAmple tExt to removE vowels"
> string.delete 'aeiouAEIOU'
#=> "Ths s my smpl txt t rmv vwls"
You can create a method like this:
def remove_vowel(str)
result = str.delete 'aeiouAEIOU'
return result
end
remove_vowel("Hello World, This is my sample text")
# output : "Hll Wrld, Ths s my smpl txt"
Live Demo
Assuming you're trying to learn about the basics of programming, rather than finding the quickest one-liner to do this (which would be to use a regular expression as Avinash has said), you have a number of problems with your code you need to change.
new = str.split(" ")
This line is likely the culprit, because it splits the string based on spaces. So your input string would have to be "a p p l e" to have the effect you're looking for.
new = str.split("")
You should also remove the duplicate i = i+1 once you've changed that.
As others have already identified the problems with the OP's code, I will merely suggest an alternative; namely, you could use String#tr:
"Now is the time for all good people...".tr('aeiouAEIOU', '')
#=> "Nw s th tm fr ll gd ppl..."
If regex is not allowed, you can do it this way:
def remove_vowels(string)
string.split("").delete_if { |letter| %w[a e i o u].include? letter }.join
end
I need to get the expected output in ruby by using any method like scan or match.
Input string:
"http://test.com?t&r12=1&r122=1&r1=1&r124=1"
"http://test.com?t&r12=1&r124=1"
Expected:
r12=1,r122=1, r1=1, r124=1
r12=1,r124=1
How can I get the expected output using regex?
Use regex /r\d+=\d+/:
"http://test.com?t&r12=1&r122=1&r1=1&r124=1".scan(/r\d+=\d+/)
# => ["r12=1", "r122=1", "r1=1", "r124=1"]
"http://test.com?t&r12=1&r124=1".scan(/r\d+=\d+/)
# => ["r12=1", "r124=1"]
You can use join to get a string output. Here:
"http://test.com?t&r12=1&r122=1&r1=1&r124=1".scan(/r\d+=\d+/).join(',')
# => "r12=1,r122=1,r1=1,r124=1"
Update
If the URL contains other parameters that may include r in end, the regex can be made stricter:
a = []
"http://test.com?r1=2&r12=1&r122=1&r1=1&r124=1&ar1=2&tr2=3&xy4=5".scan(/(&|\?)(r+\d+=\d+)/) {|x,y| a << y}
a.join(',')
# => "r12=1,r122=1,r1=1,r124=1"
While input strings are urls with queries, I would safeguard myself from the false positives:
input = "http://test.com?t&r12=1&r122=1&r1=1&r124=1"
query_params = input.split('?').last.split('&')
#⇒ ["t", "r12=1", "r122=1", "r1=1", "r124=1"]
r_params = query_params.select { |e| e =~ /\Ar\d+=\d+/ }
#⇒ ["r12=1", "r122=1", "r1=1", "r124=1"]
r_params.join(',')
#⇒ "r12=1,r122=1,r1=1,r124=1"
It’s safer than just scan the original input for any regexp.
If you really need to do it with regex correctly, you'll need to use a regex like this:
puts "http://test.com?t&r12=1&r122=1&r1=1&r124=1".scan(/(?:http.*?\?t|(?<!^)\G)\&*(\br\d*=\d*)(?=.*$)/i).join(',')
puts "http://test.com?t&r12=1&r124=1".scan(/(?:http.*?\?t|(?<!^)\G)\&*(\br\d*=\d*)(?=.*$)/i).join(',')
Sample program output:
r12=1,r122=1,r1=1,r124=1
r12=1,r124=1
What would be a good way to remove the hash-tags from a string and then join the hash-tag words together in another string separated by commas:
'Some interesting tweet #hash #tags'
The result would be:
'Some interesting tweet'
And:
'hash,tags'
str = 'Some interesting tweet #hash #tags'
a,b = str.split.partition{|e| e.start_with?("#")}
# => [["#hash", "#tags"], ["Some", "interesting", "tweet"]]
a
# => ["#hash", "#tags"]
b
# => ["Some", "interesting", "tweet"]
a.join(",").delete("#")
# => "hash,tags"
b.join(" ")
# => "Some interesting tweet"
An alternate path is to use scan then remove the hash tags:
tweet = 'Some interesting tweet #hash #tags'
tags = tweet.scan(/#\w+/).uniq
tweet = tweet.gsub(/(?:#{ Regexp.union(tags).source })\b/, '').strip.squeeze(' ') # => "Some interesting tweet"
tags.join(',').tr('#', '') # => "hash,tags"
Dissecting it shows:
tweet.scan(/#\w+/) returns an array ["#hash", "#tags"].
uniq would remove any duplicated tags.
Regexp.union(tags) returns (?-mix:\#hash|\#tags).
Regexp.union(tags).source returns \#hash|\#tags. We don't want the pattern-flags at the start, so using source fixes that.
/(?:#{ Regexp.union(tags).source })\b/ returns the regular expression /(?:\#hash|\#tags)\b/.
tr is an extremely fast way to translate one character or characters to another, or strip them.
The final regex isn't the most optimized that can be generated. I'd actually write code to generate:
/#(?:hash|tags)\b/
but how to do that is left as an exercise for you. And, for short strings it won't make much difference as far as speed goes.
This has an array of hash that starts out empty
It then splits the hash tag based off spaces
It then looks for a hash tag and grabs the rest of the word
It then stores it into the array
array_of_hashetags = []
array_of_words = []
str = "Some interesting tweet #hash #tags"
str.split.each do |x|
if /\#\w+/ =~ x
array_of_hashetags << x.gsub(/\#/, "")
else
array_of_words << x
end
end
Hope the helps
I have a function that takes an input string and then runs the string through several regular expressions until it finds a match. Once a match is found, I return an output that is a function of the original string and the match. So in ruby:
str = "my very long original string ... millions of characters"
case str
when regex1
do_something1(str,$1)
when regex2
do_something2(str,$1)
when regex3
do_something3(str,$1)
...
when regex100000
do_something100000(str,$1)
else
do_something_else(str)
end
Now, is Ruby actually optimizing this switch loop so that str only gets traversed once? Assuming it isn't, then this functionality could be performed much more efficiently with one big long regular expression that had embedded callbacks. Something like this:
/(?<callback:do_something1>regex1)|
(?<callback:do_something2>regex2)|
(?<callback:do_something3>regex3)|
...
(?<callback:do_something100000>regex100000)/
Is there any technology that does this?
If Ruby 1.9, and if you use named groups in your regular expressions, then you can combine all of the regular expressions into one with a bit of trickery. Here's the class that does the heavy lifting:
class BigPatternMatcher
def initialize(patterns_and_functions, no_match_function)
#regex = make_big_regex(patterns_and_functions)
#no_match_function = no_match_function
end
def match(s, context)
match = #regex.match(s)
context.send(function_name(match), match)
end
private
FUNC_GROUP_PREFIX = "func_"
FUNC_GROUP_REGEX = /^#{FUNC_GROUP_PREFIX}(.*)$/
def function_name(match)
if match
match.names.grep(FUNC_GROUP_REGEX).find do |name|
match[name]
end[FUNC_GROUP_REGEX, 1]
else
#no_match_function
end
end
def make_big_regex(patterns_and_functions)
patterns = patterns_and_functions.map do |pattern, function|
/(?<#{FUNC_GROUP_PREFIX}#{function}>#{pattern.source})/
end
Regexp.union(patterns)
end
end
We'll come back to how this works. To use it, you'll need a list of regular expressions and the name of the function that should be invoked for each one. Make sure to use named groups only:
PATTERNS_AND_FUNCTIONS = [
[/ABC(?<value>\d+)/, :foo],
[/DEF(?<value>\d+)/, :bar],
]
And the functions, including one that gets called when there's no match:
def foo(match)
p ["foo", match[:value]]
end
def bar(match)
p ["bar", match[:value]]
end
def default(match)
p ["default"]
end
And, finally, here's how it's used. BigPatternMatcher#match takes the string to match, and the object on which the function should be called:
matcher = BigPatternMatcher.new(PATTERNS_AND_FUNCTIONS, :default)
matcher.match('blah ABC1 blah', self) # => ["foo", "1"
matcher.match('blah DEF2 blah', self) # => ["bar", "2"]
matcher.match('blah blah', self) # => ["default"]
See below the fold for the trickery that makes it work.
BigPatternMatcher#make_big_regex combines all of the regular expressions into one, each surrounded in parentheses and separated by |, e.g.
/(ABC(?<value>\\d+))|(DEF(?<value>\\d+))/
That isn't enough, though. We need some way, when one of the sub-expression matches, to identify which one matched, and therefore which function to call. To do that, we'll make each sub-expression its own named group, with the name based on the function that should be called:
/(?<func_foo>ABC(?<value>\\d+))|(?<func_bar>DEF(?<value>\\d+))/
Now, let's see how we get from a match to calling a function. Given the string:
"blah ABC1 blah"
Then the match group is:
#<MatchData "ABC1" func_foo:"ABC1" value:"1" func_bar:nil value:nil>
To figure out which function to call, we just have to find the match with a name that starts with "func_" and which has a non-nil value. The part of the group's name after "func_" names the function to call.
Measuring the performance of this against the straightforward technique in your question is an exercise for the reader.