Extracting URLs (to array) in Ruby - ruby

Good afternoon,
I'm learning about using RegEx's in Ruby, and have hit a point where I need some assistance.
I am trying to extract 0 to many URLs from a string.
This is the code I'm using:
sStrings = ["hello world: http://www.google.com", "There is only one url in this string http://yahoo.com . Did you get that?", "The first URL in this string is http://www.bing.com and the second is http://digg.com","This one is more complicated http://is.gd/12345 http://is.gd/4567?q=1", "This string contains no urls"]
sStrings.each do |s|
x = s.scan(/((http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.[\w-]*)?)/ix)
x.each do |url|
puts url
end
end
This is what is returned:
http://www.google.com
http
.google
nil
nil
http://yahoo.com
http
nil
nil
nil
http://www.bing.com
http
.bing
nil
nil
http://digg.com
http
nil
nil
nil
http://is.gd/12345
http
nil
/12345
nil
http://is.gd/4567
http
nil
/4567
nil
What is the best way to extract only the full URLs and not the parts of the RegEx?

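You could use non-capturing groups (?:...) instead of capturing groups (...). When a pattern contains capture groups, scan returns the captured groups for each match rather than the whole match, which is why the fragments show up in your output.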
I see that you are doing this in order to learn Regex, but in case you really want to extract URLs from a String, take a look at URI.extract, which extracts URIs from a String. (require "uri" in order to use it)
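As a minimal sketch of the first suggestion, here is the pattern from the question with every (...) turned into (?:...). The constant name URL_PATTERN and the helper extract_urls are made up for illustration; the regex itself is otherwise unchanged, so it keeps the original's limitations (for instance, it stops at the ? of a query string).

```ruby
# Same pattern as in the question, with each (...) replaced by (?:...),
# so scan has no capture groups to report and returns whole matches.
URL_PATTERN = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?:[0-9]{1,5})?\/.[\w-]*)?/ix

# extract_urls is a hypothetical helper name, not part of any library.
def extract_urls(text)
  text.scan(URL_PATTERN)
end

p extract_urls("The first URL in this string is http://www.bing.com and the second is http://digg.com")
# => ["http://www.bing.com", "http://digg.com"]
p extract_urls("This string contains no urls")
# => []
```

With no groups left in the pattern, each element of the returned array is a complete URL string, and a string without URLs yields an empty array.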

You can create a non-capturing group using (?:SUB_PATTERN). Here's an illustration, with some additional simplifications thrown in. Also, since you're using the /x option, take advantage of it by laying out your regex in a readable way.
sStrings = [
"hello world: http://www.google.com",
"There is only one url in this string http://yahoo.com . Did you get that?",
"... is http://www.bing.com and the second is http://digg.com",
"This one is more complicated http://is.gd/12345 http://is.gd/4567?q=1",
"This string contains no urls",
]
sStrings.each do |s|
x = s.scan(/
https?:\/\/
\w+
(?: [.-]\w+ )*
(?:
\/
[0-9]{1,5}
\?
[\w=]*
)?
/ix)
p x
end
This is fine for learning, but don't really try to match URLs this way. There are tools for that.

Related

Swap part of a string in Ruby

What's the easiest way in Ruby to swap part of a string for another value? Say I have an email address and I want to check it against two domains, but I don't know which one I'll get as input. The app I'm building should work with both the @gmail.com and @googlemail.com domains.
Example:
swap_string 'user@gmail.com' # => user@googlemail.com
swap_string 'user@googlemail.com' # => user@gmail.com
If you're looking to substitute part of a string with something else, gsub works quite well.
It lets you match part of a string with a regex and substitute just that part with another string. In place of the regex, you can also pass a plain string.
Example:
"user@gmail.com".gsub(/@gmail/, '@googlemail')
returns
"user@googlemail.com"
In my example I used @gmail and @googlemail instead of just gmail and googlemail. This makes sure the match is the domain and not an account with gmail in its name. It's unlikely, but it could happen.
Don't match the .com either, as the top-level domain can differ depending on where the user's email is hosted.
Assuming googlemail.com and gmail.com are the only two possibilities, you can use sub to replace a pattern with a given replacement:
def swap_string(str)
  # The leading @ and the escaped dot make sure only the domain is tested.
  if str =~ /@gmail\.com$/
    str.sub("gmail.com", "googlemail.com")
  else
    str.sub("googlemail.com", "gmail.com")
  end
end
swap_string 'user@gmail.com'
# => "user@googlemail.com"
swap_string 'user@googlemail.com'
# => "user@gmail.com"
You can try Ruby's gsub:
e.g.:
"user@gmail.com".gsub("gmail.com", "googlemail.com")
Since you need a method that takes a string parameter, this should do:
def swap_mails(str)
  if str =~ /@gmail\.com$/
    str.sub('gmail.com', 'googlemail.com')
  else
    str.sub('googlemail.com', 'gmail.com')
  end
end
swap_mails "vgmail@gmail.com" # => "vgmail@googlemail.com"
swap_mails "vgmail@googlemail.com" # => "vgmail@gmail.com"
My addition:
def swap_domain str
  str[/.+@/] + [ 'gmail.com', 'googlemail.com' ].detect do |d|
    d != str.split('@')[1]
  end
end
swap_domain 'user@gmail.com'
#=> user@googlemail.com
swap_domain 'user@googlemail.com'
#=> user@gmail.com
And this is bad code, imo.
String has a neat trick up its sleeve in the form of String#[]:
def swap_string(string, lookups = {})
string.tap do |s|
lookups.each { |find, replace| s[find] = replace and break if s[find] }
end
end
# Example Usage
lookups = {"googlemail.com"=>"gmail.com", "gmail.com"=>"googlemail.com"}
swap_string("user@gmail.com", lookups) # => user@googlemail.com
swap_string("user@googlemail.com", lookups) # => user@gmail.com
Allowing lookups to be passed to your method makes it more reusable but you could just as easily have that hash inside of the method itself.
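For comparison, here is a small sketch that avoids pattern matching on the whole address: it splits on the last @ and looks the domain up in a hash. The names DOMAIN_SWAPS and swap_mail_domain are invented for this example, and it assumes addresses use the usual @ separator.

```ruby
# Hypothetical lookup table; extend it if more aliases are needed.
DOMAIN_SWAPS = {
  "gmail.com"      => "googlemail.com",
  "googlemail.com" => "gmail.com"
}.freeze

# swap_mail_domain is an illustrative name, not an existing API.
def swap_mail_domain(email)
  # rpartition splits on the LAST "@", so an "@" in the local part is kept.
  local, at, domain = email.rpartition("@")
  return email if at.empty? # no "@" at all: leave the input untouched

  # Unknown domains fall back to themselves and pass through unchanged.
  "#{local}@#{DOMAIN_SWAPS.fetch(domain, domain)}"
end

puts swap_mail_domain("user@gmail.com")      # user@googlemail.com
puts swap_mail_domain("user@googlemail.com") # user@gmail.com
puts swap_mail_domain("user@example.org")    # user@example.org
```

Because only the part after the last @ is compared, a local part that happens to contain gmail.com is never rewritten.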

Ruby Regex: negative lookahead with unlimited matching before

I'm trying to be able to match a phrase like:
I request a single car
// or
I request a single person
// or
I request a single coconut tree
but not
I request a single car by id
// nor
I request a single person by id with friends
// nor
I request a single coconut tree by id with coconuts
Something like this works:
/^I request a single person(?!\s+by id.*)/
for strings like this:
I request a single person
I request a single person with friends
But when I replace the person with a matcher (.*) or add the $ to the end, it stops working:
/^I request a single (.*)(?!\s+by id.*)$/
How can I accomplish this but still match in the first match everything before the negative lookahead?
There's no ) to match ( in (.*\). Perhaps that's a typo, since you tested. After fixing that, however, there's still a problem:
"I request a single car by idea" =~ /^I request a single (?!.*by id.*)(.*)$/
#=> nil
Presumably, that should be a match. If you only want to know if there's a match, you can use:
r = /^I request a single (?!.+?by id\b)/
Then:
"I request a single car by idea" =~ r #=> 0
"I request a single person by id with friends" =~ r #=> nil
\b matches a word break, which includes the case where the previous character is the last one in the string. Notice that if you are just checking for a match, there's no need to include anything beyond the negative lookahead.
If you want to return whatever follows "single " when there's a match, use:
r = /^I request a single (?!.+?by id\b)(.*)/
"I request a single coconut tree"[r,1] #=> "coconut tree"
"I request a single person by id with friends"[r,1] #=> nil
OK, I think I just got it, right after asking the question. Instead of creating a lookahead after the thing I want to capture, I create a lookahead before the thing I want to capture, like so:
/^I request a single (?!.*by id.*)(.*[^\s])?\s*$/
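The reason the original attempt fails is that a greedy (.*) consumes the rest of the line, so the lookahead placed after it is evaluated at the end of the string, where nothing follows and the negative assertion trivially succeeds. Putting the lookahead before the capture lets it inspect the whole remainder first. A minimal sketch combining this with the \b suggestion from the other answer (SINGLE_REQUEST and requested_thing are made-up names):

```ruby
# The negative lookahead runs BEFORE the capture, so it sees the whole
# remainder of the phrase; \b keeps "by idea" from counting as "by id".
SINGLE_REQUEST = /\AI request a single (?!.*\bby id\b)(.+?)\s*\z/

# requested_thing is a hypothetical helper: it returns the captured
# noun phrase, or nil when the phrase does not match.
def requested_thing(phrase)
  phrase[SINGLE_REQUEST, 1]
end

p requested_thing("I request a single coconut tree")              # => "coconut tree"
p requested_thing("I request a single car by idea")               # => "car by idea"
p requested_thing("I request a single person by id with friends") # => nil
```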

Optimising ruby regexp -- lots of match groups

I'm working on a Ruby-based lexer. To improve performance, I joined all the tokens' regexps into one big regexp with named match groups. The resulting regexp looks like:
/\A(?<__anonymous_-1038694222803470993>(?-mix:\n+))|\A(?<__anonymous_-1394418499721420065>(?-mix:\/\/[\A\n]*))|\A(?<__anonymous_3077187815313752157>(?-mix:include\s+"[\A"]+"))|\A(?<LET>(?-mix:let\s))|\A(?<IN>(?-mix:in\s))|\A(?<CLASS>(?-mix:class\s))|\A(?<DEF>(?-mix:def\s))|\A(?<DEFM>(?-mix:defm\s))|\A(?<MULTICLASS>(?-mix:multiclass\s))|\A(?<FUNCNAME>(?-mix:![a-zA-Z_][a-zA-Z0-9_]*))|\A(?<ID>(?-mix:[a-zA-Z_][a-zA-Z0-9_]*))|\A(?<STRING>(?-mix:"[\A"]*"))|\A(?<NUMBER>(?-mix:[0-9]+))/
I'm matching it to my string producing a MatchData where exactly one token is parsed:
bigregex =~ "\n ... garbage"
puts $~.inspect
Which outputs
#<MatchData
"\n"
__anonymous_-1038694222803470993:"\n"
__anonymous_-1394418499721420065:nil
__anonymous_3077187815313752157:nil
LET:nil
IN:nil
CLASS:nil
DEF:nil
DEFM:nil
MULTICLASS:nil
FUNCNAME:nil
ID:nil
STRING:nil
NUMBER:nil>
So, the regex actually matched the "\n" part. Now I need to figure out which match group it belongs to (it's clearly visible from the #inspect output that it's __anonymous_-1038694222803470993, but I need to get it programmatically).
I could not find any option other than iterating over #names:
m.names.each do |n|
if m[n]
type = n.to_sym
resolved_type = (n.start_with?('__anonymous_') ? nil : type)
val = m[n]
break
end
end
which verifies that the match group did have a match.
The problem here is that it's slow: I spend about 10% of the time in the loop, and another 8% grabbing @input[@pos..-1] to make sure that \A works as expected and matches at the start of the string (I do not discard input, I just shift @pos in it).
You can check the full code at GH repo.
Any ideas on how to make it at least a bit faster? Is there any option to figure the "successful" match group easier?
You can do this using the MatchData methods .captures() and .names():
matching_string = "\n ...garbage" # or whatever this really is in your code
@input = matching_string.match bigregex # bigregex = your regex
arr = @input.captures
the_name_you_want = nil # declared outside the block so it survives the loop
arr.each_with_index do |value, index|
  if not value.nil?
    the_name_you_want = @input.names[index]
  end
end
Or if you expect multiple successful values, you could do:
success_names_arr = []
success_names_arr.push(@input.names[index]) # within the above loop
This is pretty similar to your original idea, but if you're looking for efficiency, the .captures() method may help with that.
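On the slicing cost mentioned in the question, one option worth measuring is to anchor with \G instead of \A and pass the position to Regexp#match, which avoids building a substring for every token. The following is only a sketch with a cut-down, hypothetical token set (TOKEN_RE and next_token are invented names), not the lexer's actual code, and whether it is faster in practice would need profiling.

```ruby
# \G anchors the match at the position given to Regexp#match, so the
# input string never has to be sliced. The token set is illustrative.
TOKEN_RE = /\G(?:(?<NEWLINE>\n+)|(?<LET>let\s)|(?<ID>[a-zA-Z_][a-zA-Z0-9_]*)|(?<NUMBER>[0-9]+))/

# Returns [type, text, next_position], or nil if no token matches at pos.
def next_token(input, pos)
  m = TOKEN_RE.match(input, pos)
  return nil unless m

  # Only one alternative can succeed, so the first non-nil named group
  # identifies the token type.
  name = m.names.find { |n| m[n] }
  [name.to_sym, m[name], m.end(0)]
end

p next_token("let x", 0) # => [:LET, "let ", 4]
p next_token("let x", 4) # => [:ID, "x", 5]
```

The third element is the offset just past the token in the original string, so it can be fed straight back in as the next pos.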
I may have misunderstood this completely, but I'm assuming that all but one token is nil and the non-nil one is the one you're after?
If so then, depending on the flavour of regex you're using, you could use a negative lookahead to check for a non-nil value
([^\n:]+:(?!nil)[^\n\>]+)
This will match the whole token ie NAME:value.

Ruby: check if object is nil

def parse( line )
_, remote_addr, status, request, size, referrer, http_user_agent, http_x_forwarded_for = /^([^\s]+) - (\d+) \"(.+)\" (\d+) \"(.*)\" \"([^\"]*)\" \"(.*)\"/.match(line).to_a
print line
print request
if request && request != nil
_, referrer_host, referrer_url = /^http[s]?:\/\/([^\/]+)(\/.*)/.match(referrer).to_a if referrer
method, full_url, _ = request.split(' ')
in parse: private method 'split' called for nil:NilClass (NoMethodError)
So, as I understand it, split is being called not on a string but on nil.
This code parses a web server log, but I can't understand why request is nil (which, as I understand it, is Ruby's null).
Did some of the subpatterns in the regex fail to match? So is it the web server's fault for sometimes generating malformed log lines?
By the way, how do I write to a file in Ruby? I can't read the output properly in the cmd window under Windows.
You seem to have a few questions here, so I'll take a stab at what seems to be the main one:
If you want to see if something is nil, just use .nil? - so in your example, you can just say request.nil?, which returns true if it is nil and false otherwise.
Ruby 2.3.0 added a safe navigation operator (&.) that checks for nil before calling a method.
request&.split(' ')
This is functionally* equivalent to
!request.nil? && request.split(' ')
*(They are slightly different. When request is nil, the top expression evaluates to nil, while the bottom expression evaluates to false.)
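Applied to the split from the question, a minimal sketch might look like this. The helper name request_parts is made up, and it assumes Ruby 2.3 or later for the &. operator.

```ruby
# request_parts is a hypothetical helper. When request is nil (for
# example because the log-line regex did not match), &. skips the call
# to split and both variables simply end up nil.
def request_parts(request)
  http_method, full_url = request&.split(" ")
  [http_method, full_url]
end

p request_parts("GET /index.html HTTP/1.1") # => ["GET", "/index.html"]
p request_parts(nil)                        # => [nil, nil]
```

This avoids the NoMethodError, but a nil request still means the line did not match the pattern, so it may be worth logging such lines rather than silently skipping them.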
To write to a file:
File.open("file.txt", "w") do |file|
file.puts "whatever"
end
As I write in a comment above - you didn't say what is nil. Also, check whether referrer contains what you think it contains. EDIT I see it's request that is nil. Obviously, regexp trouble.
Use rubular.com to easily test your regexp. Copy a line from your input file into "Your test string", and your regexp into "Your regular expression", and tweak until you get a highlight in "Match result".
Also, what are "wrong logging strings"? If we're talking Apache, log format is configurable.

How could I check to see if a word exists in a string, and return false if it doesn't, in ruby?

Say I have a string str = "Things to do: eat and sleep."
How could I check if "do: " exists in str, case insensitive?
Like this:
puts "yes" if str =~ /do:/i
To return a boolean value (from a method, presumably), compare the result of the match to nil:
def has_do(str)
(str =~ /do:/i) != nil
end
Or, if you don’t like the != nil then you can use !~ instead of =~ and negate the result:
def has_do(str)
not str !~ /do:/i
end
But I don’t really like double negations …
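If double negation bothers you and you are on Ruby 2.4 or later, String#match? returns true or false directly. A short sketch, with has_do? as an illustrative method name:

```ruby
# String#match? (Ruby 2.4+) returns a plain boolean, so no comparison
# with nil and no negation is needed.
def has_do?(str)
  str.match?(/do: /i)
end

p has_do?("Things to do: eat and sleep.") # => true
p has_do?("THINGS TO DO: EAT")            # => true
p has_do?("Nothing here")                 # => false
```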
In ruby 1.9 you can do like this:
str.downcase.match("do: ") do
puts "yes"
end
It's not exactly what you asked for, but I noticed a comment to another answer. If you don't mind using regular expressions when matching the string, perhaps there is a way to skip the downcase part to get case insensitivity.
For more info, see String#match
You could also do this:
str.downcase.include? "Some string".downcase
If all I'm looking for is a case-insensitive substring match I usually use:
str.downcase['do: ']
9 times out of 10 I don't care where in the string the match is, so this is nice and concise.
Here's what it looks like in IRB:
>> str = "Things to do: eat and sleep." #=> "Things to do: eat and sleep."
>> str.downcase['do: '] #=> "do: "
>> str.downcase['foobar'] #=> nil
Because it returns nil if there is no hit it works in conditionals too.
"Things to do: eat and sleep.".index(/do: /i)
index returns the position where the match starts, or nil if not found
You can learn more about index method here:
http://ruby-doc.org/core/classes/String.html
Or about regex here:
http://www.regular-expressions.info/ruby.html

Resources