Regexp for finding href in <a> open-uri ruby - ruby

I need to find the distance between two websites using Ruby's open-uri. I'm using:
def check(url)
site = open(url.base_url)
link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}
site.each_line {|line| puts $&,$1,$2,$3,$4 if (line=~link)}
p url.links
end
Finding links is not working properly. Any ideas why?

If you want to find the a tags' href parameters, use the right tool, which is rarely a regex. More likely you should use an HTML/XML parser.
Nokogiri is the parser of choice with Ruby:
require 'nokogiri'
require 'open-uri'
# URI.open is the open-uri entry point; on Ruby 3.0+ a bare open(url) no longer fetches URLs
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))
pp doc.search('a').map { |a| a['href'] }
# => [
# => "/",
# => "/domains/",
# => "/numbers/",
# => "/protocols/",
# => "/about/",
# => "/go/rfc2606",
# => "/about/",
# => "/about/presentations/",
# => "/about/performance/",
# => "/reports/",
# => "/domains/",
# => "/domains/root/",
# => "/domains/int/",
# => "/domains/arpa/",
# => "/domains/idn-tables/",
# => "/protocols/",
# => "/numbers/",
# => "/abuse/",
# => "http://www.icann.org/",
# => "mailto:iana#iana.org?subject=General%20website%20feedback"
# => ]
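Most of the hrefs in that output are relative, so anything that follows links has to resolve them against the page URL first. A minimal sketch using URI.join from the standard library; the base URL and sample hrefs are copied from the output above, and the variable names are only illustrative:

```ruby
require 'uri'

# Page the links were scraped from, and a few of the hrefs found on it.
base  = 'http://www.example.org/index.html'
hrefs = ['/', '/domains/', '/about/presentations/', 'http://www.icann.org/']

# URI.join resolves a relative reference against the base;
# an href that is already absolute comes back unchanged.
absolute = hrefs.map { |href| URI.join(base, href).to_s }

puts absolute
```

Non-HTTP links such as mailto: would normally be filtered out before this step.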

I see several issues with this regular expression:
The ^ and $ anchors mean it can only match when the tag is the only thing on the line, which is rarely true of real HTML
A space does not have to come before the trailing slash in an empty tag, yet your regexp requires it
Your regexp is very verbose and redundant
Try the following instead; it extracts the URL from <a> tags:
link = /<a \s # Start of tag
[^>]* # Some whitespace, other attributes, ...
href=" # Start of URL
([^"]*) # The URL, everything up to the closing quote
" # The closing quotes
/x # We stop here, as regular expressions wouldn't be able to
# correctly match nested tags anyway
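To show how a pattern of this shape would be used, here is a small sketch with String#scan. The sample markup and variable names are made up for illustration, and the pattern is a lightly adjusted copy (a lazy [^>]*? so it stops at the first href), not necessarily the exact one the answer intends:

```ruby
# Hypothetical sample markup, not taken from the question.
html = <<~HTML
  <p><a class="nav" href="/domains/">Domains</a></p>
  <a href="http://www.icann.org/">ICANN</a>
HTML

link = /<a\s      # start of tag
        [^>]*?    # any other attributes before href
        href="    # start of URL
        ([^"]*)   # the URL, everything up to the closing quote
        "         # the closing quote
       /x

# scan returns one array of captures per match; flatten gives a plain list.
hrefs = html.scan(link).flatten

puts hrefs
```

This only handles double-quoted href values, which is one more reason a parser is the safer choice for arbitrary pages.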

Related

Ruby regex to extract match_group value?

I have two questions about regex.
The match string is:
"FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"
When extracting the user_email value, my regexp is:
\s+(?<email_from_header>\S+)
and the match group value is:
(space)user_email=admin#example.com"
What do I use to omit the first (space) char and the last " quote?
When extracting the token, my regex is:
AUTH-TOKEN\s+(?<auth_token>\S+)
and the match group value is:
FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA,
What do I use to delete that last trailing comma ,?
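For the email, your regex would be:
\s+\K(?<email_from_header>[^"]*)
\K discards everything matched so far from the reported match, so the leading whitespace is dropped. The negated character class [^"]* matches zero or more characters that are not a double quote, so the match stops before the closing ".
For the token, your regex would be:
AUTH-TOKEN\s+(?<auth_token>[^,]*)
[^,]* matches zero or more characters that are not a comma, so the trailing comma is left out.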
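A quick way to try both ideas in Ruby. The header line below is a guess at the full input (the question only shows fragments of it), and the variable names are illustrative:

```ruby
# Hypothetical header line; the question does not show the complete input.
header = 'AUTH-TOKEN FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"'

# Named group with a negated class: stops before the comma.
token = header[/AUTH-TOKEN\s+(?<auth_token>[^,]*)/, 'auth_token']

# \K drops the comma and whitespace from the match; [^"]* stops before the quote.
email = header[/,\s+\K[^"]*/]

# Same trick, but discarding the key as well to keep only the address.
address = header[/user_email=\K[^"]*/]

puts token, email, address
```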
If your string has embedded double-quotes:
str = '"FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"'
str # => "\"FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com\""
str[/^"(.+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^"(.+?),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^"([^,]+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/(user_email=.+)"/, 1] # => "user_email=admin#example.com"
str[/(user_email=[^"]+)"/, 1] # => "user_email=admin#example.com"
str[/user_email=([^"]+)"/, 1] # => "admin#example.com"
match = str.match(/(?<user_email>user_email=(?<addr>.+))"/)
match # => #<MatchData "user_email=admin#example.com\"" user_email:"user_email=admin#example.com" addr:"admin#example.com">
match['user_email'] # => "user_email=admin#example.com"
match['addr'] # => "admin#example.com"
If it doesn't:
str = 'FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com'
str # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA, user_email=admin#example.com"
str[/^(.+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^(.+?),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/^([^,]+),/, 1] # => "FuR6UcUiduzPyenxCSzZbDXTge3f3t9ufA"
str[/(user_email=.+)/, 1] # => "user_email=admin#example.com"
str[/(user_email=(.+))/, 2] # => "admin#example.com"
str[/user_email=(.+)/, 1] # => "admin#example.com"
Or, having more regex fun:
match = str.match(/(?<user_email>user_email=(?<addr>.+))/)
match # => #<MatchData "user_email=admin#example.com" user_email:"user_email=admin#example.com" addr:"admin#example.com">
match['user_email'] # => "user_email=admin#example.com"
match['addr'] # => "admin#example.com"
Regular expressions are a very rich language, and there are usually many ways to write the same thing. The problem then becomes maintaining the pattern as the program "matures". I recommend starting simple and expanding the pattern as needs dictate. Don't start with a complex pattern and hope it works; getting a complex pattern right on the first try rarely succeeds.

how to remove backslash from a string containing an array in ruby

I have a string like this
a="[\"6000208900\",\"600020890225\",\"600900231930\"]"
#expected result [6000208900,600020890225,600900231930]
I am trying to remove the backslash from the string.
a.gsub!(/^\"|\"?$/, '')
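Inside a double-quoted string literal, any embedded double quote must be escaped with \. The backslash is only part of the literal's syntax, not of the string's contents, so there is nothing to remove.
Print the string with puts and you will see the backslashes are not there.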
a = "[\"6000208902912790\"]"
puts a # => ["6000208902912790"]
Or use JSON
irb(main):001:0> require 'json'
=> true
irb(main):002:0> a = "[\"6000208902912790\"]"
=> "[\"6000208902912790\"]"
irb(main):003:0> b = JSON.parse a
=> ["6000208902912790"]
irb(main):004:0> b
=> ["6000208902912790"]
irb(main):005:0> b.to_s
=> "[\"6000208902912790\"]"
Update (as per the OP's last edit):
irb(main):002:0> a = "[\"6000208900\",\"600020890225\",\"600900231930\"]"
=> "[\"6000208900\",\"600020890225\",\"600900231930\"]"
irb(main):006:0> a.scan(/\d+/).map(&:to_i)
=> [6000208900, 600020890225, 600900231930]
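The two suggestions above can be combined: parse the string as JSON, then convert each element. A minimal sketch, assuming the input is always a JSON array of digit strings (Integer raises on anything else, which scan(/\d+/) would silently skip):

```ruby
require 'json'

a = "[\"6000208900\",\"600020890225\",\"600900231930\"]"

# The backslashes exist only in the source literal, not in the string itself.
has_backslash = a.include?("\\")

# JSON.parse yields an array of strings; Integer(s, 10) converts strictly in base 10.
numbers = JSON.parse(a).map { |s| Integer(s, 10) }

p has_backslash
p numbers
```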
The code a.gsub!(/^\"|\"?$/, '') can't remove the double quote characters because they are not at the beginning and the end of the string. To get what you want try this:
a.gsub(/((?<=^\[)")|("(?=\]$))/, '')
Try this:
a = "[\"6000208902912790\"]"
a.chars.select { |x| x =~ %r|\d| }.join
# => "6000208902912790"
[a.chars.select { |x| x =~ %r|\d| }.join]
# => ["6000208902912790"] # an array containing the string
[a.chars.select { |x| x =~ %r|\d| }.join].to_s
# => "[\"6000208902912790\"]" # back to the original string
Writing a="["6000208902912790"]" raises an `unexpected tINTEGER` syntax error,
so a="[\"6000208902912790\"]" is written with the \ character to escape the double quotes.
If the goal is to get rid of the double quotes themselves, removing them solves the problem.
Do this:
a.gsub!(/"/, '')

How do I add a parameter to a URL?

I'm trying to open multiple HTML documents. The URL for each site looks like this:
http://www.website.com/info/state=AL
AL is Alabama, but it changes by state. I can create an array with all the two-letter combinations, state=('aa'..'zz').to_a, but how can I insert each one into the parameter where AL is above?
I want it to pull up the HTML document for all two letter combinations, and from there I can use a conditional to weed out the ones I don't want. But how should I go about inserting the two letter combinations?
Ruby's URI class is useful. It's not the most full-featured package for handling URLs out there (check out Addressable::URI if you need more), but it's good:
require 'uri'
uri = URI.parse('http://www.website.com/info')
{
'Alabama' => 'AL',
'Alaska' => 'AK',
'Arizona' => 'AZ',
'Arkansas' => 'AR',
'California' => 'CA',
}.each_pair do |k, v|
uri.query = URI.encode_www_form( {'state' => v} )
puts uri.to_s
end
Which outputs:
http://www.website.com/info?state=AL
http://www.website.com/info?state=AK
http://www.website.com/info?state=AZ
http://www.website.com/info?state=AR
http://www.website.com/info?state=CA
Or:
%w[AL AK AZ AR CA].each do |s|
uri.query = URI.encode_www_form( {'state' => s} )
puts uri.to_s
end
Which outputs the same thing.
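If the URL might already carry a query string, assigning uri.query directly would overwrite it. A small sketch of a helper that appends instead; add_param is a hypothetical name, and it relies only on URI.decode_www_form and URI.encode_www_form from the standard library:

```ruby
require 'uri'

# Hypothetical helper: returns url with key=value appended to its query string.
def add_param(url, key, value)
  uri = URI.parse(url)
  # Existing parameters as an array of [key, value] pairs (empty if none).
  params = URI.decode_www_form(uri.query || '')
  params << [key, value]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

first  = add_param('http://www.website.com/info', 'state', 'AL')
second = add_param('http://www.website.com/info?page=2', 'state', 'AK')

puts first
puts second
```

Looping over an array of state codes then becomes states.map { |s| add_param(base, 'state', s) }, where states and base are whatever you already have.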

Squeeze double char in Ruby

What is the best way to squeeze a repeated multi-character sequence in a string?
Example:
hahahahahaha => ha
lalalala => la
awdawdawdawd => awd
str.squeeze("ha") # doesn't work
str.tr("haha", "ha") # doesn't work
def squeeze(s)
s.gsub(/(.+?)\1+/, '\1')
end
puts squeeze('hahahaha') # => 'ha'
puts squeeze('awdawdawd') # => 'awd'
puts squeeze('hahahaha something else') # => 'ha something else'
You can use regex based search and replace:
str.gsub(/(ha)+/, 'ha')
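One thing to be aware of with the generic (.+?)\1+ pattern: it also collapses ordinary doubled letters. A sketch comparing it with a variant that requires the repeated unit to be at least two characters long; the method names are made up for the comparison:

```ruby
# Collapses any immediately repeated run, including single characters.
def squeeze_any(s)
  s.gsub(/(.+?)\1+/, '\1')
end

# Variant: only collapses repeated units of two or more characters.
def squeeze_multi(s)
  s.gsub(/(.{2,}?)\1+/, '\1')
end

puts squeeze_any('hello hahaha')    # prints: helo ha
puts squeeze_multi('hello hahaha')  # prints: hello ha
puts squeeze_multi('awdawdawd')     # prints: awd
```

Which behavior is right depends on the input; if the repeated unit is known in advance, the explicit (ha)+ form is the least surprising.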

ruby regex: how to get group value

Here is my regex:
s = /(?<head>http|https):\/\/(?<host>[^\/]+)/.match("http://www.myhost.com")
How do I get the head and host groups?
s['head'] # => "http"
s['host'] # => "www.myhost.com"
You could also use URI...
1.9.3p327 > require 'uri'
=> true
1.9.3p327 > u = URI.parse("http://www.myhost.com")
=> #<URI::HTTP:0x007f8bca2239b0 URL:http://www.myhost.com>
1.9.3p327 > u.scheme
=> "http"
1.9.3p327 > u.host
=> "www.myhost.com"
Use captures:
string = ...
one, two, three = string.match(/pattern/).captures
You should probably use the URI library for this purpose as suggested above, but whenever you match a string against a regex, you can grab the captured values using the special variables $1, $2, and so on:
"foo bar baz" =~ /(bar)\s(baz)/
$1
=> 'bar'
$2
=> 'baz'
and so on...
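The approaches above side by side, as a sketch. The URL and pattern come from the question; the other names are illustrative. Note that match returns nil when nothing matches, so calling captures on the result directly would raise in that case:

```ruby
url = 'http://www.myhost.com'
pattern = /(?<head>https?):\/\/(?<host>[^\/]+)/

m = pattern.match(url)

# Guard against a failed match before asking for the groups.
by_name     = m ? m.named_captures : {}
by_position = m ? m.captures : []

p by_name
p by_position

# With a regex literal on the left of =~, named groups become local variables.
if /(?<scheme>https?):\/\/(?<hostname>[^\/]+)/ =~ url
  puts "#{scheme} #{hostname}"
end
```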
