How to escape two forward slashes in Ruby Regex [closed] - ruby

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I'm trying to make a regex which finds the domain name using Ruby, so I tried this:
(?<=.*/).(?=.*/)
On Rubular I always see this error message: Forward slashes must be escaped.
How do I solve this?

When you use the // regex literal, you need to escape each / with a backslash, as \/. When you want a literal / in your regex, it is usually simpler to avoid the // literal altogether. For example, use the %r literal with a delimiter that does not conflict with the pattern.
%r{/}
By the way, Ruby's Onigmo regex engine does not allow variable-length lookbehind, so your regex will raise an error anyway.
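As a minimal sketch of both points (the sample URL and all variable names are made up for illustration), the two literal styles can be compared side by side, and the lookbehind restriction can be observed by compiling the pattern with Regexp.new:

```ruby
url = 'http://www.example.com/path'

# Both literals describe the same pattern: two consecutive forward slashes.
escaped = /\/\//   # each / needs a backslash inside the // literal
percent = %r{//}   # no escaping needed with %r{...}

escaped_index = url =~ escaped   # => 5
percent_index = url =~ percent   # => 5

# A fixed-length lookbehind is accepted by the engine...
host = url[%r{(?<=://)[^/]+}]    # => "www.example.com"

# ...but a variable-length one such as (?<=.*/) is rejected when compiled.
lookbehind_error =
  begin
    Regexp.new('(?<=.*/).')
    nil
  rescue RegexpError => e
    e
  end
```

If lookbehind_error is not nil, the engine refused the pattern, which matches what Rubular would report once the slashes are escaped.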

Don't reinvent wheels, especially ones that work:
require 'uri'
URI.split('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')
# => ["http",
# "user:passwd",
# "www.example.com",
# "81",
# nil,
# "/path/to/index.html",
# nil,
# "foo=bar",
# "baz"]
Or:
require 'addressable/uri'
uri = Addressable::URI.parse('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')
uri.authority # => "user:passwd@www.example.com:81"
uri.fragment # => "baz"
uri.host # => "www.example.com"
uri.password # => "passwd"
uri.path # => "/path/to/index.html"
uri.port # => 81
uri.query # => "foo=bar"
uri.query_values # => {"foo"=>"bar"}
uri.scheme # => "http"
uri.to_hash # => {:scheme=>"http", :user=>"user", :password=>"passwd", :host=>"www.example.com", :port=>81, :path=>"/path/to/index.html", :query=>"foo=bar", :fragment=>"baz"}
uri.user # => "user"
Between the two, Addressable::URI is more full-featured and follows the specs very closely. Ruby's built-in URI is good for lighter-weight lifting.
Root around in their code and you'll find the regular expressions used to tear apart a URL. You'll also see that they aren't trivial, because URLs can be quite "interesting", where "interesting" means you'll scream and pull out your hair. See the URI RFC for more information. See "Parsing a URI Reference with a Regular Expression" in that document for a suggested pattern.
...I'm doing an exercise from Codewars and I'm not allowed to use require
First, if so, why are you asking for help on how to write this? You are supposed to figure these things out yourself.
That said, try what has already been created. This uses the pattern in the RFC:
URI_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
uri_captures = 'http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz'.match(URI_REGEX).captures
# => ["http:",
# "http",
# "//user:passwd@www.example.com:81",
# "user:passwd@www.example.com:81",
# "/path/to/index.html",
# "?foo=bar",
# "foo=bar",
# "#baz",
# "baz"]
user, passwd, host, port = uri_captures[3].split(/[:@]/)
host # => "www.example.com"
For further convenience, here's a simple pattern to provide named captures:
URI_REGEX = %r!^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?!
uri_captures = 'http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz'.match(URI_REGEX)
authority_captures = uri_captures['authority'].match(/(?<user>[^:]+)?:?(?<passwd>[^@]+)?@?(?<host>.+)(:(?<port>\d+)?)/)
authority_captures['host']
# => "www.example.com"

Related

regex to check url format [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I want to check whether my URL format is correct. It has some AWS access keys etc:
/https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=[.+]&Expires=[.+]&Signature=[.+]/.match(url)
^ something like this. Could you please help?
The URI RFC (RFC 3986, Appendix B) suggests this regular expression for parsing URLs and URIs:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
You can also use the URI module from Ruby standard library:
require 'uri'
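# URI.regexp is marked obsolete; this calls the parser's make_regexp directly.
if url =~ /\A#{URI::DEFAULT_PARSER.make_regexp(%w[http https])}\z/
  puts "it's an url alright"
else
  puts "that's no url, that's a spaceship"
end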
To check for the existence of "some AWS access keys etc" you can do:
require 'uri'
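uri = URI.parse(url)
# uri.query is nil when the URL has no query string, hence to_s
params = URI.decode_www_form(uri.query.to_s).to_h
if params.has_key?('AWSAccessKeyId')
  # The 32-hex-digit pattern is only an example; adjust it to the real key format
  unless params['AWSAccessKeyId'] =~ /\A[a-f0-9]{32}\z/
    abort 'AWSAccessKeyId not valid'
  end
else
  abort 'AWSAccessKeyId required'
end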
Of course you can just use regular expressions to parse them directly but it gets ugly because the order of the parameters may be different:
>> url = "https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=abcd12345&Expires=12345678&Signature=abcd"
>> matchdata = url.match(
/
\A
(?<scheme>http(?:s)?):\/\/
(?<host>[^\/]+)
(?<path>\/.+)\?
(?=.*(?:[\?\&]|\b)AWSAccessKeyId\=(?<aws_access_key_id>[a-f0-9]{1,32}))
(?=.*(?:[\?\&]|\b)Expires=(?<expires>[0-9]+))
/x
)
=> #<MatchData "https://bucket.s3.amazonaws.com/path/file.txt?"
scheme:"https"
host:"bucket.s3.amazonaws.com"
path:"/path/file.txt"
aws_access_key_id:"abcd12345"
expires:"12345678">
>> matchdata[:aws_access_key_id]
# => "abcd12345"
This uses:
Positive lookaheads, (?=...), to ignore parameter order.
Ruby's named captures, (?<param_name>...), to identify the params in the match data.
Non-capturing groups, (?:abcd|efgh).
The matcher (?:[\?\&]|\b) to handle Expires=..., ?Expires=... or &Expires=...
And finally the /x free-spacing modifier to allow nicer formatting.
We need a url to work with:
url = "https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=somestuff&Expires=somemorestuff&Signature=evenmorestuff"
We also need to escape the slashes, dots and question mark, and do some non-greedy matching (.+?):
/https:\/\/bucket\.s3\.amazonaws\.com\/path\/file\.txt\?AWSAccessKeyId=.+?&Expires=.+?&Signature=.+/.match(url)
=> #<MatchData "https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=somestuff&Expires=somemorestuff&Signature=evenmorestuff">
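Since the order of query parameters may vary, another option is to skip the regex for the query string entirely. The following is only a sketch: the helper name missing_params and the REQUIRED_PARAMS list are assumptions made for illustration, and only the presence of each key is checked, not its value.

```ruby
require 'uri'

# Parameter names taken from the question; the list itself is an assumption.
REQUIRED_PARAMS = %w[AWSAccessKeyId Expires Signature].freeze

# Hypothetical helper: returns the required parameters absent from the URL,
# regardless of the order in which the present ones appear.
def missing_params(url)
  query = URI.parse(url).query.to_s            # nil query becomes ""
  present = URI.decode_www_form(query).map(&:first)
  REQUIRED_PARAMS - present
end

complete = 'https://bucket.s3.amazonaws.com/path/file.txt?Signature=abcd&Expires=12345678&AWSAccessKeyId=abcd12345'
partial  = 'https://bucket.s3.amazonaws.com/path/file.txt?Expires=12345678&AWSAccessKeyId=abcd12345'

missing_params(complete) # => []
missing_params(partial)  # => ["Signature"]
```

An empty result means every expected key is present; the values can then be validated one by one.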

How do you parse this string?

/events/3122671255551936/?ref=br_rs&action_history=null
I would just like to extract the number after '/events/' and before '/?ref=br_rs...
You could split it by the / character:
irb(main):003:0> "/events/3122671255551936/?ref=br_rs&action_history=null".split("/")[2]
=> "3122671255551936"
You can also use String#scan method to grab the digits:
"/events/3122671255551936/?ref=br_rs&action_history=null".scan(/\d+/).join
# => "3122671255551936"
If your string is str:
x = str["/events/".size..-1].to_i
#=> 3122671255551936
If you want the string:
x.to_s
#=> "3122671255551936"
You're looking at the path from a URL. A basic split will work initially:
str = '/events/3122671255551936/?ref=br_rs&action_history=null'
str.split('/')[2] # => "3122671255551936"
There are existing tools to make this easy and that will handle encoding and decoding of special characters during processing of the URL:
require 'uri'
str = '/events/3122671255551936/?ref=br_rs&action_history=null'
scheme, userinfo, host, port, registry, path, opaque, query, fragment = URI.split(str)
scheme # => nil
userinfo # => nil
host # => nil
port # => nil
registry # => nil
path # => "/events/3122671255551936/"
opaque # => nil
query # => "ref=br_rs&action_history=null"
fragment # => nil
uri = URI.parse(str)
path accesses the path component of the URL:
uri.path # => "/events/3122671255551936/"
Making it easy to grab the value:
uri.path.split('/')[2] # => "3122671255551936"
Now, imagine if that URL had a scheme and host like "http://www.example.com/" prepended, as most URLs do. (Having written hundreds of spiders and scrapers, I know how easy it is to encounter such a change.) Using a naive split('/') would immediately break:
str = 'http://www.example.com/events/3122671255551936/?ref=br_rs&action_history=null'
str.split('/')[2] # => "www.example.com"
That means any solution relying on split alone would break, along with any others that try to locate the position of the value based on the entire string.
But using the tools designed for the job the code would continue working:
uri = URI.parse(str)
uri.path.split('/')[2] # => "3122671255551936"
Notice how simple and easy to read it is, which translates into easier maintenance. It could even be simplified to:
URI.parse(str).path.split('/')[2] # => "3122671255551936"
and continue to work.
This is because URL/URI are an agreed-upon standard, making it possible to write a parser to take apart, and build, a string that conforms to the standard.
See the URI documentation for more information.
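To tie this together, here is a small sketch that combines URI.parse with a regex on the path only. The helper name event_id is hypothetical, and it assumes the number always follows a leading /events/ segment. Note the %r{} literal, which avoids escaping the slashes.

```ruby
require 'uri'

# Hypothetical helper: returns the digits following "/events/" in the
# path component, or nil when the path has a different shape.
def event_id(url)
  path = URI.parse(url).path
  path[%r{\A/events/(\d+)}, 1]
end

event_id('/events/3122671255551936/?ref=br_rs&action_history=null')
# => "3122671255551936"
event_id('http://www.example.com/events/3122671255551936/?ref=br_rs&action_history=null')
# => "3122671255551936"
```

Because the regex only ever sees the path, prepending a scheme and host does not change the result.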

How to parse a URL and extract the required substring

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step, obviously, is to check that the string contains "http://" - otherwise, it should ignore the string.
But, how do I then just extract the "something" in that string? Assume that all the strings that this will be evaluating will have a similar structure (i.e. I am trying to extract the subdomain of the URL - if the string being examined is indeed a valid URL - where valid is starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can just simply split the string at the "http://" but that doesn't solve the full problem because that will produce "http://something.example.com/directory/". All I want is the "something", nothing else.
I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first
=> "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.
You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is the domainatrix gem, mentioned in "Remove subdomain from string in ruby":
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.
Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, the first group of non '.' letters after http.
Check out http://rubular.com/. You can test your regular expressions against a set of tests too, it's great for learning this tool.
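A minimal sketch of the regex approach follows. The helper name subdomain is hypothetical, and it treats the first dot-separated label of the host as the subdomain, so "http://example.com/" would yield "example". The %r{} literal is used so the slashes need no escaping.

```ruby
# Hypothetical helper: returns the first host label of an http:// URL,
# or nil when the string does not start with "http://" or the host
# contains no dot.
def subdomain(url)
  url[%r{\Ahttp://([^./]+)\.}, 1]
end

subdomain('http://something.example.com/directory/') # => "something"
subdomain('ftp://something.example.com/directory/')  # => nil
```

Strings that do not start with "http://" return nil, which covers the "ignore the string" requirement from the question.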
With URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # => "http"
uri.host # => "localhost"
uri.port # => 3000

regex to remove the webpage part of a url in ruby

I am trying to remove the webpage part of the URL
For example,
www.example.com/home/index.html
to
www.example.com/home
any help appreciated.
Thanks
It's probably a good idea not to use regular expressions when possible. You may summon Cthulhu. Try using the URI library that's part of the standard library instead.
require "uri"
result = URI.parse("http://www.example.com/home/index.html")
result.host # => "www.example.com"
result.path # => "/home/index.html"
# The following line is rather unorthodox - is there a better solution?
File.dirname(result.path) # => "/home"
result.host + File.dirname(result.path) # => "www.example.com/home"
If your heart is set on using regex and you know that your URLs will be pretty straight forward you could use (.*)/.* to capture everything before the last / in your URL.
irb(main):007:0> url = "www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):008:0> regex = "(.*)/.*"
=> "(.*)/.*"
irb(main):009:0> url =~ /#{regex}/
=> 0
irb(main):010:0> $1
=> "www.example.com/home"
irb(main):001:0> url="www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):002:0> url.split("/")[0..-2].join("/")
=> "www.example.com/home"
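The same result can be had with a single anchored substitution. This is only a sketch: the helper name strip_last_segment is hypothetical, and it assumes scheme-less input like the example in the question (a bare "http://host" would lose its host).

```ruby
# Hypothetical helper: removes the final "/segment" from the string.
# \z anchors the match at the end, so only the last segment is dropped.
def strip_last_segment(url)
  url.sub(%r{/[^/]*\z}, '')
end

strip_last_segment('www.example.com/home/index.html') # => "www.example.com/home"
strip_last_segment('www.example.com')                 # => "www.example.com"
```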

In Ruby, how do I replace the question mark character in a string?

In Ruby, I have:
require 'uri'
foo = "et tu, brutus?"
bar = URI.encode(foo) # => "et%20tu,%20brutus?"
I'm trying to get bar to equal "et%20tu,%20brutus%3f" ("?" replaced with "%3F") When I try to add this:
bar["?"] = "%3f"
the "?" matches everything, and I get
=> "%3f"
I've tried
bar["\?"]
bar['?']
bar["/[?]"]
bar["/[\?]"]
And a few other things, none of which work.
require 'cgi' and call CGI.escape
There is only one good way to do this right now in Ruby:
require "addressable/uri"
Addressable::URI.encode_component(
"et tu, brutus?",
Addressable::URI::CharacterClasses::PATH
)
# => "et%20tu,%20brutus%3F"
But if you're doing stuff with URIs you should really be using Addressable anyways.
sudo gem install addressable
Here's a sample irb session:
irb(main):001:0> x = "geo?"
=> "geo?"
irb(main):002:0> x.sub!("?","a")
=> "geoa"
irb(main):003:0>
However, sub will only replace the first occurrence. If you want to replace all the question marks in a string, use the gsub method like this:
str.gsub!("?","replacement")
If you know which characters you accept, you can remove those that don't match.
accepted_chars = 'A-Za-z0-9\s,'
foo = "et tu, brutus?"
bar = foo.gsub(/[^#{accepted_chars}]/, '')
URI.escape accepts an optional second parameter that specifies which characters to escape. It overrides the defaults, so you have to call it twice. (URI.escape was removed in Ruby 3.0, so this only works on older versions.)
> URI.escape URI.escape("et tu, brutus?"), "?"
=> "et%20tu,%20brutus%3F"
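On Ruby 3.0 and later, where URI.escape and URI.encode no longer exist, the same output can be produced with plain gsub. The following is a sketch rather than a standard API: percent_encode is a hypothetical helper, and its safe set (the RFC 3986 unreserved characters plus the comma) is an assumption chosen so the result matches the string asked for in the question.

```ruby
# Hypothetical helper: percent-encodes every character outside the safe
# set, byte by byte, so multibyte characters are handled as well.
def percent_encode(str)
  str.gsub(/[^A-Za-z0-9\-._~,]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

percent_encode('et tu, brutus?') # => "et%20tu,%20brutus%3F"

# Replacing only the question mark in an already-encoded string:
'et%20tu,%20brutus?'.gsub('?', '%3F') # => "et%20tu,%20brutus%3F"
```

When the first argument to gsub is a String rather than a Regexp, "?" is matched literally, so no escaping is needed.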

Resources