regex to check url format [closed] - ruby

I want to check whether my URL format is correct. It contains some AWS access keys etc.:
/https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=[.+]&Expires=[.+]&Signature=[.+]/.match(url)
^ something like this. Could you please help?

The URI RFC (RFC 3986, Appendix B) specifies this regular expression for parsing URLs and URIs:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
You can also use the URI module from Ruby standard library:
require 'uri'
# \A and \z anchor the whole string; ^ and $ would only anchor a single line.
if url =~ /\A#{URI::regexp(%w(http https))}\z/
  puts "it's an url alright"
else
  puts "that's no url, that's a spaceship"
end
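If a regex feels heavier than needed, one alternative is to let URI.parse do the work and then inspect the result. This is a minimal sketch, assuming that only http and https URLs with a host count as valid; http_url? is a hypothetical helper name, not part of the URI module:

```ruby
require 'uri'

# Hypothetical helper: true when the string parses as an http(s) URL with a host.
def http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  false
end

puts http_url?('https://bucket.s3.amazonaws.com/path/file.txt')
puts http_url?('ftp://example.com/file.txt')
puts http_url?('not a url')
```

The rescue clause matters because URI.parse raises on malformed input instead of returning nil.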
To check for the existence of "some AWS access keys etc" you can do:
require 'uri'
uri = URI.parse(url)
# uri.query is nil when the URL has no query string, hence to_s
params = URI.decode_www_form(uri.query.to_s).to_h
if params.has_key?('AWSAccessKeyId')
  unless params['AWSAccessKeyId'] =~ /\A[a-f0-9]{32}\z/
    abort 'AWSAccessKeyId not valid'
  end
else
  abort 'AWSAccessKeyId required'
end
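The same parsing approach extends to all three parameters named in the question. The following is only a sketch: missing_params and REQUIRED_PARAMS are hypothetical names, and the check covers presence only, not the format of each value:

```ruby
require 'uri'

REQUIRED_PARAMS = %w[AWSAccessKeyId Expires Signature].freeze

# Hypothetical helper: returns the required parameter names
# that are absent from the URL's query string.
def missing_params(url)
  query = URI.parse(url).query || '' # query is nil when the URL has no "?"
  params = URI.decode_www_form(query).to_h
  REQUIRED_PARAMS.reject { |name| params.key?(name) }
end

url = 'https://bucket.s3.amazonaws.com/path/file.txt?Expires=12345678&AWSAccessKeyId=abcd12345'
p missing_params(url)
```

Because the query string is decoded into a hash first, the order of the parameters in the URL does not matter.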
Of course, you can just use regular expressions to parse them directly, but it gets ugly because the parameters may appear in any order:
>> url = "https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=abcd12345&Expires=12345678&Signature=abcd"
>> matchdata = url.match(
/
\A
(?<scheme>http(?:s)?):\/\/
(?<host>[^\/]+)
(?<path>\/.+)\?
(?=.*(?:[\?\&]|\b)AWSAccessKeyId\=(?<aws_access_key_id>[a-f0-9]{1,32}))
(?=.*(?:[\?\&]|\b)Expires=(?<expires>[0-9]+))
/x
)
=> #<MatchData "https://bucket.s3.amazonaws.com/path/file.txt?"
scheme:"https"
host:"bucket.s3.amazonaws.com"
path:"/path/file.txt"
aws_access_key_id:"abcd12345"
expires:"12345678">
>> matchdata[:aws_access_key_id]
# => "abcd12345"
This uses:
- Positive lookahead (?=...) to ignore parameter order
- Ruby's named captures (?<param_name>.*) to identify the params in the match data
- Non-capturing groups (?:abcd|efgh)
- The matcher (?:[\?\&]|\b) to handle Expires=..., ?Expires=... or &Expires=...
- The /x free-spacing modifier to allow nicer formatting
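To see the order independence in isolation, here is a reduced sketch that applies two lookaheads to a bare query string. PARAMS is a hypothetical constant name, and the pattern is deliberately simpler than the one above:

```ruby
# Two lookaheads anchored at the start of the string. Each one scans ahead
# independently, so the parameters may appear in either order.
PARAMS = /\A(?=.*\bAWSAccessKeyId=(?<key>[^&]+))(?=.*\bExpires=(?<expires>[0-9]+))/

a = PARAMS.match('AWSAccessKeyId=abcd12345&Expires=12345678')
b = PARAMS.match('Expires=12345678&AWSAccessKeyId=abcd12345')

puts a[:key], a[:expires]
puts b[:key], b[:expires]
```

Both matches yield the same named captures; a string that lacks either parameter does not match at all.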

We need a url to work with:
url = "https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=somestuff&Expires=somemorestuff&Signature=evenmorestuff"
We also need to escape a bunch of stuff (including the dots in the host name) and do some non-greedy matching (.+?):
/https:\/\/bucket\.s3\.amazonaws\.com\/path\/file\.txt\?AWSAccessKeyId=.+?&Expires=.+?&Signature=.+/.match(url)
=> #<MatchData "https://bucket.s3.amazonaws.com/path/file.txt?AWSAccessKeyId=somestuff&Expires=somemorestuff&Signature=evenmorestuff">

Related

Remove all but the website name from URL in Ruby [duplicate]

I am iterating through a list of URLs. The URLs come in different formats, like:
https://twitter.com/sdfaskj...
https://www.linkedin.com/asdkfjasd...
http://google.com/asdfjasdj...
etc.
I would like to use gsub or something similar to erase everything but the name of the website, to get only "twitter", "linkedin", and "google", respectively.
Ideally, I would like something like a .gsub that can check for multiple possibilities (url.gsub("https:// or https://www. or http:// etc.", "")) and replace them with an empty string "" when found. It also needs to delete everything after the name, such as ".com/wkadslflj...".
attributes.css("a").each do |attribute|
  attribute_url = attribute["href"]
  attribute_scrape = attribute_url.gsub("https://", "")
  binding.pry
end
I would consider a combination of URI.parse to get the hostname from the URL and the PublicSuffix gem to get the second level domain:
require 'public_suffix'
require 'uri'
url = 'https://www.linkedin.com/asdkfjasd'
host = URI.parse(url).host # => 'www.linkedin.com'
PublicSuffix.parse(host).sld # => 'linkedin'
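Where adding a gem is not an option, a rough standard-library approximation is possible. This sketch assumes the site name is simply the first host label after an optional "www." prefix, which is not true in general (for example, "blog.example.com" would yield "blog"); site_name is a hypothetical helper name:

```ruby
require 'uri'

# Hypothetical helper: naive site name = first host label after an optional "www.".
# Assumes there are no other subdomains in front of the name.
def site_name(url)
  host = URI.parse(url).host
  return nil if host.nil?

  host.sub(/\Awww\./, '').split('.').first
end

list = ['https://twitter.com/sdfaskj', 'https://www.linkedin.com/asdkfjasd', 'http://google.com/asdfjasdj']
p list.map { |u| site_name(u) }
```

The PublicSuffix approach remains the more reliable one, since it knows about multi-part suffixes such as co.uk.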
You can use this gsub regexp (note the escaped dots, so that "." only matches a literal dot):
gsub(/http(s)?:\/\/(www\.)?|\.(com|net|co\.uk|us)+.*/, '')
Output:
list = ["https://twitter.com/sdfaskj...", "https://www.linkedin.com/asdkfjasd...", "http://google.com/asdfjasdj..."]
list.map { |u| u.gsub(/http(s)?:\/\/(www\.)?|\.(com|net|co\.uk|us)+.*/, '') }
=> ["twitter", "linkedin", "google"]

How to escape two forward slashes in Ruby Regex [closed]

I'm trying to make a regex which finds the domain name using Ruby, so I tried this:
(?<=.*/).(?=.*/)
On Rubular I always see this error message: Forward slashes must be escaped.
How do I solve this?
When you use the // regex literal, you need to escape / using a backslash as \/. When you want literal / in your regex, it is usually simpler to avoid using the // literal. For example, use %r literal with any delimiters that would not cause conflict.
%r{/}
By the way, Ruby's Onigmo regex engine does not allow variable-length lookbehind, so your regex will raise an error anyway.
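As an illustration of both points, the sketch below uses a %r{} literal so that "/" needs no escaping, and a capture group instead of lookbehind. HOST is a hypothetical constant name, and the pattern is a simplification that captures the whole authority part (host plus any port), not a full URI grammar:

```ruby
# %r{...} lets "/" appear unescaped. The group captures everything between
# "//" and the next "/", "?" or "#", so no lookbehind is required.
HOST = %r{\A[a-z][a-z0-9+.-]*://([^/?#]+)}i

url = 'http://www.example.com:81/path/to/index.html?foo=bar#baz'
puts url[HOST, 1]
```

String#[] with a regex and a group index returns that capture, or nil when the pattern does not match.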
Don't reinvent wheels, especially ones that work:
require 'uri'
URI.split('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')
# => ["http",
# "user:passwd",
# "www.example.com",
# "81",
# nil,
# "/path/to/index.html",
# nil,
# "foo=bar",
# "baz"]
Or:
require 'addressable/uri'
uri = Addressable::URI.parse('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')
uri.authority # => "user:passwd@www.example.com:81"
uri.fragment # => "baz"
uri.host # => "www.example.com"
uri.password # => "passwd"
uri.path # => "/path/to/index.html"
uri.port # => 81
uri.query # => "foo=bar"
uri.query_values # => {"foo"=>"bar"}
uri.scheme # => "http"
uri.to_hash # => {:scheme=>"http", :user=>"user", :password=>"passwd", :host=>"www.example.com", :port=>81, :path=>"/path/to/index.html", :query=>"foo=bar", :fragment=>"baz"}
uri.user # => "user"
Between the two, Addressable::URI is more full-featured and follows the specs very closely. Ruby's built-in URI is good for lighter-weight lifting.
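For comparison, the built-in URI class exposes similar accessors once the string has been parsed. A short sketch using only the standard library and the same example URL:

```ruby
require 'uri'

uri = URI.parse('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')

puts uri.scheme   # http
puts uri.userinfo # user:passwd
puts uri.host     # www.example.com
puts uri.port     # 81
puts uri.path     # /path/to/index.html
puts uri.query    # foo=bar
puts uri.fragment # baz
```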
Root around in their code and you'll find the regular expressions used to tear apart a URL. You'll also see that they aren't trivial, because URLs can be quite "interesting", where "interesting" means you'll scream and pull out your hair. See the URI RFC for more information, in particular "Parsing a URI Reference with a Regular Expression" for a suggested pattern.
...I do a exercise from codewars and I not allowed to use require
First, if so, why are you asking for help on how to write this? You are supposed to figure these things out yourself.
That said, try what has already been created. This uses the pattern in the RFC:
URI_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
uri_captures = 'http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz'.match(URI_REGEX).captures
# => ["http:",
# "http",
# "//user:passwd@www.example.com:81",
# "user:passwd@www.example.com:81",
# "/path/to/index.html",
# "?foo=bar",
# "foo=bar",
# "#baz",
# "baz"]
user, passwd, host, port = uri_captures[3].split(/[:@]/)
host # => "www.example.com"
For further convenience, here's a simple pattern to provide named captures:
URI_REGEX = %r!^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(?<fragment>#(.*))?!
uri_captures = 'http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz'.match(URI_REGEX)
authority_captures = uri_captures['authority'].match(/(?<user>[^:]+)?:?(?<passwd>[^@]+)?@?(?<host>.+)(:(?<port>\d+)?)/)
authority_captures['host']
# => "www.example.com"

Code not working after little change [closed]

I'm a beginner in programming. I changed my code a little so that I can execute some methods from the command line. After that change I got an error, but first my code:
require 'faraday'
@conn = Faraday.new 'https://zombost.de:8673/rest', :ssl => {:verify => false}
@uid = '8978'
def certificate
  res = conn.get do |request|
    request.url "acc/#{@uid}/cere"
  end
end
def suchen(input)
  @suche = input
  @res = @conn.get do |request|
    request.url "acc/?search=#{@suche}"
  end
end
puts @res.body
Then I wrote into the console:
ruby prog.rb suchen(jumbo)
So, somehow I get the error:
Undefined method body for Nilclass
You're not invoking either of your methods, so @res is never assigned.
@res evaluates to nil, so you're invoking nil.body.
RE: Your update:
ruby prog.rb suchen(jumbo)
That isn't how you invoke a method. You have to call it from within the source file. All you're doing is passing an argument to your script, which will be a simple string available in the ARGV array.
RE: Your comment:
It should go without saying that the solution is to actually invoke your method.
You can call the methods from the command line. Ruby has a function eval which evaluates a string. You can eval the command line argument strings.
Here's how. Change the line puts @res.body to
str = ARGV[1]
eval(ARGV[0])
puts @res.body
then run your program like so
$ ruby prog.rb suchen\(str\) jumbo
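Since eval runs whatever string it is given, a more restrictive variant is to look the method name up in an allow-list and call it with send. This is a sketch under assumptions: ALLOWED_COMMANDS and dispatch are hypothetical names, and suchen is reduced to a stand-in that returns a string instead of performing the HTTP request:

```ruby
# Hypothetical stand-in for the question's method; the real one performs an HTTP request.
def suchen(input)
  "searching for #{input}"
end

ALLOWED_COMMANDS = %w[suchen].freeze

# Runs the named method only if it is on the allow-list,
# passing the remaining arguments through.
def dispatch(argv)
  command, *args = argv
  raise ArgumentError, "unknown command: #{command.inspect}" unless ALLOWED_COMMANDS.include?(command)

  send(command, *args)
end

# In a real script this would be dispatch(ARGV).
puts dispatch(%w[suchen jumbo])
```

With dispatch(ARGV) in place of the last line, the script would be run as ruby prog.rb suchen jumbo, with no parentheses or shell escaping needed.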

How to add additional parameters to url? - encode url [closed]

How to add additional parameters to url from a hash? For example:
parameters = Hash.new
parameters["special"] = '25235'
parameters["code"] = 62346234
http: //127.0.0.1:8000/api/book_category/? %s parameters
require 'httparty'
require 'json'
response = HTTParty.get("http://127.0.0.1:8000/api/book_category/?")
json = JSON.parse(response.body)
puts json
The following should give you a valid URI which you can use for the json query.
require 'uri'
parameters = {'special' => '512351235','code' => 6126236}
uri = URI.parse('http://127.0.0.1:8000/api/book_category/').tap do |u|
  u.query = URI.encode_www_form parameters
end
uri.to_s
#=> "http://127.0.0.1:8000/api/book_category/?special=512351235&code=6126236"
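If the base URL may already carry a query string, assigning uri.query directly would overwrite it. A sketch of one way to merge instead, using only the standard library; add_params is a hypothetical helper name and the page=2 parameter is made up for illustration:

```ruby
require 'uri'

# Hypothetical helper: appends extra parameters to whatever query
# the URL already has, keeping the existing pairs first.
def add_params(url, extra)
  uri = URI.parse(url)
  existing = URI.decode_www_form(uri.query || '')
  uri.query = URI.encode_www_form(existing + extra.to_a)
  uri.to_s
end

result = add_params('http://127.0.0.1:8000/api/book_category/?page=2',
                    { 'special' => '25235', 'code' => 62346234 })
puts result
```

URI.encode_www_form accepts an array of key/value pairs as well as a hash, and it percent-encodes keys and values as needed.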
The Tin Man's comment on your question is probably the better answer:
require 'httparty'
require 'json'
parameters = {'special' => '512351235','code' => 6126236}
response = HTTParty.get('http://127.0.0.1:8000/api/book_category/', :query => parameters)
json = JSON.parse(response.body)
puts json
The Addressable::URI class is an excellent replacement for the URI module in the standard library, and provides for just such manipulation of URI strings without having to build and escape the query string by hand.
This code demonstrates:
require 'addressable/uri'
uri = Addressable::URI.parse('http://127.0.0.1:8000/api/book_category/')
parameters = {}
parameters["special"] = '25235'
parameters["code"] = 62346234
uri.query_values = parameters
puts uri
Output:
http://127.0.0.1:8000/api/book_category/?code=62346234&special=25235

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
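The difference is easy to see on a deliberately simplified pattern. The sketch below is not the regex from the question; it only contrasts a capturing group with a non-capturing one:

```ruby
content = 'see http://a.example and https://b.example/x'

# With a capturing group, scan returns one array of captures per match.
grouped = content.scan(/(https?):\/\/\S+/)
p grouped # [["http"], ["https"]]

# With a non-capturing group, scan returns the whole matched strings.
whole = content.scan(/(?:https?):\/\/\S+/)
p whole # ["http://a.example", "https://b.example/x"]
```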
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
Just for your interest: Ruby has a URI module that provides a regex for this:
require "uri"
urls = []
uris_you_want_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']
html_string.scan(URI.regexp(uris_you_want_to_grab)) do |*matches|
  urls << $&
end
For more information, see the Ruby reference for URI.
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here (?:[\s)]|$) is used to identify the end of the URL and you can add characters there as per your need and content. Right now it looks for any space characters, closing bracket or end of string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].
