How do you parse this string? - ruby

/events/3122671255551936/?ref=br_rs&action_history=null
I would just like to extract the number after '/events/' and before '/?ref=br_rs...

You could split it by the / character:
irb(main):003:0> "/events/3122671255551936/?ref=br_rs&action_history=null".split("/")[2]
=> "3122671255551936"

You can also use the String#scan method to grab the digits (this works here because the string contains only one run of digits):
"/events/3122671255551936/?ref=br_rs&action_history=null".scan(/\d+/).join
# => "3122671255551936"
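If the string could contain other digits, anchoring the pattern to the '/events/' prefix is safer than joining every digit run. A minimal sketch, where the method name event_id_from is made up for illustration:

```ruby
# Hypothetical helper: returns the digits that directly follow "/events/",
# or nil when the string has no such segment.
def event_id_from(str)
  # String#[] with a regexp and a capture index returns that capture group.
  str[%r{/events/(\d+)}, 1]
end

puts event_id_from('/events/3122671255551936/?ref=br_rs&action_history=null')
# => 3122671255551936

# For comparison, scan + join glues unrelated digit runs together:
puts '/events/42/page/7'.scan(/\d+/).join
# => 427
```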

If your string is str:
x = str["/events/".size..-1].to_i
#=> 3122671255551936
If you want the string:
x.to_s
#=> "3122671255551936"

You're looking at the path from a URL. A basic split will work initially:
str = '/events/3122671255551936/?ref=br_rs&action_history=null'
str.split('/')[2] # => "3122671255551936"
There are existing tools that make this easy and that handle the encoding and decoding of special characters while processing the URL:
require 'uri'
str = '/events/3122671255551936/?ref=br_rs&action_history=null'
scheme, userinfo, host, port, registry, path, opaque, query, fragment = URI.split(str)
scheme # => nil
userinfo # => nil
host # => nil
port # => nil
registry # => nil
path # => "/events/3122671255551936/"
opaque # => nil
query # => "ref=br_rs&action_history=null"
fragment # => nil
uri = URI.parse(str)
path accesses the path component of the URL:
uri.path # => "/events/3122671255551936/"
Making it easy to grab the value:
uri.path.split('/')[2] # => "3122671255551936"
Now, imagine if that URL had a scheme and host like "http://www.example.com/" prepended, as most URLs do. (Having written hundreds of spiders and scrapers, I know how easy it is to encounter such a change.) Using a naive split('/') would immediately break:
str = 'http://www.example.com/events/3122671255551936/?ref=br_rs&action_history=null'
str.split('/')[2] # => "www.example.com"
That means any solution relying on split alone would break, along with any others that try to locate the position of the value based on the entire string.
But with the tools designed for the job, the code would continue working:
uri = URI.parse(str)
uri.path.split('/')[2] # => "3122671255551936"
Notice how simple and easy to read it is, which also makes it easier to maintain. It could even be simplified to:
URI.parse(str).path.split('/')[2] # => "3122671255551936"
and continue to work.
This works because URLs and URIs follow an agreed-upon standard, which makes it possible to write a parser that takes apart, and builds, any string that conforms to the standard.
See the URI documentation for more information.
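Building on the above, here is a small sketch that looks up the segment by name rather than by position, so it keeps working if extra path segments appear in front of 'events'. The helper name event_id is an assumption for illustration, not part of URI:

```ruby
require 'uri'

# Hypothetical helper: returns the path segment that follows "events",
# or nil if the path has no such segment.
def event_id(url)
  segments = URI.parse(url).path.split('/').reject(&:empty?)
  index = segments.index('events')
  index && segments[index + 1]
end

puts event_id('/events/3122671255551936/?ref=br_rs&action_history=null')
# => 3122671255551936
puts event_id('http://www.example.com/events/3122671255551936/?ref=br_rs&action_history=null')
# => 3122671255551936
```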

Related

How to replace string in URL with captured regex pattern

I want to replace 'hoge' with 'foo' using a regex. But the user's value is dynamic, so I can't use str.gsub('hoge', 'foo').
str = '?user=hoge&tab=fuga'
What should I do?
Don't do this with a regular expression.
This is how to manipulate URIs using the existing wheels:
require 'uri'
str = 'http://example.com?user=hoge&tab=fuga'
uri = URI.parse(str)
query = URI.decode_www_form(uri.query).to_h # => {"user"=>"hoge", "tab"=>"fuga"}
query['user'] = 'foo'
uri.query = URI.encode_www_form(query)
uri.to_s # => "http://example.com?user=foo&tab=fuga"
Alternately:
require 'addressable'
uri = Addressable::URI.parse('http://example.com?tab=fuga&user=hoge')
query = uri.query_values # => {"tab"=>"fuga", "user"=>"hoge"}
query['user'] = 'foo'
uri.query_values = query
uri.to_s # => "http://example.com?tab=fuga&user=foo"
Note that in the examples the order of the parameters changed, but the code handled the difference without problems.
The reason to use URI or Addressable is that parameters and values have to be correctly encoded when they contain illegal characters. URI and Addressable know the rules and will follow them, whereas naive code assumes it's OK to not bother with encoding, causing broken URIs.
URI is part of the Ruby Standard Library, and Addressable is more full-featured. Take your pick.
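To make the encoding point concrete, here is a sketch using only the standard library. The method name with_user and the sample value 'foo & bar' are assumptions chosen to show a value that needs escaping:

```ruby
require 'uri'

# Hypothetical helper: returns the URL with its "user" parameter replaced.
def with_user(url, new_user)
  uri = URI.parse(url)
  params = URI.decode_www_form(uri.query || '').to_h
  params['user'] = new_user
  # encode_www_form escapes characters that are not safe in a query string.
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts with_user('http://example.com?user=hoge&tab=fuga', 'foo & bar')
# => http://example.com?user=foo+%26+bar&tab=fuga
```

A plain string substitution would have inserted the raw '&', which a parser would then read as the start of a new parameter.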
You can try the regex below:
([?&]user=)([^&]+)
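One way this pattern might be applied with String#sub is sketched here. The constant and method names are made up for illustration, and the new value is assumed to need no URL-encoding:

```ruby
# The pattern from above: group 1 is the "?user=" or "&user=" prefix,
# group 2 is the current value.
USER_PARAM = /([?&]user=)([^&]+)/

# Hypothetical helper: swaps the value of the user parameter.
def replace_user(str, new_value)
  # The block form avoids backslash surprises in the replacement string.
  str.sub(USER_PARAM) { "#{Regexp.last_match(1)}#{new_value}" }
end

puts replace_user('?user=hoge&tab=fuga', 'foo')
# => ?user=foo&tab=fuga
puts replace_user('?tab=fuga&user=hoge', 'foo')
# => ?tab=fuga&user=foo
```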
You probably want to find out what the user query parameter maps to before using .gsub to replace its value.
First, parse the URL string into a URI object using the URI module. Then use the query-parsing methods of the CGI module to get the key/value pairs of the query params from the URI object. Finally, you can .gsub the values in that hash.

How to escape two forward slashes in Ruby Regex [closed]

I'm trying to make a regex which finds the domain name using Ruby, so I tried this:
(?<=.*/).(?=.*/)
On Rubular I always see this error message: Forward slashes must be escaped.
How do I solve this?
When you use the // regex literal, you need to escape / using a backslash as \/. When you want literal / in your regex, it is usually simpler to avoid using the // literal. For example, use %r literal with any delimiters that would not cause conflict.
%r{/}
By the way, Ruby's Onigmo regex engine does not allow variable-length lookbehind, so your regex will raise an error anyway.
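As a sketch of the two literal styles, here is one possible pattern that captures the host after the double slash without any lookbehind. The pattern and the method names are assumptions for illustration, not the asker's intended regex:

```ruby
# With the // literal, every forward slash in the pattern must be escaped.
def domain_with_escaped_slashes(url)
  url[/\/\/([^\/]+)/, 1]
end

# With %r{}, the same pattern needs no escaping.
def domain_with_percent_r(url)
  url[%r{//([^/]+)}, 1]
end

url = 'http://www.example.com/path/to/index.html'
puts domain_with_escaped_slashes(url) # => www.example.com
puts domain_with_percent_r(url)       # => www.example.com
```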
Don't reinvent wheels, especially ones that work:
require 'uri'
URI.split('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')
# => ["http",
# "user:passwd",
# "www.example.com",
# "81",
# nil,
# "/path/to/index.html",
# nil,
# "foo=bar",
# "baz"]
Or:
require 'addressable/uri'
uri = Addressable::URI.parse('http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz')
uri.authority # => "user:passwd@www.example.com:81"
uri.fragment # => "baz"
uri.host # => "www.example.com"
uri.password # => "passwd"
uri.path # => "/path/to/index.html"
uri.port # => 81
uri.query # => "foo=bar"
uri.query_values # => {"foo"=>"bar"}
uri.scheme # => "http"
uri.to_hash # => {:scheme=>"http", :user=>"user", :password=>"passwd", :host=>"www.example.com", :port=>81, :path=>"/path/to/index.html", :query=>"foo=bar", :fragment=>"baz"}
uri.user # => "user"
Between the two, Addressable::URI is more full-featured and follows the specs very closely. Ruby's built-in URI is good for lighter-weight lifting.
Root around in their code and you'll find the regular expressions used to tear apart a URL; you'll also see that they aren't trivial because URLs can be quite "interesting", where "interesting" means you'll scream and pull out your hair. See the URI RFC for more information. See "Parsing a URI Reference with a Regular Expression" in that document for a suggested pattern.
...I'm doing an exercise from Codewars and I'm not allowed to use require
First, if so, why are you asking for help on how to write this? You are supposed to figure these things out yourself.
That said, try what has already been created. This uses the pattern in the RFC:
URI_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
uri_captures = 'http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz'.match(URI_REGEX).captures
# => ["http:",
# "http",
# "//user:passwd@www.example.com:81",
# "user:passwd@www.example.com:81",
# "/path/to/index.html",
# "?foo=bar",
# "foo=bar",
# "#baz",
# "baz"]
user, passwd, host, port = uri_captures[3].split(/[:@]/)
host # => "www.example.com"
For further convenience, here's a simple pattern to provide named captures:
URI_REGEX = %r!^((?<scheme>[^:/?#]+):)?(//(?<authority>[^/?#]*))?(?<path>[^?#]*)(\?(?<query>[^#]*))?(#(?<fragment>.*))?!
uri_captures = 'http://user:passwd@www.example.com:81/path/to/index.html?foo=bar#baz'.match(URI_REGEX)
authority_captures = uri_captures['authority'].match(/(?<user>[^:]+)?:?(?<passwd>[^@]+)?@?(?<host>.+)(:(?<port>\d+)?)/)
authority_captures['host']
# => "www.example.com"

How to parse a URL and extract the required substring

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step is obviously to check that the string contains "http://"; otherwise, it should ignore the string.
But how do I then extract just the "something" in that string? Assume that all the strings this will be evaluating have a similar structure (i.e. I am trying to extract the subdomain of the URL, if the string being examined is indeed a valid URL, where "valid" means it starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can just simply split the string at the "http://" but that doesn't solve the full problem because that will produce "http://something.example.com/directory/". All I want is the "something", nothing else.
I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first
# => "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.
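The question also asks to ignore strings that are not "http://" URLs. A minimal sketch of that check combined with the approach above, where subdomain_of is a made-up name and "first label of the host" is assumed to be what counts as the subdomain:

```ruby
require 'uri'

# Hypothetical helper: returns the first host label of an http:// URL,
# or nil for anything else (other schemes, relative strings, invalid input).
def subdomain_of(str)
  uri = URI.parse(str)
  return nil unless uri.scheme == 'http' && uri.host

  uri.host.split('.').first
rescue URI::InvalidURIError
  nil
end

puts subdomain_of('http://something.example.com/directory/') # => something
p subdomain_of('ftp://something.example.com/directory/')     # => nil
p subdomain_of('not a url')                                  # => nil
```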
You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is the domainatrix gem, mentioned in "Remove subdomain from string in ruby":
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.
Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, the first group of non '.' letters after http.
Check out http://rubular.com/. You can test your regular expressions against a set of tests too, it's great for learning this tool.
With URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # => "http"
uri.host # => "localhost"
uri.port # => 3000

regex to remove the webpage part of a url in ruby

I am trying to remove the webpage part of the URL
For example,
www.example.com/home/index.html
to
www.example.com/home
any help appreciated.
Thanks
It's probably a good idea not to use regular expressions when possible. You may summon Cthulhu. Try using the URI library that's part of the standard library instead.
require "uri"
result = URI.parse("http://www.example.com/home/index.html")
result.host # => "www.example.com"
result.path # => "/home/index.html"
# The following line is rather unorthodox - is there a better solution?
File.dirname(result.path) # => "/home"
result.host + File.dirname(result.path) # => "www.example.com/home"
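Note that the URL in the question has no scheme, in which case URI.parse reports no host and puts the whole string in the path. A sketch that tolerates both forms, with strip_page as a made-up name; it assumes the path is non-empty and ends in a page name:

```ruby
require 'uri'

# Hypothetical helper: drops the last path segment and returns host + directory.
def strip_page(url)
  uri = URI.parse(url)
  # Without a scheme, uri.host is nil and uri.path holds the entire string,
  # so interpolating the (possibly nil) host covers both cases.
  "#{uri.host}#{File.dirname(uri.path)}"
end

puts strip_page('http://www.example.com/home/index.html') # => www.example.com/home
puts strip_page('www.example.com/home/index.html')        # => www.example.com/home
```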
If your heart is set on using regex and you know that your URLs will be pretty straightforward, you could use (.*)/.* to capture everything before the last / in your URL.
irb(main):007:0> url = "www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):008:0> regex = "(.*)/.*"
=> "(.*)/.*"
irb(main):009:0> url =~ /#{regex}/
=> 0
irb(main):010:0> $1
=> "www.example.com/home"
irb(main):001:0> url="www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):002:0> url.split("/")[0..-2].join("/")
=> "www.example.com/home"

Ruby: How can I have a Hash take multiple keys?

I'm taking 5 strings (protocol, source IP and port, destination IP and port) and using them to store some values in a hash. The problem is that if the IPs or ports are switched between source and destination, the key is supposed to be the same.
If I were doing this in C#/Java/whatever I'd have to create a new class and override the hashcode()/equals() methods, but that seems error-prone from the little I've read about it, and I was wondering if there would be a better alternative here.
I am directly copying a paragraph from Programming Ruby 1.9:
Hash keys must respond to the message hash by returning a hash code, and the hash code for a given key must not change. The keys used in hashes must also be comparable using eql?. If eql? returns true for two keys, then those keys must also have the same hash code. This means that certain classes (such as Array and Hash) can't conveniently be used as keys, because their hash values can change based on their contents.
So you might generate your hash as something like ["#{source_ip} #{source_port}", "#{dest_ip} #{dest_port}", protocol.to_s].sort.join.hash such that the result will be identical when the source and destination are switched.
For example:
source_ip = "1.2.3.4"
source_port = 1234
dest_ip = "5.6.7.8"
dest_port = 5678
protocol = "http"
def make_hash(s_ip, s_port, d_ip, d_port, proto)
  ["#{s_ip} #{s_port}", "#{d_ip} #{d_port}", proto.to_s].sort.join.hash
end
puts make_hash(source_ip, source_port, dest_ip, dest_port, protocol)
puts make_hash(dest_ip, dest_port, source_ip, source_port, protocol)
This will output the same hash even though the arguments are in a different order between the two calls. Correctly encapsulating this functionality into a class is left as an exercise to the reader.
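One possible shape for such a class is sketched below. The class name FlowKey and its interface are assumptions for illustration; the idea is the same as above: sort the two endpoints so direction no longer matters, then define eql? and hash consistently.

```ruby
# Hypothetical key class: two flows compare equal regardless of which
# endpoint is the source and which is the destination.
class FlowKey
  attr_reader :protocol, :endpoints

  def initialize(protocol, src_ip, src_port, dst_ip, dst_port)
    @protocol = protocol.to_s
    # Sorting the [ip, port] pairs normalizes the direction.
    @endpoints = [[src_ip.to_s, src_port.to_i], [dst_ip.to_s, dst_port.to_i]].sort
  end

  # Hash lookups use eql? and hash, so both must agree.
  def eql?(other)
    other.is_a?(FlowKey) && protocol == other.protocol && endpoints == other.endpoints
  end
  alias_method :==, :eql?

  def hash
    [protocol, endpoints].hash
  end
end

stats = {}
stats[FlowKey.new('http', '1.2.3.4', 1234, '5.6.7.8', 5678)] = 'some value'
puts stats[FlowKey.new('http', '5.6.7.8', 5678, '1.2.3.4', 1234)]
# => some value
```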
I think this is what you mean...
irb(main):001:0> traffic = []
=> []
irb(main):002:0> traffic << {:src_ip => "10.0.0.1", :src_port => "9999", :dst_ip => "172.16.1.1", :dst_port => 80, :protocol => "tcp"}
=> [{:protocol=>"tcp", :src_ip=>"10.0.0.1", :src_port=>"9999", :dst_ip=>"172.16.1.1", :dst_port=>80}]
irb(main):003:0> traffic << {:src_ip => "10.0.0.2", :src_port => "9999", :dst_ip => "172.16.1.1", :dst_port => 80, :protocol => "tcp"}
=> [{:protocol=>"tcp", :src_ip=>"10.0.0.1", :src_port=>"9999", :dst_ip=>"172.16.1.1", :dst_port=>80}, {:protocol=>"tcp", :src_ip=>"10.0.0.2", :src_port=>"9999", :dst_ip=>"172.16.1.1", :dst_port=>80}]
The next, somewhat related, question is how to store the IP. You probably want to use the IPAddr object instead of just a string so you can sort the results more easily.
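A short sketch of why IPAddr helps with sorting: strings compare character by character, while IPAddr objects compare numerically. The method name sort_ips is made up for illustration.

```ruby
require 'ipaddr'

# Hypothetical helper: sorts dotted-quad strings by numeric address.
def sort_ips(addresses)
  addresses.map { |address| IPAddr.new(address) }.sort.map(&:to_s)
end

ips = %w[10.0.0.10 172.16.1.1 10.0.0.9]
p ips.sort      # => ["10.0.0.10", "10.0.0.9", "172.16.1.1"]
p sort_ips(ips) # => ["10.0.0.9", "10.0.0.10", "172.16.1.1"]
```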
You can use the following code:
def create_hash(prot, s_ip, s_port, d_ip, d_port, value, x = nil)
  if x
    x[prot] = {s_ip => {s_port => {d_ip => {d_port => value}}}}
  else
    {prot => {s_ip => {s_port => {d_ip => {d_port => value}}}}}
  end
end
# Create a value
h = create_hash('www', '1.2.4.5', '4322', '4.5.6.7', '80', "Some WWW value")
# Add another value
create_hash('https', '1.2.4.5', '4562', '4.5.6.7', '443', "Some HTTPS value", h)
# Retrieve the values
puts h['www']['1.2.4.5']['4322']['4.5.6.7']['80']
puts h['https']['1.2.4.5']['4562']['4.5.6.7']['443']
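Note that create_hash replaces everything already stored under a protocol when it is called again for that protocol. An alternative sketch uses a Hash whose default block creates nested hashes on demand, so entries accumulate instead. The name nested_hash is an assumption for illustration.

```ruby
# Hypothetical helper: a hash that creates missing nested levels on first access.
def nested_hash
  Hash.new { |hash, key| hash[key] = Hash.new(&hash.default_proc) }
end

flows = nested_hash
flows['www']['1.2.4.5']['4322']['4.5.6.7']['80'] = 'Some WWW value'
flows['www']['1.2.4.5']['4322']['4.5.6.7']['8080'] = 'Another WWW value'
flows['https']['1.2.4.5']['4562']['4.5.6.7']['443'] = 'Some HTTPS value'

puts flows['www']['1.2.4.5']['4322']['4.5.6.7']['80']    # => Some WWW value
puts flows['www']['1.2.4.5']['4322']['4.5.6.7']['8080']  # => Another WWW value
puts flows['https']['1.2.4.5']['4562']['4.5.6.7']['443'] # => Some HTTPS value
```

The trade-off is that merely reading a missing key also creates empty nested hashes, so use key? or dig when only checking for presence.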
