Regex to remove the webpage part of a URL in Ruby

I am trying to remove the webpage part of a URL. For example, I want to turn
www.example.com/home/index.html
into
www.example.com/home
Any help is appreciated.
Thanks

It's probably a good idea to avoid regular expressions when a parser is available. You may summon Cthulhu. Try the URI library that ships with Ruby's standard library instead.
require "uri"
result = URI.parse("http://www.example.com/home/index.html")
result.host # => "www.example.com"
result.path # => "/home/index.html"
# The following line is rather unorthodox - is there a better solution?
File.dirname(result.path) # => "/home"
result.host + File.dirname(result.path) # => "www.example.com/home"
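The comment above asks whether there is a tidier way than calling File.dirname inline. One possible sketch wraps the same idea in a small helper. The name parent_url is hypothetical, not part of URI. It assumes the URL has at least one path segment, and it deliberately drops any port, query string and fragment. It also copes with the scheme-less string from the question, which URI.parse treats as a bare path with no host.

```ruby
require "uri"

# Hypothetical helper: returns the URL with its last path segment removed.
# Assumes the path has at least one segment; port, query and fragment are dropped.
def parent_url(url)
  uri = URI.parse(url)
  parent = File.dirname(uri.path)
  # Without a scheme, URI.parse sees the whole string as a path, so host is nil.
  uri.host ? "#{uri.scheme}://#{uri.host}#{parent}" : parent
end

parent_url("http://www.example.com/home/index.html") # => "http://www.example.com/home"
parent_url("www.example.com/home/index.html")        # => "www.example.com/home"
```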

If your heart is set on using a regex and you know that your URLs will be pretty straightforward, you could use (.*)/.* to capture everything before the last / in your URL.
irb(main):007:0> url = "www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):008:0> regex = "(.*)/.*"
=> "(.*)/.*"
irb(main):009:0> url =~ /#{regex}/
=> 0
irb(main):010:0> $1
=> "www.example.com/home"
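The same idea can be written without interpolating a string into a regex or reading the $1 global. The following is only a sketch, assuming the URL contains a path (a bare "http://host" would be cut in the wrong place): it deletes the final / and whatever follows it.

```ruby
url = "www.example.com/home/index.html"

# Remove the last "/" and everything after it; \z anchors at the end of the string.
trimmed = url.sub(%r{/[^/]*\z}, "")
trimmed # => "www.example.com/home"
```

A string with no / at all is returned unchanged, because sub leaves the receiver alone when nothing matches.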

irb(main):001:0> url="www.example.com/home/index.html"
=> "www.example.com/home/index.html"
irb(main):002:0> url.split("/")[0..-2].join("/")
=> "www.example.com/home"

Related

how to take bigger urls by using URI.parse or Domainatrix.parse but not using gsub or split?

2.5.0 :150 > url = 'https://www.online.citibank.co.in/credit-card/apply'
=> "https://www.online.citibank.co.in/credit-card/apply"
2.5.0 :151 > Domainatrix.parse(url)
=> #<Domainatrix::Url:0x00007fd7850df4a8 @scheme="https", @host="www.online.citibank.co.in", @port="", @url="https://www.online.citibank.co.in/credit-card/apply", @public_suffix="co.in", @domain="citibank", @subdomain="www.online", @path="/credit-card/apply", @localhost=false, @ip=false>
2.5.0 :152 > Domainatrix.parse(url).domain_with_public_suffix
=> "citibank.co.in"
This returns "citibank.co.in", but I need "online.citibank.co.in" without using gsub or split.
Can anyone help?
Use URI.parse.
You will get the leading www as well, but it is better to use this because you won't have to handle the various cases manually.
url = 'https://www.online.citibank.co.in/credit-card/apply'
#=> "https://www.online.citibank.co.in/credit-card/apply"
parsed_url = URI.parse(url)
#=> #<URI::HTTPS https://www.online.citibank.co.in/credit-card/apply>
parsed_url.host
#=> "www.online.citibank.co.in"
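If the leading www. is the only part that needs to go, one option that avoids both gsub and split is String#delete_prefix, available since Ruby 2.5. This is a sketch that assumes the unwanted label is always exactly "www."; the variable names are illustrative.

```ruby
require "uri"

url = "https://www.online.citibank.co.in/credit-card/apply"
host = URI.parse(url).host

# delete_prefix removes the prefix only when the string starts with it.
host_without_www = host.delete_prefix("www.")
host_without_www # => "online.citibank.co.in"
```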

How do you parse this string?

/events/3122671255551936/?ref=br_rs&action_history=null
I would just like to extract the number after '/events/' and before '/?ref=br_rs...'.
You could split it by the / character:
irb(main):003:0> "/events/3122671255551936/?ref=br_rs&action_history=null".split("/")[2]
=> "3122671255551936"
You can also use the String#scan method to grab the digits:
"/events/3122671255551936/?ref=br_rs&action_history=null".scan(/\d+/).join
# => "3122671255551936"
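Note that scan(/\d+/).join collects every run of digits in the string, so a digit in the query string would be appended to the result. A slightly stricter sketch, assuming the number always directly follows "/events/", captures only that run:

```ruby
str = "/events/3122671255551936/?ref=br_rs&action_history=null"

# String#[] with a regex and a group index returns just the captured group.
event_id = str[%r{/events/(\d+)}, 1]
event_id # => "3122671255551936"
```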
If your string is str:
x = str["/events/".size..-1].to_i
#=> 3122671255551936
If you want the string:
x.to_s
#=> "3122671255551936"
You're looking at the path from a URL. A basic split will work initially:
str = '/events/3122671255551936/?ref=br_rs&action_history=null'
str.split('/')[2] # => "3122671255551936"
There are existing tools to make this easy and that will handle encoding and decoding of special characters during processing of the URL:
require 'uri'
str = '/events/3122671255551936/?ref=br_rs&action_history=null'
scheme, userinfo, host, port, registry, path, opaque, query, fragment = URI.split(str)
scheme # => nil
userinfo # => nil
host # => nil
port # => nil
registry # => nil
path # => "/events/3122671255551936/"
opaque # => nil
query # => "ref=br_rs&action_history=null"
fragment # => nil
uri = URI.parse(str)
path accesses the path component of the URL:
uri.path # => "/events/3122671255551936/"
Making it easy to grab the value:
uri.path.split('/')[2] # => "3122671255551936"
Now, imagine if that URL had a scheme and host like "http://www.example.com/" prepended, as most URLs do. (Having written hundreds of spiders and scrapers, I know how easy it is to encounter such a change.) Using a naive split('/') would immediately break:
str = 'http://www.example.com/events/3122671255551936/?ref=br_rs&action_history=null'
str.split('/')[2] # => "www.example.com"
That means any solution relying on split alone would break, along with any others that try to locate the position of the value based on the entire string.
But using the tools designed for the job the code would continue working:
uri = URI.parse(str)
uri.path.split('/')[2] # => "3122671255551936"
Notice how simple and easy to read it is, which translates into code that is easier to maintain. It could even be simplified to:
URI.parse(str).path.split('/')[2] # => "3122671255551936"
and continue to work.
This is because URL/URI are an agreed-upon standard, making it possible to write a parser to take apart, and build, a string that conforms to the standard.
See the URI documentation for more information.
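Indexing with [2] still depends on "events" being the first path segment. As a sketch of one way to loosen that assumption, the hypothetical helper below (event_id_from is not part of URI) looks the segment up by name and returns whatever follows it:

```ruby
require "uri"

# Hypothetical helper: returns the path segment that follows "events", or nil.
def event_id_from(url)
  segments = URI.parse(url).path.split("/").reject(&:empty?)
  index = segments.index("events")
  index && segments[index + 1]
end

event_id_from("/events/3122671255551936/?ref=br_rs&action_history=null")
# => "3122671255551936"
event_id_from("http://www.example.com/events/3122671255551936/?ref=br_rs&action_history=null")
# => "3122671255551936"
```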

How to check if a URL has numbers with fileEntityId or not?

I have some urls as below:
https://example.com/file/filegetrevision.do?fileEntityId=7380079&cs=4Pzbb2jPu3EHBzv8RQHrGcPm4hZZkRC-CfH0my4dP0M.arv
https://example.com/file/filegetrevision.do?fileEntityId=&cs=2L5cx4UsMsFJgM05pPtB_Z8dRdL4CXLLcTeDhGPDBIg.arv
https://example.com/file/filegetrevision.do?fileEntityId=2555874&cs=2L5cx4UsMsFJgM05pPtB_Z8dRdL4CXLLcTeDhGPDBIg.arv
Now I need to check which URLs have a number for fileEntityId and which do not. For example, the first and third URLs have the numbers 738007 and 2555874 for fileEntityId, but the second does not. Any help in this regard?
Thanks
As much as I love regexes, there are more appropriate tools included in the standard library:
require 'uri'
require 'cgi'
url = "https://example.com/file/filegetrevision.do?fileEntityId=7380079&cs=4Pzbb2jPu3EHBzv8RQHrGcPm4hZZkRC-CfH0my4dP0M.arv"
query = URI::parse(url).query # => "fileEntityId=7380079&cs=4Pzbb2jPu3EHBzv8RQHrGcPm4hZZkRC-CfH0my4dP0M.arv"
fileEntityId = CGI::parse(query)['fileEntityId'] # => ["7380079"]
Then you can check if it is a number or not.
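The final check could look like the sketch below. It swaps in URI.decode_www_form for CGI.parse so that only uri needs to be required, and the helper name numeric_file_entity_id? is hypothetical. It treats a missing or empty fileEntityId as "no number".

```ruby
require "uri"

# Hypothetical helper: true when fileEntityId is present and consists only of digits.
def numeric_file_entity_id?(url)
  query = URI.parse(url).query.to_s
  # decode_www_form returns [key, value] pairs; assoc finds the first pair for the key.
  pair = URI.decode_www_form(query).assoc("fileEntityId")
  value = pair ? pair.last : ""
  value.match?(/\A\d+\z/)
end

numeric_file_entity_id?("https://example.com/file/filegetrevision.do?fileEntityId=2555874&cs=abc") # => true
numeric_file_entity_id?("https://example.com/file/filegetrevision.do?fileEntityId=&cs=abc")        # => false
```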
Here we use regexes to find out if a string contains "fileEntityId=" followed by one or more digits:
urls = ['https://example.com/file/filegetrevision.do?fileEntityId=7380079&cs=4Pzbb2jPu3EHBzv8RQHrGcPm4hZZkRC-CfH0my4dP0M.arv',
'https://example.com/file/filegetrevision.do?fileEntityId=&cs=2L5cx4UsMsFJgM05pPtB_Z8dRdL4CXLLcTeDhGPDBIg.arv',
'https://example.com/file/filegetrevision.do?fileEntityId=2555874&cs=2L5cx4UsMsFJgM05pPtB_Z8dRdL4CXLLcTeDhGPDBIg.arv']
urls.map {|u| !!(u =~ /fileEntityId=\d+/)} # => [true, false, true]

How to parse a URL and extract the required substring

Say I have a string like this: "http://something.example.com/directory/"
What I want to do is to parse this string, and extract the "something" from the string.
The first step is obviously to check that the string contains "http://"; otherwise, it should ignore the string.
But how do I then extract just the "something" in that string? Assume that all the strings being evaluated have a similar structure (i.e. I am trying to extract the subdomain of the URL, if the string being examined is indeed a valid URL, where valid means it starts with "http://").
Thanks.
P.S. I know how to check the first part, i.e. I can just simply split the string at the "http://" but that doesn't solve the full problem because that will produce "http://something.example.com/directory/". All I want is the "something", nothing else.
I'd do it this way:
require 'uri'
uri = URI.parse('http://something.example.com/directory/')
uri.host.split('.').first # => "something"
URI is built into Ruby. It's not the most full-featured but it's plenty capable of doing this task for most URLs. If you have IRIs then look at Addressable::URI.
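The question also asks to ignore strings that do not start with "http://". A minimal sketch combining that check with the approach above might look like this; first_host_label is a hypothetical name, and returning nil for ignored strings is an assumption about what "ignore" should mean.

```ruby
require "uri"

# Hypothetical helper: first label of the host, or nil for strings that
# do not start with "http://".
def first_host_label(url)
  return nil unless url.start_with?("http://")
  host = URI.parse(url).host
  host && host.split(".").first
end

first_host_label("http://something.example.com/directory/") # => "something"
first_host_label("ftp://something.example.com/directory/")  # => nil
```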
You could use URI like
uri = URI.parse("http://something.example.com/directory/")
puts uri.host
# "something.example.com"
and you could then just work on the host.
Or there is the domainatrix gem, mentioned in the question "Remove subdomain from string in ruby":
require 'rubygems'
require 'domainatrix'
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
and you could just take the subdomain.
Well, you can use regular expressions.
Something like /http:\/\/([^\.]+)/, that is, the first run of characters other than '.' after http://.
Check out http://rubular.com/. You can test your regular expressions against a set of test strings there, and it's great for learning this tool.
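For illustration, here is a sketch of that regex applied with String#[]. It is a slight variation, not the exact pattern above: it is anchored with \A so the string must start with "http://", and it also stops at / and : so that a host without dots is not over-matched.

```ruby
url = "http://something.example.com/directory/"

# Capture the characters after "http://" up to the first ".", "/" or ":".
subdomain = url[%r{\Ahttp://([^./:]+)}, 1]
subdomain # => "something"
```

A string that does not start with "http://" yields nil, which covers the "ignore the string" requirement in the question.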
With URI.parse you can get:
require "uri"
uri = URI.parse("http://localhost:3000")
uri.scheme # => "http"
uri.host # => "localhost"
uri.port # => 3000

ruby string splitting problem

I have this string:
"asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asdmas=asdakmsd&asmda=adasda"
I want to get the value between ACK= and the following & symbol. That value can change.
Thanks.
I want the solution in Ruby.
require "cgi"
query_string = "asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asmda=asdakmsd"
parsed_query_string = CGI.parse(query_string)
#=> { "asdasda" => ["asdaskdmasd"],
# "asmda" => ["asdasmda", "asdakmsd"],
# "ACK" => ["Success"] }
parsed_query_string["ACK"].first
#=> "Success"
If you also want to reconstruct the query string (especially together with the rest of a URL), I would recommend looking into the addressable gem.
require "addressable/uri"
# Note the leading '?'
query_string = "?asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asmda=asdakmsd"
parsed_uri = Addressable::URI.parse(query_string)
parsed_uri.query_values["ACK"]
#=> "Success"
parsed_uri.query_values = parsed_uri.query_values.merge("ACK" => "Changed")
parsed_uri.to_s
#=> "?ACK=Changed&asdasda=asdaskdmasd&asmda=asdakmsd"
# Note how the order has changed and the duplicate key has been removed due to
# Addressable's built-in normalisation.
"asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asdmas=asdakmsd&asmda=adasda"[/ACK=([^&]*)&/]
$1 # => 'Success'
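The pattern above needs a trailing & after the value, so it returns nil when ACK is the last parameter, and it would also match inside a longer key that ends in ACK. A sketch that relaxes the first point and tightens the second, using an example string made up for illustration:

```ruby
# Illustrative input where ACK is the last parameter and another key ends in "ACK".
s = "NACK=No&asdmas=asdakmsd&ACK=Success"

# ACK must start the string or follow "&"; the value runs up to the next "&" or the end.
value = s[/(?:\A|&)ACK=([^&]*)/, 1]
value # => "Success"
```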
A quick approach:
s = "asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asdmas=asdakmsd&asmda=adasda"
s.gsub(/ACK[=\w]+&/,"ACK[changedValue]&")
#=> "asdasda=asdaskdmasd&asmda=asdasmda&ACK[changedValue]&asdmas=asdakmsd&asmda=adasda"
s = "asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asdmas=asdakmsd&asmda=adasda"
m = s.match(/.*ACK=(.*?)&/)
puts m[1] # prints "Success"
And just for fun, without a regexp:
Hash[s.split("&").map{|p| p.split("=")}]["ACK"]
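The split-based one-liner would not survive a value that itself contains "=", and it does not decode percent-escapes. As a hedged alternative that still avoids regular expressions, the standard library's URI.decode_www_form parses the string into key/value pairs; the variable names here are illustrative.

```ruby
require "uri"

s = "asdasda=asdaskdmasd&asmda=asdasmda&ACK=Success&asdmas=asdakmsd&asmda=adasda"

# decode_www_form returns an array of [key, value] pairs in their original order.
pairs = URI.decode_www_form(s)
ack = pairs.assoc("ACK")&.last
ack # => "Success"
```

Because the result is an array of pairs rather than a hash, repeated keys such as asmda are all preserved.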
