Extract URLs from text using Ruby while handling matched parens - ruby

URI.extract claims to do this, but it doesn't handle matched parens:
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
What's the best way to extract URLs from text without breaking parenthesized URLs (which users like to use)?

If the URLs are always enclosed in parentheses, a regular expression might be a better solution:
text = "text here (http://foo.example.org/bla) and here and here is (http://yet.another.url/with/parens) and some more text"
text.scan /\(([^\)]*)\)/

Before you can call URI.extract like this:
>> URI.extract("text here (http://foo.example.org/bla) and here")
=> ["http://foo.example.org/bla)"]
you need to load the library:
require 'uri'

You could use this regexp to extract URLs from a string:
"some thing http://abcd.com/ and http://google.com are great".scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
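A minimal sketch that combines the ideas above: scan for candidate URLs with a loose regexp, then drop a trailing ")" only when the URL contains more closing than opening parentheses. The method name extract_urls and the deliberately simple pattern are assumptions for illustration, not part of URI or any library.

```ruby
# Hypothetical helper: find http(s) URLs and trim unbalanced trailing parens.
def extract_urls(text)
  # Loose candidate match: scheme followed by anything that is not
  # whitespace, an angle bracket, or a double quote.
  text.scan(%r{https?://[^\s<>"]+}).map do |url|
    # Sentence punctuation at the very end is rarely part of the URL.
    url = url.sub(/[.,;:!?]+\z/, "")
    # Remove trailing ")" while closers outnumber openers.
    url = url.chomp(")") while url.end_with?(")") && url.count(")") > url.count("(")
    url
  end
end

extract_urls("text here (http://foo.example.org/bla) and here")
# => ["http://foo.example.org/bla"]
extract_urls("see http://en.wikipedia.org/wiki/Ruby_(programming_language) now")
# => ["http://en.wikipedia.org/wiki/Ruby_(programming_language)"]
```

Counting parentheses keeps URLs that legitimately end in a balanced ")" intact, which a blanket "strip the last paren" rule would break.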

Related

remove `\"` from string rails 4

I have params like:
params[:id]= "\"ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6\""
And I want to get the expected result below:
"ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6"
How can I do this?
You can use gsub:
"\"ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6\"".gsub("\"", "")
=> "ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6"
Or, as @Stefan mentioned, delete:
"\"ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6\"".delete("\"")
=> "ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6"
If this is JSON data, which it could very well be in that format:
JSON.load(params[:id])
This also handles escaped characters inside the string, as well as parameters that turn out to be an array.
Just use tr!:
params[:id].tr!("\"","")
tr! modifies the original string in place.
If you do not want to change the original string, use tr instead:
params[:id].tr("\"","")
Thanks Ilya
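The approaches above differ in one respect: delete and tr remove every double quote, while JSON parsing only unwraps a properly quoted value. A small sketch that tries the JSON route first and falls back to delete could look like this. The method name unquote is hypothetical, and it assumes the value is a String.

```ruby
require 'json'

# Hypothetical helper: unwrap a JSON-quoted string, otherwise just drop quotes.
def unquote(value)
  parsed = JSON.parse(value)
  # Only accept the parsed result when it is a plain string;
  # numbers, arrays, and hashes are returned unchanged as text.
  parsed.is_a?(String) ? parsed : value
rescue JSON::ParserError
  # Not valid JSON (for example an unquoted id): remove stray quotes.
  value.delete('"')
end

unquote("\"ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6\"")
# => "ebfd11a9-3aa4-415a-ba72-1b6796ea1bf6"
```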

Regular Expression find usage of word after "/" in URL

I am trying to parse through URLs using Ruby and return the URLs that match a word after the "/" in .com , .org , etc.
If I am trying to capture "questions" in a URL such as
https://stackoverflow.com/questions I also want to be able to capture https://stackoverflow.com/blah/questions. But I do not want to capture https://stackoverflow.com/queStioNs.
Currently my expression can match https://stackoverflow.com/questions but cannot match with "questions" after another "/", or 2 "/"s, etc.
The end of my regular expression is using \bquestions\.
I tried doing ([a-zA-Z]+\W{1}+\bjob\b|\bjob\b) but this only gets me URLs with /questions and /blah/questions but not /blah/bleh/questions.
What am I doing wrong and how do I match what I need?
You don't actually need a regex for this, you can instead use the URI module:
require 'uri'
urls = ['https://stackoverflow.com/blah/questions', 'https://stackoverflow.com/queStioNs']
urls.each do |url|
  the_path = URI(url).path
  puts the_path if the_path.include?('questions')
end
I don't know whether there is a simpler way, but here is my solution:
regexp = '^(https|http)?:\/\/[\w]+\.(com|org|edu)(\/{1}[a-z]+)*$'
group_length = "https://stackoverflow.com/blah/questions".match(regexp).length
"https://stackoverflow.com/blah/questions".match(regexp)[group_length - 1].gsub("/","")
It will return 'questions'.
Update as per your comments below:
Use [\S]*(\/questions){1}$
Hope it helps :)
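One caveat with include? on the whole path is that it also matches substrings such as /myquestions. A sketch that compares whole path segments instead, at any depth and case-sensitively, might look like this. The method name path_has_segment? is made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: true when one path segment equals the given word exactly.
def path_has_segment?(url, segment)
  # "/blah/questions" splits into ["", "blah", "questions"].
  URI(url).path.split('/').include?(segment)
end

urls = [
  'https://stackoverflow.com/questions',
  'https://stackoverflow.com/blah/bleh/questions',
  'https://stackoverflow.com/queStioNs'
]
matching = urls.select { |url| path_has_segment?(url, 'questions') }
# matching holds the first two URLs only
```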

Converting Jsonp to Json in different methods

I have been trying to use JSONP data as JSON in a Ruby project.
From your experience, how did you address this?
JSONP is easy to handle. It's just JSON in a minor wrapper, and that wrapper is easy to strip off:
require 'open-uri'
require 'json'
URL = 'http://www.google.com/dictionary/json?callback=a&sl=en&tl=en&q=epitome'
jsonp = URI.open(URL).read # Kernel#open no longer opens URLs as of Ruby 3.0
jsonp now contains the result in JSONP format:
jsonp[0, 3] # => "a({"
jsonp[-11 ... -1] # => "},200,null"
Those extraneous parts, a( and ,200,null), are the trouble spots when passing the data to JSON for parsing, so we strip them.
A simple, greedy, regex is all that's needed. /{.+}/ will find everything wrapped by the outermost curly-braces and return it, which is all the JSON needs:
data = JSON.parse(jsonp[/{.+}/])
data['query'] # => "epitome"
data['primaries'].size # => 1
From my experience, one way is to use this regex to filter out the function callback name:
/(\{.*\})/m
Or the lazy way would be to find the index of the first "(" and take the substring from there up to the last character, which should be a ")".
I had been looking for answers on here and didn't find a solid one, so I hope this helps.
Cheers
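Both suggestions above can be sketched without a network call. The example below strips the callback wrapper with an anchored regexp and parses the arguments as a JSON array, so extra trailing arguments such as ,200,null do not break parsing. The method name parse_jsonp and the sample strings are assumptions made up for illustration.

```ruby
require 'json'

# Hypothetical helper: return the first argument of a JSONP callback call.
def parse_jsonp(jsonp)
  # callback name, "(", arguments, ")", optional ";" and whitespace
  match = jsonp.match(/\A\s*[\w$.]+\s*\((.*)\)\s*;?\s*\z/m)
  raise ArgumentError, 'not a JSONP response' unless match
  # Wrap the arguments in brackets so several arguments parse as one array.
  JSON.parse("[#{match[1]}]").first
end

parse_jsonp('a({"query":"epitome"},200,null)')['query'] # => "epitome"
parse_jsonp('callback({"ok":true});')                   # => {"ok"=>true}
```

Anchoring on the outer parentheses avoids relying on the payload being an object, so a JSONP response that wraps an array would parse the same way.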

How do I get just the sitename from url in ruby?

I have a url such as:
http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness
And would like to extract just relevantmagazine from it.
Currently I have:
@urlroot = URI.parse(@link.url).host
But it returns www.relevantmagazine.com. Can anyone help me?
Using a gem for this might be overkill, but anyway: there's a handy gem called domainatrix that can extract the site name for you while dealing with things like two-element top-level domains and more.
url = Domainatrix.parse("http://www.pauldix.net")
url.url # => "http://www.pauldix.net" (the original url)
url.public_suffix # => "net"
url.domain # => "pauldix"
url.canonical # => "net.pauldix"
url = Domainatrix.parse("http://foo.bar.pauldix.co.uk/asdf.html?q=arg")
url.public_suffix # => "co.uk"
url.domain # => "pauldix"
url.subdomain # => "foo.bar"
url.path # => "/asdf.html?q=arg"
url.canonical # => "uk.co.pauldix.bar.foo/asdf.html?q=arg"
How about:
@urlroot = URI.parse(@link.url).host.gsub("www.", "").split(".")[0]
Try this regular expression:
regex = %r{http://[w]*[\.]*[^/|$]*}
If you had the following url strings, it gives the following:
url = 'http://www.google.com/?q=blah'
url.scan(regex) # => ["http://www.google.com"]
url = 'http://google.com/?q=blah'
url.scan(regex) # => ["http://google.com"]
url = 'http://google.com'
url.scan(regex) # => ["http://google.com"]
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo.bar.pauldix.co.uk"]
It's not perfect, but it will strip out everything but the prefix and the host name. You can then easily clean up the prefix with some other code knowing now you only need to look for an http:// or http://www. at the beginning of the string. Another thought is you may need to tweak the regex I gave you a little if you are also going to parse https://. I hope this helps you get started!
Edit:
I reread the question and realized my answer doesn't really do what you asked. It would help to know whether the URLs you're parsing follow a set format, such as always starting with www. If they do, you could use a regular expression that extracts everything between the first and second period in the URL. If not, perhaps you could tweak my regex so that it captures everything between the // or www. and the first period. That might be the easiest way to get just the site name without the www. or the .com, .co.uk, and such.
Revised regex:
regex = %r{http://[w]*[\.]*[^\.]*}
url = 'http://foo.bar.pauldix.co.uk/asdf.html?q=arg'
url.scan(regex) # => ["http://foo"]
The results can be odd. If you go the regex route, you'll probably have to clean up the URL in several steps to extract the part you want.
Maybe you can just split it?
URI.parse(@link.url).host.split('.')[1]
Keep in mind that some registered domains may have more than one component to the registered country domain, like .co.uk or .co.jp or .com.au for example.
I found the answer, inspired by tadman's answer and the answer to another question:
@urlroot = URI.parse(item.url).host
@urlroot = @urlroot.start_with?('www.') ? @urlroot[4..-1] : @urlroot
@urlroot = @urlroot.split('.')[0]
The first line gets the host, the second line removes the www. if there is one, and the third line gets everything before the next dot.
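Taking the first label after www. still returns "foo" for a host like foo.bar.pauldix.co.uk. A sketch without a gem can count labels from the right instead, given a list of multi-part suffixes. The constant MULTI_PART_SUFFIXES and the method name site_name are hypothetical, and the list here is deliberately incomplete; a real solution would consult the full public suffix list, which is the kind of problem gems such as domainatrix address.

```ruby
require 'uri'

# Hypothetical, incomplete sample of two-label public suffixes.
MULTI_PART_SUFFIXES = %w[co.uk co.jp com.au].freeze

# Hypothetical helper: the label immediately left of the public suffix.
def site_name(url)
  host = URI.parse(url).host
  return nil if host.nil?

  labels = host.downcase.split('.')
  # The suffix is two labels long when it appears in the list, else one.
  suffix_size = MULTI_PART_SUFFIXES.include?(labels.last(2).join('.')) ? 2 : 1
  labels[-(suffix_size + 1)]
end

site_name('http://www.relevantmagazine.com/life/relationship/blog/23317-pursuing-singleness')
# => "relevantmagazine"
site_name('http://foo.bar.pauldix.co.uk/asdf.html?q=arg')
# => "pauldix"
```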

Ruby: replace a given URL in an HTML string

In Ruby, I want to replace a given URL in an HTML string.
Here is my unsuccessful attempt:
escaped_url = url.gsub(/\//,"\/").gsub(/\./,"\.").gsub(/\?/,"\?")
path_regexp = Regexp.new(escaped_url)
html.gsub!(path_regexp, new_url)
Note: url is actually a Google Chart request URL I wrote, which will not have more special characters than /?|.=%:
The gsub method can take a string or a Regexp as its first argument, same goes for gsub!. For example:
>> 'here is some ..text.. xxtextxx'.gsub('..text..', 'pancakes')
=> "here is some pancakes xxtextxx"
So you don't need to bother with a regex or escaping at all, just do a straight string replacement:
html.gsub!(url, new_url)
Or better, use an HTML parser to find the particular node you're looking for and do a simple attribute assignment.
I think you're looking for something like:
path_regexp = Regexp.new(Regexp.escape(url))
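To see that the two answers above agree, here is a short sketch with made-up values for html, url, and new_url. The block form of gsub is used so that the replacement is inserted literally, even if new_url happens to contain a backslash sequence such as \1. The method name replace_url is hypothetical.

```ruby
# Hypothetical helper: literal replacement of every occurrence of url.
def replace_url(html, url, new_url)
  # A String pattern is matched literally; the block returns the
  # replacement as-is, with no back-reference interpretation.
  html.gsub(url) { new_url }
end

# Illustrative values only.
url     = 'http://chart.apis.google.com/chart?cht=p3&chd=t:60,40'
html    = %(<img src="#{url}" alt="chart">)
new_url = '/images/chart.png'

replace_url(html, url, new_url)
# => "<img src=\"/images/chart.png\" alt=\"chart\">"

# The Regexp.escape route gives the same result.
html.gsub(Regexp.new(Regexp.escape(url))) { new_url }
# => "<img src=\"/images/chart.png\" alt=\"chart\">"
```

This assumes the URL appears in the HTML exactly as written; if the document stores it with & encoded as &amp;, the search string would need the same encoding.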
