Getting portion of href attribute using hpricot - ruby

I think I need a combo of Hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and return the text following that up to the next forward slash '/'.
So, given:
<a href="/abc/12345/">One</a>
<a href="/abc/67890/">Two</a>
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.

You don't need regex, but you can use it. Here are two examples, one with regex and one without, using Nokogiri, which should be a drop-in replacement for Hpricot for your purposes and supports CSS accessors:
require 'nokogiri'
html = %q[
<a href="/abc/12345/">One</a>
<a href="/abc/67890/">Two</a>
]
doc = Nokogiri::HTML(html)
doc.css('a[href]').map{ |h| h['href'][/(\d+)/, 1] }   # => ["12345", "67890"]
doc.css('a[href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]

or use regex:
s = '<a href="/abc/12345/">One</a>'
s =~ /abc\/([^\/]*)/
$1 # => "12345"

What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[@href]").each do |a|
  puts a['href'].split("/")[2]  # 2, because the string starts with '/'
end
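Putting that together, here is a rough, untested sketch of the whole thing in Hpricot, using the sample anchors from the question (element attributes are read with []):

require 'hpricot'

html = %q[
  <a href="/abc/12345/">One</a>
  <a href="/abc/67890/">Two</a>
]

doc = Hpricot(html)

# Grab the segment between "abc/" and the next "/" for every anchor with an href.
ids = (doc / "a[@href]").map { |a| a['href'][/abc\/([^\/]+)/, 1] }
ids # => ["12345", "67890"]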

Related

Remove only anchor tag from string

In controller:
str= "Employee <b><a href=http://xyz.localhost.in:3000/admin/company>Uday Das</a></b> has applied for leave."
I want to remove anchor tag from above string like Employee <b>Uday Das</b> has applied for leave.,
I used this code:
ActionView::Base.full_sanitizer.sanitize(str)
But it removes all the HTML tags from the string, so I am getting Employee Uday Das has applied for leave.
NOTE: The strings I am getting are dynamic; the anchor tag's position is not fixed and could be anywhere in the string.
You can use the Nokogiri gem.
Something like:
require 'nokogiri'
doc = Nokogiri::HTML str
node = doc.at("a")
node.replace(node.text)
puts doc.inner_html
# <html><body><p>Employee <b>Uday Das</b> has applied for leave.</p></body></html>
or to match your exact output:
puts doc.at("p").inner_html
# Employee <b>Uday Das</b> has applied for leave.
I found a simple solution:
include ActionView::Helpers::SanitizeHelper
sanitize(str, :tags => ["b"])
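As a rough illustration of what that whitelist call produces (a sketch from a Rails console; assumes ActionView is loaded, e.g. via ActionController::Base.helpers):

str = "Employee <b><a href=http://xyz.localhost.in:3000/admin/company>Uday Das</a></b> has applied for leave."

# Only <b> is allowed; the <a> tag is dropped but its text is kept.
ActionController::Base.helpers.sanitize(str, :tags => ["b"])
# => "Employee <b>Uday Das</b> has applied for leave."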
For links, you can use the strip_links method from ActionView::Helpers::SanitizeHelper:
strip_links('<a href="http://www.rubyonrails.org">Ruby on Rails</a>')
# => Ruby on Rails
strip_links('Please e-mail me at <a href="mailto:me@email.com">me@email.com</a>.')
# => Please e-mail me at me@email.com.
strip_links('Blog: <a href="http://www.myblog.com/" class="nav" target="_blank">Visit</a>.')
# => Blog: Visit.
strip_links('<<a href="https://example.org">malformed & link</a>')
# => &lt;malformed &amp; link

Ruby regex help to replace substring

I need to replace field_to_replace in
...<div>\r\n<span field=\"field_to_replace\">\r\n<div>....
There are multiple occurrences of field_to_replace in the string; I need to replace only this occurrence, using the tags before and after it to identify it.
Don't use regular expressions to search or replace inside HTML or XML unless you are guaranteed that the source layout won't change. It's easy to use a parser to make the changes, and a parser will gracefully handle changes to the source.
This would replace all occurrences of the string in the HTML:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field='field_to_replace'><div>")
doc.to_html # => "<div><span field=\"field_to_replace\"><div></div></span></div>"
doc.search('div span[field]').each do |span|
  span['field'] = 'foo'
end
doc.to_html # => "<div><span field=\"foo\"><div></div></span></div>"
If you want to replace just the first occurrence, use at instead of search:
doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field=\"field_to_replace\"><div><span field='field_to_replace'></span></div></span></div>")
doc.to_html # => "<div><span field=\"field_to_replace\"><div><span field=\"field_to_replace\"></span></div></span></div>"
doc.at('div span[field]')['field'] = 'foo'
doc.to_html # => "<div><span field=\"foo\"><div><span field=\"field_to_replace\"></span></div></span></div>"
By defining a CSS selector you can identify the node quickly and easily. And, if you need even more power, XPath can be used instead of CSS.
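For example, a minimal, untested sketch of the same first-occurrence change with an XPath selector instead of CSS:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field='field_to_replace'><div>")

# A relative ".//" path keeps the XPath scoped to the fragment.
doc.at_xpath(".//div/span[@field]")['field'] = 'foo'
doc.to_html # => "<div><span field=\"foo\"><div></div></span></div>"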
The simple way would be:
str = "...<div>\r\n<span field=\"field_to_replace\">\r\n<span field=\"field_to_replace\">\r\n<div>...."
str.split("field_to_replace").join("new_field")
Let us know if you need something more complex.

Add space between nodes with Nokogiri

I have a string of HTML from which I want to strip all the HTML tags. The problem is that the plain text of adjacent nodes gets squished together, and I need some whitespace between each node.
Nokogiri::HTML("<p>Hello</p><p>There</p>").text
Gives => HelloThere
I want => Hello There
Can I tell Nokogiri to behave like this somehow?
You can do
doc = Nokogiri::HTML("<p>Hello</p><p>There</p>")
doc.xpath('//text()').to_a.join(" ")
Nokogiri::HTML("<p>Hello</p><p>There</p>").xpath("//*[not(child::*)]").map(&:text).join(' ')
# => "Hello There"
EDIT: I tried to do it on my own but ended up using a solution which looks a bit like Uri Agassi's :)
irb(main):040:0> Nokogiri::HTML("<p>Hello</p><p>There</p>").xpath("//text()").map(&:text).join(" ")
=> "Hello There"

What is a regex to check to see if some text contains only URLs?

I'm trying to make a regular expression that checks whether some text contains only URLs and whitespace and nothing else, so:
http://www.google.com http://www.stackoverflow.com
would match, but:
http://www.google.com and http://www.stackoverflow.com
would not match.
Is this possible?
You can use this regex (it only tests that every whitespace-separated chunk begins with http:// or https://):
/^(?:https?:\/\/\S++\s*+)++$/ =~ text
Ruby already has a method to extract URLs, so that's a great starting place, rather than reinventing a working wheel:
require 'uri'
[
'http://www.google.com http://www.stackoverflow.com',
'http://www.google.com and http://www.stackoverflow.com'
].each do |url|
print url
if url.split.all? { |u| !URI.extract(u).empty? }
puts " contains only URLs"
else
puts " doesn't contain only URLs"
end
end
Which, after running, is:
http://www.google.com http://www.stackoverflow.com contains only URLs
http://www.google.com and http://www.stackoverflow.com doesn't contain only URLs
This doesn't support all the recognized URL schemes, but it is a starting point. You can specify which ones you want by passing an array of schemes to URI.extract. You can get the IANA's permanent list using:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.iana.org/assignments/uri-schemes.html'))
schemes = doc.at('table table').search('tr').map{ |tr| tr.at('td').text }[1..-1]
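Once you have that schemes array, you can hand it to URI.extract in the check above; a quick sketch (the ftp URL is just an illustration):

URI.extract('grab ftp://ftp.example.com/pub/file and http://example.com', schemes)
# => ["ftp://ftp.example.com/pub/file", "http://example.com"]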
words.split.all? { |word| word.match(/^http:/) }
This will match when the string consists only of URLs separated by single whitespace characters:
(((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)\s){1,}((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)$
Reference:
http://www.regular-expressions.info/reference.html
http://regexlib.com/Search.aspx?k=URL&AspxAutoDetectCookieSupport=1
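A quick sketch of trying that pattern from Ruby; %r{} avoids having to escape the forward slashes:

only_urls = %r{(((mailto:|(news|(ht|f)tp(s?))://){1}\S+)\s){1,}((mailto:|(news|(ht|f)tp(s?))://){1}\S+)$}

only_urls.match?('http://www.google.com http://www.stackoverflow.com')     # => true
only_urls.match?('http://www.google.com and http://www.stackoverflow.com') # => false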
If you really want to use regex, please try this:
(?<protocol>\w+):\/\/(?<domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*
Split the string on whitespace, and check whether each piece matches the regex above.
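A minimal sketch of that split-and-check approach with the pattern above:

url_re = %r{(?<protocol>\w+)://(?<domain>[\w@][\w.:@]+)/?[\w.?=%&=\-@/$,]*}

['http://www.google.com http://www.stackoverflow.com',
 'http://www.google.com and http://www.stackoverflow.com'].each do |text|
  puts text.split.all? { |token| token.match?(url_re) }
end
# true
# false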
Hope it helps!

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :007 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
Just for your interest:
Ruby has a URI module, which has a regexp built in for exactly this kind of thing:
require "uri"
uris_you_want_to_grap = ['ftp','http','https','ftp','mailto','see']
html_string.scan(URI.regexp(uris_you_want_to_grap)) do |*matches|
urls << $&
end
For more information, see the Ruby reference for URI.
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here, (?:[\s)]|$), is used to identify the end of the URL; you can add characters there as needed for your content. Right now it looks for any whitespace character, a closing parenthesis, or the end of the string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].
