Add space between nodes with Nokogiri - ruby

I have a string of HTML from which I want to strip all the HTML tags. The problem is that the plain text of each node is squished together, and I need to add some whitespace between nodes.
Nokogiri::HTML("<p>Hello</p><p>There</p>").text
Gives => HelloThere
I want => Hello There
Can I tell Nokogiri to behave like this somehow?

You can do
doc = Nokogiri::HTML("<p>Hello</p><p>There</p>")
doc.xpath('//text()').to_a.join(" ")

Nokogiri::HTML("<p>Hello</p><p>There</p>").xpath("//*[not(child::*)]").map(&:text).join(' ')
# => "Hello There"

EDIT: I tried to do it on my own but ended up using a solution that looks a lot like Uri Agassi's :)
irb(main):040:0> Nokogiri::HTML("<p>Hello</p><p>There</p>").xpath("//text()").map(&:text).join(" ")
=> "Hello There"

Related

Remove only anchor tag from string

In controller:
str= "Employee <b><a href=http://xyz.localhost.in:3000/admin/company>Uday Das</a></b> has applied for leave."
I want to remove only the anchor tag from the above string, giving: Employee <b>Uday Das</b> has applied for leave.
I used this code:
ActionView::Base.full_sanitizer.sanitize(str)
But it removes all the HTML tags from the string, so I get: Employee Uday Das has applied for leave.
NOTE: The strings are dynamic and the anchor tag's position is not fixed; it could be anywhere in the string.
You can use the nokogiri gem.
Something like:
require 'nokogiri'
doc = Nokogiri::HTML str
node = doc.at("a")
node.replace(node.text)
puts doc.inner_html
# <html><body><p>Employee <b>Uday Das</b> has applied for leave.</p></body></html>
or to match your exact output:
puts doc.at("p").inner_html
# Employee <b>Uday Das</b> has applied for leave.
I got a simple solution:
include ActionView::Helpers::SanitizeHelper
sanitize(str, :tags=>["b"])
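For links, you can use the strip_links method from ActionView::Helpers::SanitizeHelper:
strip_links('<a href="http://www.rubyonrails.org">Ruby on Rails</a>')
# => Ruby on Rails
strip_links('Please e-mail me at <a href="mailto:me@email.com">me@email.com</a>.')
# => Please e-mail me at me@email.com.
strip_links('Blog: <a href="http://www.myblog.com/" class="nav" target=\"_blank\">Visit</a>.')
# => Blog: Visit.
strip_links('<<a href="https://example.org">malformed & link</a>')
# => &lt;malformed &amp; link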

Ruby regex help to replace substring

I need to replace field_to_replace from
...<div>\r\n<span field=\"field_to_replace\">\r\n<div>....
There are multiple occurrences of field_to_replace in the string. I need to replace only this occurrence using the tag before and after it.
Don't use regular expressions to try to search or replace inside HTML or XML unless you are guaranteed that the source layout won't change. It's really easy to use a parser to make the changes, and they'll easily handle changes to the source.
This would replace all occurrences of the string in the HTML:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field='field_to_replace'><div>")
doc.to_html # => "<div><span field=\"field_to_replace\"><div></div></span></div>"
doc.search('div span[@field]').each do |span|
  span['field'] = 'foo'
end
doc.to_html # => "<div><span field=\"foo\"><div></div></span></div>"
If you want to replace just the first occurrence, use at instead of search:
doc = Nokogiri::HTML::DocumentFragment.parse("<div><span field=\"field_to_replace\"><div><span field='field_to_replace'></span></div></span></div>")
doc.to_html # => "<div><span field=\"field_to_replace\"><div><span field=\"field_to_replace\"></span></div></span></div>"
doc.at('div span[@field]')['field'] = 'foo'
doc.to_html # => "<div><span field=\"foo\"><div><span field=\"field_to_replace\"></span></div></span></div>"
By defining the CSS selector you can identify the node quickly and easily. And, if you need even more power then XPath can be used instead of CSS.
The simple way, which replaces every occurrence, would be:
str = "...<div>\r\n<span field=\"field_to_replace\">\r\n<span field=\"field_to_replace\">\r\n<div>...."
str.gsub("field_to_replace", "new_field")
Let us know if you need something more complex.
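If only the occurrence between the specific tags should change, and the surrounding markup is guaranteed not to vary, a regex with lookarounds is one option. This is a minimal sketch, not a replacement for a parser: the input string and the name new_field are hypothetical, modelled on the question.

```ruby
# Hypothetical input: two identical attribute values, but only the one
# wrapped by "<div>\r\n" ... "\r\n<div>" should change.
str = "<p><span field=\"field_to_replace\"></p>" \
      "<div>\r\n<span field=\"field_to_replace\">\r\n<div>"

# The lookbehind and lookahead pin the match to the surrounding tags
# without including them in the text that gets replaced.
pattern = /(?<=<div>\r\n<span field=")field_to_replace(?=">\r\n<div>)/

# String#sub replaces only the first match of the pattern.
result = str.sub(pattern, "new_field")
```

The first span keeps its original value because it is not preceded by `<div>\r\n`. If the whitespace or attribute quoting in the real input varies, this pattern will silently stop matching, which is the fragility the parser-based answer warns about.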

Capture string between specific characters

Can someone help me extract the string:
Advice about something
from below:
<TITLE>Advice about something</TITLE>
The expression should be able to capture the string between <TITLE> and </TITLE>. I tried expressions such as [^TITLE<g\/], but couldn't get the right output.
If you want a robust solution rather than a temporary hack, then use a proper parser.
require "cgi"
require "nokogiri"
Nokogiri.parse(CGI.unescapeHTML(
"<TITLE>Advice about something</TITLE>"
))
.xpath("TITLE").text
# => "Advice about something"
Take the left part <TITLE> and the right part </TITLE> and put (.*?) in between: <TITLE>(.*?)<\/TITLE>
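In Ruby, that pattern could be applied roughly like this. It is a small sketch that assumes the input is the literal string from the question and that the title sits on a single line.

```ruby
html = "<TITLE>Advice about something</TITLE>"

# String#[] with a regex and a group index returns that capture group,
# or nil when the pattern does not match.
title = html[/<TITLE>(.*?)<\/TITLE>/, 1]
```

Here `title` is "Advice about something". The lazy `(.*?)` stops at the first closing tag, so a second title later in the string would not be swallowed.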
Depends. Is the string always entity-escaped, and therefore delimited by semicolons?
tmp = "&lt;TITLE&gt;Advice about something&lt;/TITLE&gt;"
=> "&lt;TITLE&gt;Advice about something&lt;/TITLE&gt;"
tmp.split(';')[2].gsub(/&lt/, "")
=> "Advice about something"

Getting portion of href attribute using hpricot

I think I need a combo of hpricot and regex here. I need to search for 'a' tags with an 'href' attribute that starts with 'abc/', and returns the text following that until the next forward slash '/'.
So, given:
One
Two
I need to get back:
'12345'
and
'67890'
Can anyone lend a hand? I've been struggling with this.
You don't need a regex, but you can use one. Here are two examples, one with a regex and one without, using Nokogiri, which should be compatible with Hpricot for your use, and CSS accessors:
require 'nokogiri'
html = %q[
One
Two
]
doc = Nokogiri::HTML(html)
doc.css('a[@href]').map{ |h| h['href'][/(\d+)/, 1] } # => ["12345", "67890"]
doc.css('a[@href]').map{ |h| h['href'].split('/')[2] } # => ["12345", "67890"]
or use regex:
s = 'One'
s =~ /abc\/([^\/]*)/
return $1
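The regex idea can be tried without any parser on plain href strings. The sketch below uses hypothetical href values (the real ones would come from the parsed document) and assumes the path may or may not begin with a leading slash.

```ruby
# Hypothetical href values, for illustration only.
hrefs = ["/abc/12345/details", "abc/67890/details", "/xyz/11111/details"]

# Capture everything after "abc/" up to the next "/". Hrefs that do not
# start with "abc/" yield nil and are dropped by compact.
ids = hrefs.map { |href| href[%r{\A/?abc/([^/]*)}, 1] }.compact
```

With these inputs, `ids` is `["12345", "67890"]`; the third href is skipped because it does not start with abc/.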
What about splitting the string by /?
(I don't know Hpricot, but according to the docs):
doc.search("a[@href]").map do |a|
  a["href"].split("/")[2] # index 2, because the string starts with '/'
end

Extract all urls inside a string in Ruby

I have some text content with a list of URLs contained in it.
I am trying to grab all the URLs out and put them in an array.
I have this code
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = content.scan(/^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix)
I am trying to get the end results to be:
['http://www.google.com', 'http://www.google.com/index.html']
The above code does not seem to be working correctly. Does anyone know what I am doing wrong?
Thanks
Easy:
ruby-1.9.2-p136 :006 > require 'uri'
ruby-1.9.2-p136 :006 > URI.extract(content, ['http', 'https'])
=> ["http://www.google.com", "http://www.google.com/index.html"]
A different approach, from the perfect-is-the-enemy-of-the-good school of thought:
urls = content.split(/\s+/).find_all { |u| u =~ /^https?:/ }
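The split-and-filter idea can be tightened a little by letting the standard library decide what counts as an HTTP URL. This is only a sketch: the helper name extract_http_urls is hypothetical, and it assumes URLs are separated from other text by whitespace.

```ruby
require "uri"

# Hypothetical helper: split on whitespace, then keep only the tokens
# that the standard library parses as http or https URIs.
def extract_http_urls(text)
  text.split(/\s+/).select do |token|
    begin
      URI.parse(token).is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
    rescue URI::InvalidURIError
      false # tokens that are not valid URIs are skipped
    end
  end
end

content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html"
urls = extract_http_urls(content)
```

For the question's input, `urls` is `["http://www.google.com", "http://www.google.com/index.html"]`. URLs glued to punctuation, such as a trailing comma, would still need trimming first.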
I haven't checked the syntax of your regex, but String#scan will produce an array, each of whose members is an array of the groups matched by your regex. So I'd expect the result to be:
[['http', '.google.com'], ...]
You'll need non-capturing groups /(?:stuff)/ if you want the format you've given.
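The difference is easy to see on a small input. The text and the deliberately simplified patterns below are made up for illustration and are not the regex from the question.

```ruby
text = "see http://a.example and https://b.example/page"

# With capturing groups, scan returns one array of captures per match.
with_groups = text.scan(/(https?):\/\/(\S+)/)

# With only non-capturing groups, scan returns the whole matched strings.
whole_matches = text.scan(/(?:https?):\/\/\S+/)
```

Here `with_groups` is `[["http", "a.example"], ["https", "b.example/page"]]`, while `whole_matches` is `["http://a.example", "https://b.example/page"]`.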
Edit (looking at regex): Also, your regex does look a bit wrong. You don't want the start and end anchors (^ and $), since you don't expect the matches to be at start and end of content. Secondly, if your ([0-9]{1,5})? is trying to capture a port number, I think you're missing a colon to separate the domain from the port.
Further edit, after playing: I think you want something like this:
content = "Here is the list of URLs: http://www.google.com http://www.google.com/index.html http://example.com:3000/foo"
urls = content.scan(/(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix)
# => ["http://www.google.com", "http://www.google.com/index.html", "http://example.com:3000/foo"]
... but note that it won't match pure IP-address URLs (like http://127.0.0.1), because of the [a-z]{2,5} for the TLD.
Just for your interest:
Ruby has a URI module, which has a regex implemented to do such things:
require "uri"
schemes_to_grab = ['ftp', 'http', 'https', 'mailto', 'see']
urls = []
html_string.scan(URI.regexp(schemes_to_grab)) do |*matches|
  urls << $& # $& holds the full text of the current match
end
For more information visit the Ruby Ref: URI
The most upvoted answer was causing issues with Markdown URLs for me, so I had to figure out a regex to extract URLs. Below is what I use:
URL_REGEX = /(https?:\/\/\S+?)(?:[\s)]|$)/i
content.scan(URL_REGEX).flatten
The last part here, (?:[\s)]|$), identifies the end of the URL; you can add characters there to suit your content. Right now it looks for any whitespace character, a closing parenthesis, or the end of the string.
content = "link in text [link1](http://www.example.com/test) and [link2](http://www.example.com/test2)
http://www.example.com/test3
http://www.example.com/test4"
returns ["http://www.example.com/test", "http://www.example.com/test2", "http://www.example.com/test3", "http://www.example.com/test4"].

Resources