How do I select IDs using XPath in Nokogiri?

Using this code:
doc = Nokogiri::HTML(open("text.html"))
doc.xpath("//span[@id='startsWith_']").remove
I would like to select every span whose id starts with 'startsWith_' and remove it. I tried searching, but failed.

Here's an example:
require 'nokogiri'
html = '
<html>
<body>
<span id="doesnt_start_with">foo</span>
<span id="startsWith_bar">bar</span>
</body>
</html>'
doc = Nokogiri::HTML(html)
p doc.search('//span[starts-with(@id, "startsWith_")]').to_xml
That's how to select them.
doc.search('//span[starts-with(@id, "startsWith_")]').each do |n|
n.remove
end
That's how to remove them.
p doc.to_xml
# >> "<span id=\"startsWith_bar\">bar</span>"
# >> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n <span id=\"doesnt_start_with\">foo</span>\n \n</body></html>\n"
The page "XPath, XQuery, and XSLT Functions" has a list of the available functions.

Try this xpath expression:
//span[starts-with(@id, 'startsWith_')]

Related

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

How to replace outer tags using Nokogiri

Using Nokogiri, I'm trying to replace the outer tags of an HTML node where the most reliable way to detect it is through one of its children.
Before:
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
After:
<blockquote>
Words of wisdom
</blockquote>
The following code snippet detects the element I'm after, but I'm not sure how to go on from there:
doc = Nokogiri::HTML(html)
if doc.at('div.smallfont:contains("Quote:")') != nil
q = doc.parent
# replace tags of q
# remove first_sibling
end
Does it work ok?
doc = Nokogiri::HTML(html)
if quote = doc.at('div.smallfont:contains("Quote:")')
text = quote.next # gets the ' Words of wisdom'
quote.remove # removes div.smallfont
puts text.parent.replace("<blockquote>#{text}</blockquote>") # replaces wrapping div with blockquote block
end
I'd do it like this:
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
smallfont_div = doc.at('.smallfont')
smallfont_div.parent.name = 'blockquote'
smallfont_div.remove
puts doc.to_html
__END__
<div>
<div class="smallfont" >Quote:</div>
Words of wisdom
</div>
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>
Words of wisdom
</blockquote>
</body></html>
The whitespace inside <blockquote> will be gobbled up by the browser when it's displayed, so it's usually not an issue, but some browsers will still show a leading space and/or trailing space.
If you want to clean up the text node containing "Words of wisdom" then I'd do this instead:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_div.remove
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent.text.strip
Which results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<blockquote>Words of wisdom</blockquote>
</body></html>
Alternately, this will generate the same result:
smallfont_div = doc.at('.smallfont')
smallfont_parent = smallfont_div.parent
smallfont_parent_content = smallfont_div.next_sibling.text
smallfont_parent.name = 'blockquote'
smallfont_parent.content = smallfont_parent_content.strip
What the code is doing should be pretty easy to figure out as Nokogiri's methods are pretty self-explanatory.

Ruby: regex to remove tags if attributes don't have allowed values

I have such a text:
click here!blah-blah-some-text-here-blahclick here!
What's the correct way to remove all <a></a> tags (and everything inside them) if <a href= does NOT have some-good-website?
A possible solution using Nokogiri:
require 'nokogiri'
TEST = 'click here!blah-blah-some-text-here-blahclick here!'
page = Nokogiri::HTML(TEST)
links = page.css("a") # parse all <a></a> elements from content
links.each do |link|
if link['href'] =~ /http:\/\/www\.i-am-hacker\.com\/blah/
link.remove
end
end
puts page # output content for debugging
Output
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>blah-blah-some-text-here-blahclick here!
</body></html>
Useful resource: http://ruby.bastardsbook.com/chapters/html-parsing/
This site helped me a lot in understanding how to use Nokogiri.
If you need to install nokogiri, you can do that by using the following command:
gem install nokogiri

Preserve structure of an HTML page, removing all text nodes

I want to remove all text from html page that I load with nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri and get back HTML like the following, with the text stripped:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don't remove text in the script tags.)
require 'nokogiri'
html = "<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>"
hdoc = Nokogiri::HTML(html)
hdoc.xpath( '//*[text()]' ).each do |el|
el.content='' unless el.name=="script"
end
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
Warning: As you did not specify how to handle a case like <div>foo<h1>bar</h1></div> the above may or may not do what you expect. Alternatively, the following may match your needs:
hdoc.xpath( '//text()' ).each do |el|
el.remove unless el.parent.name=="script"
end
Update
Here's a more elegant solution using a single xpath to select all text nodes not part of a <script> element. I've added more text nodes to show how it handles them.
require 'nokogiri'
hdoc = Nokogiri::HTML <<ENDHTML
<body>
<script>var x = 10;</script>
<div>Hello</div>
<div>foo<h1>Hi</h1>bar</div>
</body>
ENDHTML
hdoc.xpath( '//text()[not(parent::script)]' ).each{ |text| text.remove }
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
In Ruby 1.9 and later, the core of it can be written more simply:
hdoc.xpath( '//text()[not(parent::script)]' ).each(&:remove)

How to select nodes by matching text

If I have a bunch of elements like:
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
Is there a built-in method in Nokogiri that would get me all p elements that contain the text "Apple"? (The example element above would match, for instance).
Nokogiri can do this (now) using jQuery extensions to CSS:
require 'nokogiri'
html = '
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
'
doc = Nokogiri::HTML(html)
doc.at('p:contains("bar")').text.strip
=> "bar"
Here is an XPath that works:
require 'nokogiri'
doc = Nokogiri::HTML(DATA)
p doc.xpath('//li[contains(text(), "Apple")]')
__END__
<p>A paragraph <ul><li>Item 1</li><li>Apple</li><li>Orange</li></ul></p>
You can also do this very easily with Nikkou:
doc.search('p').text_includes('bar')
Try using this XPath:
p = doc.xpath('//p[.//*[contains(text(), "Apple")]]')
