Nokogiri to Find All Data Attributes Using a Wildcard - ruby

I'd like to strip all the data attributes from img tags while looping through a document. I've tried a few options using has_attribute? and xpath, none have returned true.
article.css('img').each do |img|
# A specific `data-` attribute is found when asked for by its exact name
img.has_attribute?("data-lazy-srcset") # true
# But I only get `false` or empty arrays when trying wildcards
img.has_attribute?('data-*') # false
img.has_attribute?("//*[@*[contains(., 'data-')]]") # false
img.has_attribute?("//*[contains(., 'data-')]") # false
img.has_attribute?("//@*[starts-with(name(), 'data-')]") # false
img.xpath("//*[@*[contains(., 'data-')]]") # []
img.xpath("//*[contains(., 'data-')]") # []
end
How do I select all data- attributes on these img tags?

You can search for img tags with an attribute that starts with "data-" using the following:
//img[@*[starts-with(name(),'data-')]]
To break this down:
// - Anywhere in the document
img - img tag
@* - All attributes
starts-with(name(),'data-') - Attribute's name starts with "data-"
Example:
require 'nokogiri'
doc = Nokogiri::HTML(<<-END_OF_HTML)
<img src='' />
<img data-method='a' src= ''>
<img data-info='b' src= ''>
<img data-type='c' src= ''>
<img src= ''>
END_OF_HTML
imgs = doc.xpath("//img[@*[starts-with(name(),'data-')]]")
puts imgs
# <img data-method="a" src="">
# <img data-info="b" src="">
# <img data-type="c" src="">
Or, using your loop:
doc.css('img').select do |img|
img.xpath(".//@*[starts-with(name(),'data-')]").any?
end
#[#<Nokogiri::XML::Element:0x384 name="img" attributes=[#<Nokogiri::XML::Attr:0x35c name="data-method" value="a">, #<Nokogiri::XML::Attr:0x370 name="src">]>,
# #<Nokogiri::XML::Element:0x3c0 name="img" attributes=[#<Nokogiri::XML::Attr:0x398 name="data-info" value="b">, #<Nokogiri::XML::Attr:0x3ac name="src">]>,
# #<Nokogiri::XML::Element:0x3fc name="img" attributes=[#<Nokogiri::XML::Attr:0x3d4 name="data-type" value="c">, #<Nokogiri::XML::Attr:0x3e8 name="src">]>]
UPDATE: To remove the attributes:
doc.css('img').each do |img|
img.xpath(".//@*[starts-with(name(),'data-')]").each(&:remove)
end
puts doc.to_s
#<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">
#<html>
#<body>
# <img src=\"\">
# <img src=\"\">
# <img src=\"\">
# <img src=\"\">
# <img src=\"\">
#</body>
#</html>
This can be simplified to doc.xpath("//img/@*[starts-with(name(),'data-')]").each(&:remove)

Related

How to use Nokogiri to get the full HTML without any text content

I'm trying to use Nokogiri to get a page's full HTML but with all of the text stripped out.
I tried this:
require 'nokogiri'
x = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]").each { |a| a.children.remove }
puts y.to_s
This outputs:
<div class="example"></div>
I've also tried running it without the children.remove part:
y = Nokogiri::HTML.parse(x).xpath("//*[not(text())]")
puts y.to_s
But then I get:
<div class="example"><span>Hello</span></div>
But what I actually want is:
<html><body><div class='example'><span></span></div></body></html>
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

How can I access the path of the current document in a Jekyll Tag?

I currently have the following code for a Jekyll caption tag:
# A Liquid tag for Jekyll sites that allows easy creation of captioned
# images like they are in WordPress.
#
# Author: Martin Thoma (info@martin-thoma.de)
# Source: https://github.com/MartinThoma/jekyll-caption-tag
# Version: 1.2
#
# Example usage:
# {% caption align="aligncenter" width="500" alt="WER calculation" text="WER calculation" url="/images/2013/11/WER-calculation.png" %}
#
# Plugin replaces the template above with:
# <div style="width: 510px" class="wp-caption aligncenter">
# <a href="/images/2013/11/WER-calculation.png">
# <img src="/images/2013/11/WER-calculation.png" alt="WER calculation" width="500" height="494" class="size-full">
# </a>
# <p class="wp-caption-text">WER calculation</p>
# </div>
require 'csv'
#require 'dimensions'
module Jekyll
class CaptionTag < Liquid::Tag
def initialize(tag_name, text, tokens)
super
@text = text
@tokens = tokens
end
def parse_attrs(input)
options = { col_sep: '=', row_sep: ' ', quote_char: '"' }
csv = CSV.new(input, **options)
csv.each_with_object({}) do |row, attrs|
attr, value = row
value ||= true
attrs[attr] = value
end
end
def render(context)
@hash = parse_attrs(@text)
if @hash.has_key?('text') && @hash.has_key?('caption')
puts "[Warning]["+context.environments.first["page"]["url"]+"] One caption Liquid tag has both, 'text' and 'caption' attribute. Using 'caption' is better."
end
if @hash.has_key?('title') && @hash.has_key?('caption')
puts "[Warning]["+context.environments.first["page"]["url"]+"] One caption Liquid tag has both, 'title' and 'caption' attribute. Using 'caption' is better."
end
if @hash.has_key?('text') && !@hash.has_key?('caption')
@hash['caption'] = @hash['text']
end
if @hash.has_key?('title') && !@hash.has_key?('caption')
@hash['caption'] = @hash['title']
end
@divWidth = (@hash['width'].to_i+10).to_s
#puts context.inspect
#Dimensions.dimensions(@hash['url'])
"<div style=\"width: #{@divWidth}px\" class=\"wp-caption #{@hash['align']}\">" +
"<a href=\"#{@hash['url']}\">" +
"<img src=\"#{@hash['url']}\" alt=\"#{@hash['text']}\" width=\"#{@hash['width']}\" height=\"#{@hash['height']}\" class=\"#{@hash['class']}\"/>" +
"</a>" +
"<p class=\"wp-caption-text\">#{@hash['caption']}</p>" +
"</div>"
end
end
end
Liquid::Template.register_tag('caption', Jekyll::CaptionTag)
I would like to check the image dimensions. For that, I have to access the image. To do so, I need the path of the current rendered object. Please note that it is not always as simple as basepath + /posts as there might be basepath/author/moose/index.html or similar situations.
So: How can I access the path of the current rendered document in a Jekyll Tag?
I mean the source path before site generation, so basepath/_site should not appear in the result.
context.registers[:page]["path"]
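As a rough sketch of how that value might be turned into an absolute path: the page path is relative to the site source, so it can be joined with the source directory from the site register. FakeSite and FakeContext below are hypothetical stand-ins so the snippet runs without Jekyll, and the directory and file names are made up. In a real tag, the context passed to render would be used instead.

```ruby
# Hypothetical stand-ins for the objects Jekyll hands to a tag.
FakeSite = Struct.new(:source)
FakeContext = Struct.new(:registers)

# Absolute source path of the page currently being rendered.
def current_page_source_path(context)
  site = context.registers[:site]   # a Jekyll::Site in a real build
  page = context.registers[:page]   # page data; "path" is relative to the source dir
  File.join(site.source, page['path'])
end

# Made-up values for illustration only.
context = FakeContext.new({
  site: FakeSite.new('/home/me/blog'),
  page: { 'path' => '_posts/2013-11-01-wer.md' }
})

source_path = current_page_source_path(context)
puts source_path
```

Because the path comes from the source tree, it does not contain the _site output directory.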

Remove all nodes after a specified node [duplicate]

This question already has answers here:
Nokogiri: Select content between element A and B
I'm grabbing a div of text from a URL and would like to remove everything underneath a paragraph which has a backtotop class. I've seen a traverse snippet here on Stack Overflow which looks promising, but I can't figure out how to incorporate it so that @el only contains everything up to the first p.backtotop in the div.
my code:
@doc = Nokogiri::HTML(open(url))
@el = @doc.css("div")[0]
end
traverse snippet:
doc = Nokogiri::HTML(code)
stop_node = doc.at_css("p.backtotop") # at_css returns a single node, so the == comparison below can match
doc.traverse do |node|
break if node == stop_node
# else, do whatever, e.g. `puts node.name`
end
1. Find the div you want.
2. Find the 'stop' item you want, and then find all the following siblings.
3. Remove them.
For example:
<body>
<div id="a">
<h2>My Section</h2>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#a')
div.at('.backtotop').xpath('following-sibling::*').remove
puts div
#=> <div id="a">
#=> <h2>My Section</h2>
#=> <p class="backtotop">Back to Top</p>
#=>
#=>
#=> </div>
Here's a more complicated example, where the backtotop item may not be at the root of the div:
<body>
<div id="b">
<h2>Another Section</h2>
<section>
<p class="backtotop">Back to Top</p>
<p>More Content</p>
</section>
<p>Even More Content</p>
</div>
</body>
require 'nokogiri'
doc = Nokogiri::HTML(my_html)
div = doc.at('#b')
n = div.at('.backtotop')
until n==div
n.xpath('following-sibling::*').remove
n = n.parent
end
puts div
#=> <div id="b">
#=> <h2>Another Section</h2>
#=> <section><p class="backtotop">Back to Top</p>
#=>
#=> </section>
#=> </div>
If your HTML is more complicated than the above then please provide an actual sample along with the result you want. This is good advice for any future question you ask.

capturing specific text between tags

The explanation is in the comment. I put it there because the <b> tag is interpreted as bold or something, and it screws up the post.
# I need to capture text that is
# enclosed in tags that are both <b> and
# <i>, but if there is more than one
# text enclosed in <i> in the same <b>
# block, then I only want the text
# enclosed in the first <i> tag. For
# example, for the following line:
#
# <b> <i> Important text here </i>
# irrelevant text everywhere else <i>
# irrelevant text here </i> </b> <b>
# <i> Also Important </i> not important
# <i> not important </i> </b>
#
# I want to retrieve only:
# - Important text here
# - Also Important
#
# I also must not retrieve text inside an
# <h2> block. I have been trying to
# delete the block with nodes.delete(nodes.search('h2')),
# but it doesn't actually delete the h2 block
require "rubygems"
require "nokogiri"
html = <<EOT
<b><i> Important text here </i> more text <i> not important text here </i> </b>
<b> <i> Also Important </i> more text <i> not important </i> </b>
<h2><b> <i> I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b i')
nodes.each { |e| puts e }
# Expected output:
# Important text here
# Also Important
require "nokogiri"
require 'pp'
html = <<EOT
<b><i>Important text here</i>more text<i>not important text here</i></b>
<b><i>Also Important</i>more text<i>not important</i></b>
<h2><b><i>I don't want this text either</i></b></h2>
EOT
doc = Nokogiri::HTML(html)
nodes = doc.search('b')
nodes.each { |e| puts e.children.children.first unless e.parent.name == "h2" }
Or with XPath:
nodes = doc.xpath("//../*[local-name() != 'h2']/b/i[1]")
nodes.each { |e| puts e.children.first}

How do I select IDs using xpath in Nokogiri?

Using this code:
doc = Nokogiri::HTML(open("text.html"))
doc.xpath("//span[@id='startsWith_']").remove
I would like to select every span#id starting with 'startsWith_' and remove it. I tried searching, but failed.
Here's an example:
require 'nokogiri'
html = '
<html>
<body>
<span id="doesnt_start_with">foo</span>
<span id="startsWith_bar">bar</span>
</body>
</html>'
doc = Nokogiri::HTML(html)
p doc.search('//span[starts-with(@id, "startsWith_")]').to_xml
That's how to select them.
doc.search('//span[starts-with(@id, "startsWith_")]').each do |n|
n.remove
end
That's how to remove them.
p doc.to_xml
# >> "<span id=\"startsWith_bar\">bar</span>"
# >> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n <span id=\"doesnt_start_with\">foo</span>\n \n</body></html>\n"
The page "XPath, XQuery, and XSLT Functions" has a list of the available functions.
Try this xpath expression:
//span[starts-with(@id, 'startsWith_')]
