How to select an element using Nokogiri - ruby

Given the following XML, I want to get the value "0123456" for Name="Cat":
xml.xpath '//Custom[Name="Cat"]'
This gives me the first Custom node, which is correct, but I only want the "Value", not the entire Custom node.
<body>
<Custom>
<count>1</count>
<Name>Cat</Name>
<Value>0123456</Value>
</Custom>
<Custom>
<count>2</count>
<Name>Dog</Name>
<Value>9876543</Value>
</Custom>
</body>

I only want the "Value" not the entire Custom node.
So just continue the path down to the Value element:
//Custom[Name="Cat"]/Value

I prefer CSS selectors over XPath for readability, since CSS usually contains less visual noise:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<body>
<Custom>
<count>1</count>
<Name>Cat</Name>
<Value>0123456</Value>
</Custom>
<Custom>
<count>2</count>
<Name>Dog</Name>
<Value>9876543</Value>
</Custom>
</body>
EOT
foo = doc.search('name:contains("Cat")').map{ |node|
node.next_element.text
}
foo # => ["0123456"]
This works because Nokogiri supports some of jQuery's CSS extensions, such as the :contains pseudo-class used here.

To get the text of the Value element, use an XPath like the one below. Parse the input with Nokogiri::XML: the HTML parser lowercases tag names, so a case-sensitive XPath such as //Custom would match nothing.
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<body>
<Custom>
<count>1</count>
<Name>Cat</Name>
<Value>0123456</Value>
</Custom>
<Custom>
<count>2</count>
<Name>Dog</Name>
<Value>9876543</Value>
</Custom>
</body>
EOT
val = doc.xpath("//Custom[Name='Cat']/Value").text
val # => "0123456"

Related

How to get Xpath for the following?

I have an XML file like the following:
<topic>
<title>Abstract
</title>
<body>
<p>
abstract data
</p>
</body>
</topic>
<topic>
<title>Keywords</title>
<body>
<p>
keywords data
</p>
</body>
</topic>
I have to check whether the title is "Keywords" and, if so, show the text inside the <p> element.
Can anyone help me get the exact XPath for this?
Thanks in advance.
Try this one and let me know the result:
//title[text()="Keywords"]/following::p
or
//topic[title[text()="Keywords"]]//p
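//title[text()="Keywords"]/following-sibling::body/p
For the text only:
//title[text()="Keywords"]/following-sibling::body/p/text()
Where you can, avoid the double slash "//" before p and the following axis; they walk through every <p> that comes after the title, not just the ones in the same topic.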
Try the XPath below:
//title[text()="Keywords"]/following::p
Explanation: the XPath starts at the <title> element whose text() is "Keywords", then uses the following axis to move ahead to the <p> elements that come after it.

How to wrap Nokogiri nodeset in ONE span

My goal is to wrap all paragraphs after the first one in a single span. I'm trying to figure out how to wrap a nodeset in one span, because .wrap() wraps each node in its own span. That is, I want:
<p>First</p>
<p>Second</p>
<p>Third</p>
To become:
<p>First</p>
<span>
<p>Second</p>
<p>Third</p>
</span>
Any sample code to help? Thanks!
I'd do it as below:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<-html)
<p>First</p>
<p>Second</p>
<p>Third</p>
html
nodeset = doc.css("p")
new_node = Nokogiri::XML::Node.new('span',doc)
new_node << nodeset[1..-1]
nodeset.first.after(new_node)
puts doc.to_html
# >> <p>First</p><span><p>Second</p>
# >> <p>Third</p></span>
# >>
I'd do it something like this:
require 'nokogiri'
html = '<p>First</p>
<p>Second</p>
<p>Third</p>
'
doc = Nokogiri::HTML(html)
paragraphs = doc.search('p')[1..-1].unlink
doc.at('p').after('<span>')
doc.at('span').add_child(paragraphs)
puts doc.to_html
Which results in HTML looking like:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>First</p>
<span><p>Second</p>
<p>Third</p></span>
</body></html>
To give you an idea what's happening, here's a more verbose output showing intermediate changes to the doc:
paragraphs = doc.search('p')[1..-1].unlink
paragraphs.to_html
# => "<p>Second</p><p>Third</p>"
doc.at('p').after('<span>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>First</p>\n<span></span>\n\n</body></html>\n"
doc.at('span').add_child(paragraphs)
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>First</p>\n<span><p>Second</p>\n<p>Third</p></span>\n\n</body></html>\n"
Looking at the initial HTML, I'm not sure the approach asked about will work well for normal, everyday HTML. However, if you are absolutely sure it will never change from the
<p>...</p>
<p>...</p>
<p>...</p>
layout, then you should be OK. Any answer based on the initial sample HTML will blow up miserably if the HTML really is something like:
<div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
...
<div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>

Sitemesh 3. Output all body tag attributes to the layout

I have a page with:
<body class="auth_loginForm" controller="auth" action="loginForm" >
and a sitemesh layout with body:
<body iamalayout="true" >
<sitemesh:write property='body'/>
</body>
Is there a way to make the attributes of the body appear on the final page?
That is the final body tag would look like:
<body iamalayout="true" class="auth_loginForm" controller="auth" action="loginForm" >
Try setting your layout body tag like this:
<body iamalayout="true" <decorator:getProperty property="body.class" /> <decorator:getProperty property="body.controller" /> <decorator:getProperty property="body.action" /> >
If that doesn't work, try adding writeEntireProperty="true" to each decorator:getProperty tag, like:
<decorator:getProperty property="body.class" writeEntireProperty="true" />
Hope it helps.
http://wiki.sitemesh.org/wiki/display/sitemesh/Tag+References

How to parse url in <a alt="url attribute">

I have html code on a site:
<a alt="Кроссовки adidas. Цвет черный. Категории: Женская обувь, Лучшие отзывы, Кеды, кроссовки, ботинки. Вид 3."
class="enabledZoom MagicThumb-swap" href="http://img2.site.ru/big/120000/129102-3.jpg" rel="zoom-id:Azoom;zoom-width:450;zoom-height:598;zoom-distance:10;zoom-position:right;opacity:50;"
rev="http://img2.site.ru/large/120000/129102-3.jpg" style="outline: 0px; " id="mt-1334303054133">
<img src="http://img2.site.ru/tm/120000/129102-3.jpg" class=""></a>
How can I extract "http://img2.site.ru/large/120000/129102-3.jpg" with the nokogiri gem?
P.S. Nokogiri parses the element as:
[#<Nokogiri::XML::Element:0x42c1ad8 name="a" attributes=[#<Nokogiri::XML::Attr:0x42c1a7e name="alt" value="Кроссовки adidas. Цвет черный. Категории: Женская обувь, Лучшие отзывы, Кеды, кроссовки, ботинки. Вид 1.">, #<Nokogiri::XML::Attr:0x42c1a74 name="class" value="enabledZoom">, #<Nokogiri::XML::Attr:0x42c1a6a name="href" value="http://img2.site.ru/big/120000/129102-1.jpg">, #<Nokogiri::XML::Attr:0x42c1a60 name="rel" value="zoom-id:Azoom;zoom-width:450;zoom-height:598;zoom-distance:10;zoom-position:right;opacity:50;">, #<Nokogiri::XML::Attr:0x42c1a4c name="rev" value="http://img2.site.ru/large/120000/129102-1.jpg">] children=[#<Nokogiri::XML::Element:0x42c0ee4 name="img" attributes=[#<Nokogiri::XML::Attr:0x42c0e94 name="src" value="http://img2.site.ru/tm/120000/129102-1.jpg">, #<Nokogiri::XML::Attr:0x42c0e8a name="class" value="current">]>]>]
You can use the at method if you know the <img> tag you want is in the first <a> tag:
doc.at('a img')['src'] # => "http://img2.site.ru/tm/120000/129102-3.jpg"
If it's not, then you'll need to isolate the <a> or the <img>. I'd probably go after the <a id="..."> using something like:
doc.at('a#mt-1334303054133 img')['src'] # => "http://img2.site.ru/tm/120000/129102-3.jpg"
If there are multiple <a> or <img> tags then your sample isn't good enough and we'd need more information about the HTML you're receiving.

xpath syntax for semi-joins

I know that I can use XPath to perform joins using the "|" operator. Is there a way to perform semi-joins in XPath, for example:
book[author = article/author]/title
If semi-joins exist, what would the output of the query above look like? Does it just output the title element of each book that has an author who also authored an article?
Maybe you want //book[author = //article/author]/title. With your current attempt book[author = article/author] the article elements would need to be children of the book element which does not seem likely.
The given query returns the title of each book that contains an article written by that book's author. Thus, evaluated with the books element below as the context node, the only thing returned would be the title element with the text "Title 0".
<books>
<book>
<title>Title 0</title>
<author>Petri, M</author>
<article>
<title>Title 1</title>
<author>Petri, M</author>
</article>
<article>
<title>Title 2</title>
<author>Butcher, P</author>
</article>
</book>
<book>
<title>Title 3</title>
<author>Butcher, P</author>
<article>
<title>Title 4</title>
<author>Petri, M</author>
</article>
</book>
</books>
