How to get attribute values using Nokogiri - Ruby

I have a webpage whose DOM structure I do not know, but I do know the text I need to find in it. To get the XPath of that text, I do this:
require 'nokogiri'
doc = Nokogiri::HTML(webpage)
path = []
doc.traverse do |node|
  # Record the XPath of every text node whose content matches exactly.
  path << node.path if node.text? && node.content == "my text"
end
puts path
Now suppose I get output like this:
html/body/div[4]/div[8]/div/div[38]/div/p/text()
so that later, when I access this webpage again, I can do this:
doc.xpath("#{path[0]}")
instead of traversing the whole DOM tree every time I want the text.
I want to do some further processing. For that, I need to know which of the element nodes in the above XPath output have attributes, and what their attribute values are. How would I achieve that? The output that I want is:
#=> output desired
{ p => p_attr_value , div => div_attr_value , div[38] => div[38]_attr_value.....so on }
I am not having trouble finding the node that contains "my text". I wanted the full XPath of the "my text" node, which is why I did the whole traversal. Now that I have the full XPath, I want the attributes of each element node I pass through on the way to the "my text" node.
Constraint: I can't use any of the developer tools available in a web browser.
PS: I am a newbie in Ruby and Nokogiri.

To select all attributes of an element that is selected by the XPath expression someExpr, evaluate a new XPath expression:
someExpr/@*
where someExpr must be replaced with the real XPath expression used to select the particular element.
This selects all attributes of all elements (we assume there is just one) that are selected by the XPath expression someExpr.
For example, if the element we want is selected by:
/a/b/c
then all of its attributes are selected by:
/a/b/c/@*

Related

How to iterate on select elements with Xpath with one exception?

I want to iterate over each selector found that contains a specific class in order to retrieve all elements within the divs. This works until it reaches one item containing an ID.
for selector in response.xpath("//div[@class='product-list-entry']"):
My best try to get around this is the following code:
for selector in response.xpath("//div[not(@id) and @class='product-list-entry']"):
Both versions lead to only retrieving 5 result sets instead of the full list.
How can I simply ignore the one with the id and iterate on all others?
This should extract the content of the specific divs (for example, the text of the div, the content of a span, and the text of a p element):
def parse(self, response):
    for selector in response.xpath("//div[@id='product-list']"):
        content = selector.xpath(".//div[not(@id)]/text()").extract()
        content2 = selector.xpath(".//div[not(@id)]/span").extract()
        content3 = selector.xpath(".//div[not(@id)]/p/text()").extract()
        content4 = ...
        print(content, content2, content3, ...)

How do I create a child element within a Nokogiri node?

I’m using Rails 4.2.7 with Nokogiri. I’m having trouble creating a child node. I have the following code
general = doc.xpath("//lomimscc:general")
description = Nokogiri::XML::Node.new "lomimscc:description", doc
string = Nokogiri::XML::Node.new "lomimscc:string", doc
string.content = scenario.abstract
string['language'] = 'en'
description << string
general << description
I want the “description” element to be a child element of the “general” element (and similarly I want the “string” element to be a child of the “description” element). However what is happening is that the description element is appearing as a sibling of the general element. How do I make the element appear as a child instead of a sibling?
The tutorials show how to do this in "Creating new nodes", but the simple example is:
require 'nokogiri'
doc = Nokogiri::XML('<root/>')
doc.at('root').add_child('<foo/>')
doc.to_xml # => "<?xml version=\"1.0\"?>\n<root>\n <foo/>\n</root>\n"
Nokogiri makes it easy to build nodes using a string that contains the markup or nodes you want to add.
You should be able to build upon this easily.
This is also noted throughout the Node documentation any place you see "node_or_tags".
When I changed
general = doc.xpath("//lomimscc:general")
to
general = doc.xpath("//lomimscc:general").first
then everything worked as far as creating child nodes.

Nokogiri and Xpath query

I have this Xpath query inside a loop,
//div[@class='listing_content'][#{i}]/div/div/h3/a/text()
I want to process each node individually.
The problem is that it gives the correct nodes, but all of them at once.
Also when i > 1, it returns nothing at all?
for i in (1...30)
  name = page.xpath("//div[@class='listing_content'][#{i}]/div/div/h3/a/text()")
  puts "this is name"
  puts name
  #Get Business phone
  phone = page.xpath("//div[@class='listing_content'][#{i}]//span[@class='business-phone phone']/text()")
  puts "this is phone"
  puts phone
  #Get Business website(if any)
  puts "this is website"
  website = page.xpath("//div[@class='listing_content'][#{i}]//li[@class='website-feature']//@href")
  puts website
end
Also when i > 1, it returns nothing at all?
This is the second most frequently asked question about XPath.
Use:
(//div[@class='listing_content'])[#{i}]/div/div/h3/a/text()
The cause of the observed behavior is that in XPath the [] has higher precedence (priority) than the // pseudo-operator.
So your original expression selects every div[@class='listing_content'] element that is the i-th such child of its parent.
However, in the XML document you are working with, every div[@class='listing_content'] happens to be the first (and only) such child of its parent, so if i > 1 nothing is selected.
As in any other language, in order to override the default priority, we must use parentheses.

Unable to set InnerText using Html-Agility-Pack

Given an HTML document, I want to identify all the numbers in the document and add custom tags around the numbers.
Right now, I use the following:
HtmlNodeCollection bodyNode = htmlDoc.DocumentNode.SelectNodes("//body");
MatchCollection numbersColl = Regex.Matches(htmlNode.InnerText, <some regex>);
Once I get the numbersColl, I can traverse through each Match and get the index.
However, I can't change the InnerText since it is read-only.
What I need is that if match.Value = 100 and match.Index=25, I want to replace that 25 with
<span isIdentified='true'> 25 </span>
Any help on this will be greatly appreciated. Currently, since I am not able to modify the inner text, I have to modify the InnerHtml, but some element might have 25 in its InnerHtml. That should not be touched. But how do I identify whether the number is within an HTML tag, e.g. < table border='1' > has 1 in the tag?
Here's what I did to work around the read-only InnerText property of a text node: select the parent node of the text node, note the index of the text node in the parent's child node collection, and then do a ReplaceChild(...).
private void WriteText(HtmlNode node, string text)
{
    if (node.ChildNodes.Count > 0)
    {
        node.ReplaceChild(htmlDocument.CreateTextNode(text), node.ChildNodes.First());
    }
    else
    {
        node.AppendChild(htmlDocument.CreateTextNode(text));
    }
}
In your case I believe you need to create a new Element node that wraps the text into an HtmlElement and then just use it as a replacement of the Text node.
Or even better, see if you can do something like the answer posted here:
Replacing a HTML div InnerText tag using HTML Agility Pack
Creating a text node does not do what it should in this case:
myParentNode.AppendChild(D.CreateTextNode("<script>alert('a');</script>"));
Console.Write(myParentNode.InnerHtml);
The result should be something like
<script....
but it is a working script tag even though I add it as text, not as HTML. This is a security issue for me because the text would be input from an anonymous user.

Extracting HTML5 data attributes from a tag

I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.
For example, given:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
I want to get a hash like:
{ 'data-age' => '50', 'data-location' => 'London' }
I was originally hoping to use a wildcard as part of my CSS selector, e.g.
Nokogiri(html).css('span[@data-*]').size
but it seems that isn't supported.
Option 1: Grab all data attributes
If all you need is to list all the page's data attributes, here's a one-liner:
Hash[doc.xpath("//span/@*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]
Output:
{"data-age"=>"50", "data-location"=>"London"}
Option 2: Group results by tag
If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:
tags = []
datasets = "@*[starts-with(name(), 'data-')]"
#If you want any element, replace "span" with "*"
doc.xpath("//span[#{datasets}]").each do |tag|
  tags << Hash[tag.xpath(datasets).map{|a| [a.name,a.value]}]
end
Then tags is an array of hashes, one per tag, each mapping data attribute names to values.
Option 3: Behavior like the jQuery datasets plugin
If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.
module Nokogiri
  module XML
    class Node
      def dataset
        Hash[self.xpath("@*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
      end
    end
  end
end
Then you can find the dataset for a single element:
doc.at_css("span").dataset
Or get the dataset for a group of elements:
doc.css("span").map(&:dataset)
Example:
The following is the behavior of the dataset method above. Given the following lines in the HTML:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>
The output would be:
[
{"data-location"=>"London", "data-age"=>"50"},
{"data-location"=>"Oxford", "data-age"=>"40"}
]
You can do this with a bit of xpath:
doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/@*[starts-with(name(), 'data-')]"
This gets all the attributes of span elements that start with 'data-'. (You might want to do this in two steps: first get all the elements you're interested in, then extract the data attributes from each in turn.)
Continuing the example (using the span in your question):
hash = data_attrs.each_with_object({}) do |n, hsh|
  hsh[n.name] = n.value
end
puts hash
produces:
{"data-age"=>"50", "data-location"=>"London"}
Try looping through element.attributes while ignoring any attribute that does not start with data-.
The Node#css docs mention a way to attach a custom pseudo-selector. This might look like the following for selecting nodes with attributes starting with 'data-':
Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
  def regex_attrs node_set, regex
    node_set.find_all { |node| node.attributes.keys.any? {|k| k =~ /#{regex}/ } }
  end
}.new)
