How to get nokogiri attribute value? - ruby

My xml contains multiple statements like
<House name="bla"><Room id="bla" name="black" ><blah id="blue" name="brown"></blah></Room></House>
I need to get all the values for a given attribute key.
I used nodes = doc.css("[name]") to get the <Room id="bla" name="black" ><blah id="blue" name="brown"></blah></Room>.
But how do I get the value for a key from these nodes? Is there an easier way to do this?

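node_names = doc.css("[name]").map { |node| node['name'] }
returns the name value of every matching node. at_css returns only the first match, and for "[name]" that is the House element ("bla"), so to get just "black" select the Room:
black = doc.at_css("Room[name]")['name']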

Related

How to do nokogiri attribute selection?

I have many statements like this in my test.xml file
<House name="bla"><Room id="bla" name="black" ></Room></House>
How do I print all Rooms with name="black"? I am using a CSS selector, but only House and Room attributes are taken by the selector.
I started by trying to print all the names, regardless of whether they belong to a House or a Room:
nodes = doc.css("name")
But it returns nothing, so I am not able to proceed.
doc.css("name") looks for elements called name, not for a name attribute. CSS has a syntax for matching elements by an attribute key-value pair:
nodes = doc.css("[name='black']")
For future reference, you can also chain attribute selectors:
nodes = doc.css(".my-class[name='black'][foo='bar']")
Or omit the value and match any element where the attribute is present:
nodes = doc.css("[name]")

Parse a string with multiple XML-like tags using Ruby

I have a string which looks like the following:
string = " <SET-TOPIC>INITIATE</SET-TOPIC>
<SETPROFILE>
<PROFILE-KEY>predicates_live</PROFILE-KEY>
<PROFILE-VALUE>yes</PROFILE-VALUE>
</SETPROFILE>
<think>
<set><name>first_time_initiate</name>yes</set>
</think>
<SETPROFILE>
<PROFILE-KEY>first_time_initiate</PROFILE-KEY>
<PROFILE-VALUE>YES</PROFILE-VALUE>
</SETPROFILE>"
My objective is to be able to read out each top-level tag that is in caps from the parse. I use a case statement to evaluate what the top-level key is, such as <SETPROFILE> (there can be many different values), and then run a method that does different things with the contents of the tag.
What this means is I need to be able to know very easily:
top_level_keys = ['SET-TOPIC', 'SETPROFILE', 'SETPROFILE']
and, when I pass in the key, know the full value:
parsed[0].value = {:PROFILE-KEY => predicates_live, :PROFILE-VALUE => yes}
parsed[0].key = ['SET-TOPIC']
I currently parse the whole string as follows:
doc = Nokogiri::XML::DocumentFragment.parse(string)
parsed = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
As a result, I only parse and know of the second tag. The values from the first tag do not show up in the parsed variable.
I have control over what the tags are, if that helps.
But I need to be able to parse and know the contents of both tags as a result of the parse, because I need to apply a method for each instance of the node.
Note: the string also contains just regular text, both before, in between, and after the XML-like tags.
It depends on what you are trying to achieve. The problem is that each repeated tag name overwrites the value already stored under that hash key. The easiest way to collect all the values is to store them in an array:
parsed = doc.search('*').each_with_object({}) do |n, h|
# h[n.name] = n.text :: removed because it overwrites earlier values
(h[n.name] ||= []) << n.text
end

Ruby: Extract and operate on partially extracted Nokogiri objects

require 'nokogiri'
xml = DATA.read
xml_nokogiri = Nokogiri::XML.parse xml
widgets = xml_nokogiri.xpath("//Widget")
dates = widgets.map { |widget| widget.xpath("//DateAdded").text }
puts dates
__END__
<Widgets>
<Widget>
<Price>42</Price>
<DateAdded>04/22/1989</DateAdded>
</Widget>
<Widget>
<Price>29</Price>
<DateAdded>02/05/2015</DateAdded>
</Widget>
</Widgets>
Notes:
This is a contrived example I cooked up, as it's very inconvenient to post the actual code because of too many dependencies. I did it this way so the code is readily testable with copy/paste.
widgets is a Nokogiri::XML::NodeSet object which holds two Nokogiri::XML::Elements, each of which is the XML fragment corresponding to a Widget tag.
I intend to operate on each of those fragments with xpath again, but an xpath query that starts with // seems to query from the ROOT of the XML AGAIN, not from the individual fragment.
Any idea why that is so? I was expecting dates to hold the DateAdded value of each fragment alone.
EDIT: Assume that the tags have such a complicated structure that relative addressing is not practical (like using xpath("DateAdded")).
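A path that starts with // is absolute, so it always searches from the document root regardless of the node it is called on. .//DateAdded gives you a relative XPath (any nested DateAdded node), and a plain DateAdded without preceding slashes matches immediate children only: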
- dates = widgets.map { |widget| widget.xpath("//DateAdded").text }
# for immediate children use 'DateAdded'
+ dates = widgets.map { |widget| widget.xpath("DateAdded").text }
# for nested elements use './/DateAdded'
+ dates = widgets.map { |widget| widget.xpath(".//DateAdded").text }
#⇒ [
# [0] "04/22/1989",
# [1] "02/05/2015"
#]

How to load the xml file from webpage and read particular nodes from xml?

I am planning to load the XML shown below from a web page and then read particular nodes from it. Filtering condition: if the displayName attribute contains "isc-asr901a", it should pick the first matching node and return the value of the id attribute of the ethernetProtocolEndpointExtendedDTO node.
<queryResponse type="EthernetProtocolEndpoint">
<entity >
<ethernetProtocolEndpointExtendedDTO id="2283315" displayName="4c2b8aa7[2275273_isc- asr901a,GigabitEthernet0/0]">
<name>GigabitEthernet0/0</name>
<adminStatus>UP</adminStatus>
</ethernetProtocolEndpointExtendedDTO>
</entity>
<entity >
<ethernetProtocolEndpointExtendedDTO id="2283315" displayName="4c2b8aa7[2275273_isc-asr901a,GigabitEthernet0/0]">
<name>GigabitEthernet0/0</name>
<adminStatus>UP</adminStatus>
</ethernetProtocolEndpointExtendedDTO>
</entity>
</queryResponse>
I am planning to do this using Ruby, but I am new to Ruby. Could someone help me with this? Which parser would make it easy? I am using the code below, but it does not return any value.
strurl = "https://.."
doc = Nokogiri::HTML(open(strurl))
doc.xpath('//queryResponse/entity/ethernetProtocolEndpointDTO[@displayName="[^"]*isc-asr901a[^"]*]').each do |node|
puts node['id']
end
Thanks,
Chandana
You need to use Nokogiri::XML, not Nokogiri::HTML, since this is XML. Furthermore, you had a typo in ethernetProtocolEndpointExtendedDTO: you wrote ethernetProtocolEndpointDTO.
Also, XPath 1.0 has no regular expressions, so you should use contains() to find the display names which contain your string:
require 'nokogiri'
require 'open-uri'
strurl = "https://.."
# Kernel#open no longer opens URLs as of Ruby 3.0; use URI.open from open-uri
doc = Nokogiri::XML(URI.open(strurl))
doc.xpath('//queryResponse/entity/ethernetProtocolEndpointExtendedDTO[contains(@displayName, "isc-asr901a")]').each do |node|
puts node['id']
end
# => 2283315

Extracting HTML5 data attributes from a tag

I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.
For example, given:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
I want to get a hash like:
{ 'data-age' => '50', 'data-location' => 'London' }
I was originally hoping to use a wildcard as part of my CSS selector, e.g.
Nokogiri(html).css('span[@data-*]').size
but it seems that isn't supported.
Option 1: Grab all data attributes
If all you need is to list all the page's data attributes, here's a one-liner:
Hash[doc.xpath("//span/@*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]
Output:
{"data-age"=>"50", "data-location"=>"London"}
Option 2: Group results by tag
If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:
tags = []
datasets = "@*[starts-with(name(), 'data-')]"
# If you want any element, replace "span" with "*"
doc.xpath("//span[#{datasets}]").each do |tag|
tags << Hash[tag.xpath(datasets).map{|a| [a.name,a.value]}]
end
Then tags is an array of hashes, one per tag, each mapping that tag's data attribute names to their values.
Option 3: Behavior like the jQuery datasets plugin
If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.
module Nokogiri
module XML
class Node
def dataset
Hash[self.xpath("@*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
end
end
end
end
Then you can find the dataset for a single element:
doc.at_css("span").dataset
Or get the dataset for a group of elements:
doc.css("span").map(&:dataset)
Example:
The following is the behavior of the dataset method above. Given the following lines in the HTML:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>
The output would be:
[
{"data-age"=>"50", "data-location"=>"London"},
{"data-age"=>"40", "data-location"=>"Oxford"}
]
You can do this with a bit of xpath:
doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/@*[starts-with(name(), 'data-')]"
This gets all the attributes of span elements that start with 'data-'. (You might want to do this in two steps: first get all the elements you're interested in, then extract the data attributes from each in turn.)
Continuing the example (using the span in your question):
hash = data_attrs.each_with_object({}) do |n, hsh|
hsh[n.name] = n.value
end
puts hash
produces:
{"data-age"=>"50", "data-location"=>"London"}
Try looping through element.attributes while ignoring any attribute that does not start with data-.
The Node#css docs mention a way to attach a custom pseudo-selector. This might look like the following for selecting nodes with attributes starting with 'data-':
Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
def regex_attrs node_set, regex
node_set.find_all { |node| node.attributes.keys.any? {|k| k =~ /#{regex}/ } }
end
}.new)
