require 'nokogiri'
xml = DATA.read
xml_nokogiri = Nokogiri::XML.parse xml
widgets = xml_nokogiri.xpath("//Widget")
dates = widgets.map { |widget| widget.xpath("//DateAdded").text }
puts dates
__END__
<Widgets>
<Widget>
<Price>42</Price>
<DateAdded>04/22/1989</DateAdded>
</Widget>
<Widget>
<Price>29</Price>
<DateAdded>02/05/2015</DateAdded>
</Widget>
</Widgets>
Notes:
This is a contrived example I cooked up as its very inconvenient to post the actual code because of too many dependencies. Did this as this code is readily testable on copy/paste.
widgets is a Nokogiri::XML::NodeSet object which has two Nokogiri::XML::Elements. Each of which is the xml fragment corresponding to the Widget tag.
I am intending to operate on each of those fragments with xpath again, but use of xpath query that starts with // seems to query from the ROOT of the xml AGAIN not the individual fragment.
Any idea why its so? Was expecting dates to hold the tag of each fragment alone.
EDIT: Assume that the tags have a complicated structure that
relative addressing is not practical (like using
xpath("DateAdded"))
.//DateAdded will give you relative XPath (any nested DateAdded node), as well as simple DateAdded without preceding slashes (immediate child):
- dates = widgets.map { |widget| widget.xpath("//DateAdded").text }
# for immediate children use 'DateAdded'
+ dates = widgets.map { |widget| widget.xpath("DateAdded").text }
# for nested elements use './/DateAdded'
+ dates = widgets.map { |widget| widget.xpath(".//DateAdded").text }
#⇒ [
# [0] "04/22/1989",
# [1] "02/05/2015"
#]
Related
I have a string which looks like the following:
string = " <SET-TOPIC>INITIATE</SET-TOPIC>
<SETPROFILE>
<PROFILE-KEY>predicates_live</PROFILE-KEY>
<PROFILE-VALUE>yes</PROFILE-VALUE>
</SETPROFILE>
<think>
<set><name>first_time_initiate</name>yes</set>
</think>
<SETPROFILE>
<PROFILE-KEY>first_time_initiate</PROFILE-KEY>
<PROFILE-VALUE>YES</PROFILE-VALUE>
</SETPROFILE>"
My objective is to be able to read out each top level that is in caps with the parse. I use a case statement to evaluate what is the top level key, such as <SETPROFILE> but there can be lots of different values, and then run a method that does different things with the contnts of the tag.
What this means is I need to be able to know very easily:
top_level_keys = ['SET-TOPIC', 'SET-PROFILE', 'SET-PROFILE']
when I pass in the key know the full value
parsed[0].value = {:PROFILE-KEY => predicates_live, :PROFILE-VALUE => yes}
parsed[0].key = ['SET-TOPIC']
I currently parse the whole string as follows:
doc = Nokogiri::XML::DocumentFragment.parse(string)
parsed = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
As a result, I only parse and know of the second tag. The values from the first tag do not show up in the parsed variable.
I have control over what the tags are, if that helps.
But I need to be able to parse and know the contents of both tag as a result of the parse because I need to apply a method for each instance of the node.
Note: the string also contains just regular text, both before, in between, and after the XML-like tags.
It depends on what you are going to achieve. The problem is that you are overriding hash keys by new values. The easiest way to collect values is to store them in array:
parsed = doc.search('*').each_with_object({}) do |n, h|
# h[n.name] = n.text :: removed because it overrides values
(h[n.name] ||= []) << n.text
end
I am storing Time objects in XML as strings. I am having trouble figuring out the best way to reinitialize them back. From strings to Time objects in order to perform a subtraction on them.
here is how they are stored in xml
<time>
<category> batin </category>
<in>2014-10-29 18:20:47 -0400</in>
<out>2014-10-29 18:20:55 -0400</out>
</time>
using
t = Time.now
i am accessing them from xml with
doc = Nokogiri::XML(File.open("time.xml"))
nodes = doc.xpath("//time").each do |node|
temp = TimeClock.new
temp.category = node.xpath('category').inner_text
temp.in = node.xpath('in').inner_text.
temp.out = node.xpath('out').inner_text.
#times << temp
end
what is the best way to reconvert them back to Time objects? i do not see a method of Time object that does this. I found that it is possible to convert to a Date object. but that seems to only give me a format of mm/dd/yyyy which is partly what i want.
in need to be able to subtract
<out>2014-10-29 18:20:55 -0400</out>
from
<in>2014-10-29 18:20:47 -0400</in>
the XML will at some point be stored based on dates but i also need the exact time "hh/mm/ss" as well to perform calculations.
any sugguestions?
The time stdlib extends the class with parsing/conversion methods.
require 'time'
Time.parse('2014-10-29 18:20:47 -0400')
I'd do something like:
require 'nokogiri'
require 'time'
doc = Nokogiri::XML(<<EOT)
<xml>
<time>
<category>
<in>2014-10-29 18:20:47 -0400</in>
<out>2014-10-29 18:20:55 -0400</out>
</category>
</time>
</xml>
EOT
times = doc.search('time category').map{ |category|
in_time, out_time = %w[in out].map{ |n| Time.parse(category.at(n).text) }
{
in: in_time,
out: out_time
}
}
times # => [{:in=>2014-10-29 15:20:47 -0700, :out=>2014-10-29 15:20:55 -0700}]
Both the DateTime and Time classes allow parsing of a small variety of date/time formats. Some formats can cause explosions but this one is safe. Use DateTime if the date could be before the Unix epoch.
in_time, out_time = %w[in out].map{ |n| Time.parse(category.at(n).text) }
Looking at that in IRB:
>> doc.search('time category').to_html
"<category>\n <in>2014-10-29 18:20:47 -0400</in>\n <out>2014-10-29 18:20:55 -0400</out>\n</category>"
doc.search('time category') returns a NodeSet of all <category> nodes.
>> %w[in out]
[
[0] "in",
[1] "out"
]
returns an array of strings.
Time.parse(category.at(n).text)
returns the n node under the <category> node, where n is first 'in', then 'out'.
I am receiving xml-serialised RDF (as part of XMP media descriptions in case that is relevent), and processing in Ruby. I am trying to work with rdf gem, although happy to look at other solutions.
I have managed to load and query the most basic data, but am stuck when trying to build a query for items which contain sequences and bags.
Example XML RDF:
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
<rdf:Description rdf:about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
<dc:date>
<rdf:Seq>
<rdf:li>2013-04-08</rdf:li>
</rdf:Seq>
</dc:date>
</rdf:Description>
</rdf:RDF>
My best attempt at putting together a query:
require 'rdf'
require 'rdf/rdfxml'
require 'rdf/vocab/dc11'
graph = RDF::Graph.load( 'test.rdf' )
date_query = RDF::Query.new( :subject => { RDF::DC11.date => :date } )
results = date_query.execute(graph)
results.map { |result| { result.subject.to_s => result.date.inspect } }
=> [{"test.rdf"=>"#<RDF::Node:0x3fc186b3eef8(_:g70100421177080)>"}]
I get the impression that my results at this stage ("query solutions"?) are a reference to the rdf:Seq container. But I am lost as to how to progress. For the example above, I'd expect to end up, eventually, with an array ["2013-04-08"].
When there is incoming data without the rdf:Seq and rdf:li containers, I am able to extract the strings I want using RDF::Query, following examples at http://rdf.rubyforge.org/RDF/Query.html - unfortunately I cannot find any examples of more complex queries or RDF structures processed in Ruby.
Edit: In addition, when I try to find appropriate methods to use with the RDF::Node object, I cannot see any way to explore any further relations it may have:
results[0].date.methods - Object.methods
=> [:original, :original=, :id, :id=, :node?, :anonymous?, :unlabeled?, :labeled?, :to_sym, :resource?, :constant?, :variable?, :between?, :graph?, :literal?, :statement?, :iri?, :uri?, :valid?, :invalid?, :validate!, :validate, :to_rdf, :inspect!, :type_error, :to_ntriples]
# None of the above leads AFAICS to more data in the graph
I know how to get the same data in xpath (well, at least provided we always get the same paths in the serialisation), but feel it is not the best query language to use in this case (it's my backup plan, however, if it turns out too complex to implement an RDF-query solution)
I think you're correct when saying "my results at this stage ("query solutions"?) are a reference to the rdf:Seq container". RDF/XML is a really horrible serialisation format, instead think of the data as a graph. Here a picture of an RDF:Bag. RDF:Seq works the same and the #students in the example is analogous to the #date in your case.
So to get to the date literal, you need to hop one node further in the graph. I'm not familiar with the syntax of this Ruby library, but something like:
require 'rdf'
require 'rdf/rdfxml'
require 'rdf/vocab/dc11'
graph = RDF::Graph.load( 'test.rdf' )
date_query = RDF::Query.new({
:yourThing => {
RDF::DC11.date => :dateSeq
},
:dateSeq => {
RDF.type => RDF.Seq,
RDF._1 => :dateLiteral
}
})
date_query.execute(graph).each do |solution|
puts "date=#{solution.dateLiteral}"
end
Of course, if you expect the Seq to actually to contain multiple dates (otherwise it wouldn't make sense to have a Seq), you will have to match them with RDF._1 => :dateLiteral1, RDF._2 => :dateLiteral2, RDF._3 => :dateLiteral3 etc.
Or for a more generic solution, match all the properties and objects on the dateSeq with:
:dateSeq => {
:property => :dateLiteral
}
and then filter out the case where :property ends up being RDF:type while :dateLiteral isn't actually the date but RDF:Seq. Maybe the library has also a special method to get all the Seq's contents.
How can I remove an element when parsing XML with Ox?
Ox has an append method - (Object) <<(node) but doesn't seem to have a - (Element) remove method. Nokogiri has a remove function, does Ox have an equivalent?
http://www.ohler.com/ox/Ox/Element.html
Consider this document:
doc = Ox::Document.new(:version => '1.0')
top = Ox::Element.new('top')
top[:name] = 'sample'
doc << top
Now you can observe:
doc.nodes.class => Array
Your nodes are just a regular ruby array.
And thus you have all the Enumerable functionality combined with the Array facilities of Ruby.
To delete the element we've created above, you can do this:
doc.nodes.delete top
Or an index-based removal if that's what you need:
doc.nodes.delete_at 0
Hope this helps
I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.
For example, given:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
I want to get a hash like:
{ 'data-age' => '50', 'data-location' => 'London' }
I was originally hoping use a wildcard as part of my CSS selector, e.g.
Nokogiri(html).css('span[#data-*]').size
but it seems that isn't supported.
Option 1: Grab all data elements
If all you need is to list all the page's data elements, here's a one-liner:
Hash[doc.xpath("//span/#*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]
Output:
{"data-age"=>"50", "data-location"=>"London"}
Option 2: Group results by tag
If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:
tags = []
datasets = "#*[starts-with(name(), 'data-')]"
#If you want any element, replace "span" with "*"
doc.xpath("//span[#{datasets}]").each do |tag|
tags << Hash[tag.xpath(datasets).map{|a| [a.name,a.value]}]
end
Then tags is an array containing key-value hash pairs, grouped by tag.
Option 3: Behavior like the jQuery datasets plugin
If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.
module Nokogiri
module XML
class Node
def dataset
Hash[self.xpath("#*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
end
end
end
end
Then you can find the dataset for a single element:
doc.at_css("span").dataset
Or get the dataset for a group of elements:
doc.css("span").map(&:dataset)
Example:
The following is the behavior of the dataset method above. Given the following lines in the HTML:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>
The output would be:
[
{"data-location"=>"London", "data-age"=>"50"},
{"data-location"=>"Oxford", "data-age"=>"40"}
]
You can do this with a bit of xpath:
doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/#*[starts-with(name(), 'data-')]"
This gets all the attributes of span elements that start with 'data-'. (You might want to do this in two steps, first to get all the elements you're interested in, then extract the data attributes from each in turn.
Continuing the example (using the span in your question):
hash = data_attrs.each_with_object({}) do |n, hsh|
hsh[n.name] = n.value
end
puts hash
produces:
{"data-age"=>"50", "data-location"=>"London"}
Try looping through element.attributes while ignoring any attribue that does not start with a data-.
The Node#css docs mention a way to attach a custom psuedo-selector. This might look like the following for selecting nodes with attributes starting with 'data-':
Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
def regex_attrs node_set, regex
node_set.find_all { |node| node.attributes.keys.any? {|k| k =~ /#{regex}/ } }
end
}.new)