How to parse XML to CSV where data is in attributes only - ruby

The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
puts c.join('|')
end

I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
pi.css('|prescribed_property').each do |pp|
data << [
pi['class_ref'],
pp['property_ref'],
pp['is_required'],
pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
]
end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
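If the goal is a file rather than console output, Ruby's csv library can do the joining. This is only a sketch: the rows are hard-coded so it runs on its own, and the output file name in the final comment is a placeholder, not something from the question.

```ruby
require 'csv'

# Rows as collected by the Nokogiri loop above (hard-coded here so the
# sketch is self-contained).
rows = [
  %w[class_ref property_ref is_required UOM_ref],
  ['0161-1#01-765557#1', '0161-1#02-016058#1', 'false', '0161-1#05-003260#1'],
  ['0161-1#01-765557#1', '0161-1#02-016059#1', 'false', '0161-1#05-003260#1']
]

# col_sep switches the delimiter from "," to "|".
pipe_text = CSV.generate(col_sep: '|') do |csv|
  rows.each { |row| csv << row }
end

puts pipe_text

# To write a file instead (hypothetical file name):
# File.write('output.txt', pipe_text)
```

Letting CSV build the lines means a value that happens to contain the delimiter gets quoted instead of silently breaking the row.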
Could you explain this line in greater detail "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"
In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect that's to avoid confusion with the regular use of the operators; at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness; They support namespaces, like the XPath accessors do, only they use |. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.

There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match "h3" tags that have a class attribute, we write:
h3[@class]
To match "h3" tags whose class attribute is equal to the string "r", we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]

Related

How to iterate through nested xml elements using Nokogiri

I have an xml file which includes the nested elements below:
<SourceDetails>
<Origin>Origin</Origin>
<Identifier>Identifier</Identifier>
<Version>0</Version>
</SourceDetails>
I have already used the function at_xpath to extract the above xml snippet from an xml file which has been stored in a variable. Is it possible to iterate through this variable and store the contents of nested xml elements using Ruby Nokogiri? If so, how is this done?
I would like to append each element within SourceDetails to another variable followed by a forward slash. For the above example, I would like to get the content in the format Origin/Identifier/0
There is an easy way
require "nokogiri"
xmlFileData = Nokogiri::XML(File.open('./xmlFile.xml'))
dataArr = xmlFileData.at_xpath("//SourceDetails").text.split("\n")
dataArr.delete_at(0)
puts dataArr.join("/").gsub(/(\s+)/, '')
Here's a quick and dirty one. Since I'm not sure how you're storing the variable containing the XML, I read the XML data from a file to be sure I'm working with the actual XML, which gives us:
require 'nokogiri'
xml = File.open('source_of_xml.xml') { |f| Nokogiri::XML(f) }
values = []
xml.xpath('SourceDetails').each do |elem|
values << elem.text.gsub(/\n/, "").split
end
p values.first.join("/") # assign this to the variable you want.
# => "Origin/Identifier/0"
Does this help or guide you in any way?

Nokogiri - Checking if the value of an xpath exists and is blank or not in Ruby

I have an XML file, and before I process it I need to make sure that a certain element exists and is not blank.
Here is the code I have:
CSV.open("#{csv_dir}/products.csv","w",{:force_quotes => true}) do |out|
out << headers
Dir.glob("#{xml_dir}/*.xml").each do |xml_file|
gdsn_doc = GDSNDoc.new(xml_file)
logger.info("Processing xml file #{xml_file}")
:x
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
row = []
headers.each do |col|
row << product[col]
end
out << row
end
end
end
The following code is not working to find the "description" element and to check whether it is blank or not:
@desc_exists = @gdsn_doc.xpath("//productData/description")
if !@desc_exists.empty?
Here is a sample of the XML file:
<productData>
<description>Chocolate biscuits </description>
<productData>
This is how I have defined the class and Nokogiri:
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
The code had to be moved up to an earlier stage, where Nokogiri was initialised. It doesn't get runtime errors, but it does let XML files with blank descriptions get through and it shouldn't.
class GDSNDoc
def initialize(xml_file)
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
@doc.remove_namespaces!
desc_exists = @doc.xpath("//productData/descriptions")
if !desc_exists.empty?
You are creating your instance like this:
gdsn_doc = GDSNDoc.new(xml_file)
then use it like this:
@desc_exists = @gdsn_doc.xpath("//productData/description")
@gdsn_doc and gdsn_doc are two different things in Ruby - try just using the version without the @:
@desc_exists = gdsn_doc.xpath("//productData/description")
The basic test is to use:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<productData>
<description>Chocolate biscuits </description>
<productData>
EOT
# using XPath selectors...
doc.xpath('//productData/description').to_html # => "<description>Chocolate biscuits </description>"
doc.xpath('//description').to_html # => "<description>Chocolate biscuits </description>"
xpath works fine when the document is parsed correctly.
I get an error "undefined method 'xpath' for nil:NilClass (NoMethodError)"
Usually this means you didn't parse the document correctly. In your case it's because you're not using the right variable:
gdsn_doc = GDSNDoc.new(xml_file)
...
@desc_exists = @gdsn_doc.xpath("//productData/description")
Note that gdsn_doc is not the same as @gdsn_doc. The latter doesn't appear to have been initialized.
@doc = File.open(xml_file) {|f| Nokogiri::XML(f)}
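The difference can be reproduced without Nokogiri at all. In this sketch a plain Object stands in for the GDSNDoc instance; the variable names mirror the question.

```ruby
# A local variable and an instance variable with the same base name are
# unrelated. Assigning one does nothing to the other.
gdsn_doc = Object.new          # local variable, assigned
unset_value = @gdsn_doc        # instance variable, never assigned, so nil

message = nil
begin
  @gdsn_doc.xpath('//productData/description')
rescue NoMethodError => e
  # Calling a method on nil is what produces the error in the question.
  message = e.message
end

puts unset_value.inspect
puts message
```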
While that should work, it's idiomatic to write it as:
@doc = Nokogiri::XML(File.read(xml_file))
File.open(...) do ... end is preferred if you're processing inside the block and want Ruby to automatically close the file. That isn't necessary when you're simply reading then passing the content to something else for processing, hence the use of File.read(...) which slurps the file. (Slurping isn't necessarily a good practice because it can have scalability problems, but for reasonably sized XML/HTML it's OK because it's easier to use DOM-based parsing than SAX.)
If Nokogiri doesn't raise an exception it was able to parse the content, however that still doesn't mean the content was valid. It's a good idea to check
@doc.errors
to see whether Nokogiri/libXML had to do some fix-ups on the content just to be able to parse it. Fixing the markup can change the DOM from what you expect, making it impossible to find a tag based on your assumptions for the selector. You could use xmllint or one of the XML validators to check, but Nokogiri will still have to be happy.
Nokogiri includes a command-line version nokogiri that accepts a URL to the document you want to parse:
nokogiri http://example.com
It'll open IRB with the content loaded and ready for you to poke at it. It's very convenient when debugging and testing. It's also a decent way to make sure the content actually exists if you're dealing with HTML containing DHTML that loads parts of the page dynamically.

Parsing a simple XML-like string with adjacent nodes

I'm using the engtagger gem to classify a sentence according to its parts of speech. The output I get is as follows:
puts text
# => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
I would have expected the gem to give me an array, but I guess I'll have to coerce this into an array myself.
What I'm eventually trying to get is a nested array something like this:
[["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
However I'm not really sure how to approach this with Nokogiri (or another parser library). Here's what I've tried:
(byebug) doc = Nokogiri::XML(text)
#<Nokogiri::XML::Document:0x3fd400286e78 name="document" children=[#<Nokogiri::XML::Element:0x3fd400286900 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd400286464 "My">]>]>
(byebug) Nokogiri.parse(text)
#<Nokogiri::XML::Document:0x3fd40028cd50 name="document" children=[#<Nokogiri::XML::Element:0x3fd40028c7d8 name="nnp" children=[#<Nokogiri::XML::Text:0x3fd40028c378 "My">]>]>
So I've tried two different Nokogiri methods, but both are only showing the first node. How can I get the rest of the adjacent nodes as well?
Alternatively, how can I get the engtagger call to return an array? In the docs, I didn't find an example of how to return an array with all tags, only arrays with one specific kind of tag.
The main thing is that well-formed XML should have a root node. You were receiving only the very first node because it was treated as the root (that is, the topmost) node, and once it was closed, Nokogiri considered the XML document to have ended.
Nokogiri::XML("<root>#{text}</root>").
children.first. # get root node
children.map { |e| [e.text, e.name] }. # map to what’s needed
reject { |e| e.last == 'text' } # filter out garbage
That filtering might be more semantically correct:
Nokogiri::XML("<root>#{text}</root>").
children.first.
children.reject { |e| Nokogiri::XML::Text === e }.
map { |e| [e.text, e.name] }
The problem is you're parsing the fragment incorrectly:
require 'nokogiri'
doc = Nokogiri::XML.fragment("<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>")
doc.to_xml # => "<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>"
Nokogiri wants valid XML, but you can get it to accept partial XML chunks using fragment.
At that point you're able to do:
doc.children.each_with_object([]){ |n, a| a << [n.text, n.name] unless n.text? }
# => [["My", "nnp"], ["name", "nn"], ["is", "vbz"], ["Max", "nnp"]]
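If the tagged string is always this flat (no nesting, no attributes), a parser-free alternative is a regular expression with String#scan. This is a different technique from the Nokogiri answers above, and it is a sketch that assumes that simple shape.

```ruby
text = '<nnp>My</nnp> <nn>name</nn> <vbz>is</vbz> <nnp>Max</nnp>'

# Capture the tag name and the text between matching open/close tags.
# \1 requires the closing tag to repeat the opening tag's name.
pairs = text.scan(%r{<(\w+)>([^<]*)</\1>}).map { |tag, word| [word, tag] }

p pairs
```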

Render span-level string using Kramdown

I know that I can parse and render an HTML document with Kramdown in ruby using something like
require 'kramdown'
s = 'This is a _document_'
Kramdown::Document.new(s).to_html
# '<p>This is a <i>document</i></p>'
In this case, the string s may contain a full document in markdown syntax.
What I want to do, however, is to parse s assuming that it only contains span-level markdown syntax, and obtain the rendered html. In particular there should be no <p>, <blockquote>, or, e.g., <table> in the rendered html.
s = 'This is **only** a span-level string'
# .. ??? ...
# 'This is <b>only</b> a span-level string'
How can I do this?
I would post-process the output with the sanitize gem.
require 'sanitize'
html = Kramdown::Document.new(s).to_html
output = Sanitize.fragment(html, elements:['b','i','em'])
The elements are a whitelist of allowed tags, just add all the tags you want. The gem has a set of predefined whitelists, but none match exactly what you're looking for. (BTW, if you want a list of all the HTML5 elements allowed in a span, see the WHATWG's list of "phrasing content").
I know this wasn't tagged rails, but for the benefit of readers using Rails: use the built-in sanitize helper.
You can create a custom parser, and empty its internal list of block-level parsers.
class Kramdown::Parser::SpanKramdown < Kramdown::Parser::Kramdown
def initialize(source, options)
super
@block_parsers = []
end
end
Then you can use it like this:
text = Kramdown::Document.new(text, :input => 'SpanKramdown').to_html
This should do what you want "the right way".

Nokogiri leaving HTML entities untouched

I want Nokogiri to leave HTML entities untouched, but it seems to be converting the entities into the actual symbol. For example:
Nokogiri::HTML.fragment('<p>&reg;</p>').to_s
results in: "<p>®</p>"
Nothing seems to return the original HTML back to me.
The .inner_html, .text, .content methods all return '®' instead of '&reg;'
Is there a way for Nokogiri to leave these HTML entities untouched?
I've already searched stackoverflow and found similar questions, but nothing exactly like this one.
Not an ideal answer, but you can force it to generate entities (if not nice names) by setting the allowed encoding:
#encoding: UTF-8
require 'nokogiri'
html = Nokogiri::HTML.fragment('<p>&reg;</p>')
puts html.to_html #=> <p>®</p>
puts html.to_html( encoding:'US-ASCII' ) #=> <p>&#174;</p>
It would be nice if Nokogiri used 'nice' names of entities where defined, instead of always using the terse numeric entity, but even that wouldn't be 'preserving' the original.
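If named entities matter more than preserving the exact input, the numeric references can be post-processed. The sketch below is plain Ruby, not a Nokogiri feature: the NAMED table and the name_entities helper are hypothetical and cover only a few code points.

```ruby
# Hypothetical lookup table: code point => entity name. Extend as needed.
NAMED = { 174 => 'reg', 169 => 'copy', 160 => 'nbsp' }.freeze

# Replace decimal character references that have a known name and leave
# every other reference untouched.
def name_entities(html)
  html.gsub(/&#(\d+);/) do |match|
    name = NAMED[Regexp.last_match(1).to_i]
    name ? '&' + name + ';' : match
  end
end

puts name_entities('<p>&#174; and &#8212;</p>')
```

It would be applied to the string returned by to_html with an ASCII encoding, since that is the form in which non-ASCII characters appear as numeric references.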
The root of the problem is that, in HTML, the following all describe the exact same content:
<p>&reg;</p>
<p>®</p>
<p>&#xAE;</p>
<p>&#174;</p>
If you wanted the to_s representation of a text node to actually be &reg; then the markup describing that would really be: <p>&amp;reg;</p>.
If Nokogiri was to always return the same encoding per character as was used to enter the document it would need to store each character as a custom node recording the entity reference. There exists a class that might be used for this (Nokogiri::XML::EntityReference):
require 'nokogiri'
html = Nokogiri::HTML.fragment("<p>Foo</p>")
html.at('p') << Nokogiri::XML::EntityReference.new( html.document, 'reg' )
puts html
#=> <p>Foo&reg;</p>
However, I can't find a way to cause these to be created during parsing using Nokogiri v1.4.4 or v1.5.0. Specifically, the presence or absence of Nokogiri::XML::ParseOptions::NOENT during parsing does not appear to cause one to be created:
require 'nokogiri'
html = "<p>Foo&reg;</p>"
[ Nokogiri::XML::ParseOptions::NOENT,
Nokogiri::XML::ParseOptions::DEFAULT_HTML,
Nokogiri::XML::ParseOptions::DEFAULT_XML,
Nokogiri::XML::ParseOptions::STRICT
].each do |parse_option|
p Nokogiri::HTML(html,nil,'utf-8',parse_option).at('//text()')
end
#=> #<Nokogiri::XML::Text:0x810cca48 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc624 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cc228 "Foo\u00AE">
#=> #<Nokogiri::XML::Text:0x810cbe04 "Foo\u00AE">

Resources