Get all attributes for elements in XML file - ruby

I'm trying to parse a file and get all of the attributes for each <row> tag in the file. The file looks generally like this:
<?xml version="1.0" standalone="yes"?>
<report>
<table>
<columns>
<column name="month"/>
<column name="campaign"/>
<!-- many columns -->
</columns>
<rows>
<row month="December 2009" campaign="Campaign #1"
adgroup="Python" preview="Not available"
headline="We Write Apps in Python"
and="many more attributes here" />
<row month="December 2009" campaign="Campaign #1"
adgroup="Ruby" preview="Not available"
headline="We Write Apps in Ruby"
and="many more attributes here" />
<!-- many such rows -->
</rows></table></report>
Here is the full file: http://pastie.org/7268456#2.
I've looked at every tutorial and answer I can find on various help boards, but they all assume the same thing: that I'm searching for one or two specific tags and need just one or two values from them. I actually have 18 attributes for each <row> tag, and I have a MySQL table with a column for each of the 18 attributes. I need to put the information into an object/hash/array that I can use to insert into the table with ActiveRecord/Ruby.
I started out using Hpricot; you can see the code (which is not relevant) in the edit history of this question.

require 'nokogiri'
doc = Nokogiri.XML(my_xml_string)
doc.css('row').each do |row|
  # row is a Nokogiri::XML::Element
  row.attributes.each do |name, attr|
    # name is a string
    # attr is a Nokogiri::XML::Attr
    p name => attr.value
  end
end
#=> {"month"=>"December 2009"}
#=> {"campaign"=>"Campaign #1"}
#=> {"adgroup"=>"Python"}
#=> {"preview"=>"Not available"}
#=> {"headline"=>"We Write Apps in Python"}
#=> etc.
Alternatively, if you just want an array of hashes mapping attribute names to string values:
rows = doc.css('row').map{ |row| Hash[ row.attributes.map{|n,a| [n,a.value]} ] }
#=> [
#=> {"month"=>"December 2009", "campaign"=>"Campaign #1", "adgroup"=>"Python", … },
#=> {"month"=>"December 2009", "campaign"=>"Campaign #1", "adgroup"=>"Ruby", … },
#=> …
#=> ]
The Nokogiri.XML method is the simplest way to parse an XML string and get a Nokogiri::XML::Document back.
The css method is the simplest way to find all the elements with a given name (ignoring their containment hierarchy and any XML namespaces). It returns a Nokogiri::XML::NodeSet, which is very similar to an array.
Each Nokogiri::XML::Element has an attributes method that returns a Hash mapping the name of the attribute to a Nokogiri::XML::Attr object containing all the information about the attribute (name, value, namespace, parent element, etc.)

Related

Create non-self-closed empty tag with Nokogiri

When I try to create an XML document with Nokogiri::XML::Builder:
builder = Nokogiri::XML::Builder.new do |xml|
xml.my_tag({key: :value})
end
I get the following XML tag:
<my_tag key="value"/>
It is self-closed, but I need the full form:
<my_tag key="value"></my_tag>
When I pass a value inside the node (or even a space):
xml.my_tag("content", key: :value)
xml.my_tag(" ", key: :value)
It generates the full tag:
<my_tag key="value">content</my_tag>
<my_tag key="value"> </my_tag>
But if I pass either an empty string or nil, or even an empty block:
xml.my_tag("", key: :value)
It generates a self-closed tag:
<my_tag key="value"/>
I believe there should be some attribute or something else that helps me but simple Googling didn't find the answer.
I found a possible solution in "Building blank XML tags with Nokogiri?" but it saves all tags as non-self-closed.
You can use Nokogiri's NO_EMPTY_TAGS save option. (XML calls self-closing tags empty-element tags.)
builder = Nokogiri::XML::Builder.new do |xml|
xml.my_tag({key: :value})
end
puts builder.to_xml(save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS)
<?xml version="1.0"?>
<my_tag key="value"></my_tag>
Each of the options is represented in a bit, so you can mix and match the ones you want. For example, setting NO_EMPTY_TAGS by itself will leave your XML on one line without spacing or indentation. If you still want it formatted for humans, you can bitwise or (|) it with the FORMAT option.
builder = Nokogiri::XML::Builder.new do |xml|
  xml.my_tag({key: :value}) do |my_tag|
    my_tag.nested({another: :value})
  end
end
puts builder.to_xml(
  save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS
)
puts
puts builder.to_xml(
  save_with: Nokogiri::XML::Node::SaveOptions::NO_EMPTY_TAGS |
             Nokogiri::XML::Node::SaveOptions::FORMAT
)
<?xml version="1.0"?>
<my_tag key="value"><nested another="value"></nested></my_tag>
<?xml version="1.0"?>
<my_tag key="value">
<nested another="value"></nested>
</my_tag>
There are also a handful of DEFAULT_* options at the end of the list that already combine options into common uses.
Your update mentions "it saves all tags as non-self-closed", as if perhaps you only want this single tag instance to be non-self-closed, and the rest to self close. Nokogiri won't produce an inconsistent document like that, but if you must, you can concatenate some XML strings together that you built with different options.

XML parsing elements and element attributes into array

I am trying to parse some XML into an array. Here is a chunk of the XML I am parsing:
<Group_add>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<par_status>Y</par_status>
<Quality>
<GPRO_status>N</GPRO_status>
<ERX_status>N</ERX_status>
</Quality>
<Profile_Spec_list>
<Spec>08</Spec>
</Profile_Spec_list>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
<Group org_pac_id="0000000002">
...
</Group>
</Group_add>
I am currently able to get the attribute of "Group" and the text within "org_legal_name" and have them added to an array with the code below.
def parse(input_file, output_array)
  puts "Parsing #{input_file} data. Please wait..."
  doc = Nokogiri::XML(File.read(input_file))
  doc.xpath("//Group").each do |group|
    ["org_legal_name"].each do |name|
      output_array << [group["org_pac_id"], group.at(name).inner_html]
    end
  end
end
I would like to add the location "adrs_id" to the output_array as well, but can't seem to figure that part out.
Example output:
["0000000001", "NAME OF GROUP", "OR974772594SP2280XRDXX300"]
["0000000002", "NAME OF GROUP 2", "OR974772594SP2280XRDXX301"]
Starting with:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
</xml>
EOT
Based on your XML I'd use:
array = []
array << doc.at('org_legal_name').text
array << doc.at('Location')['adrs_id']
array # => ["NAME OF GROUP", "OR974772594SP2280XRDXX300"]
If the XML is more complex, which I suspect it is, then we need an accurate, minimal, example of it.
Based on the updated XML (which is still suspicious), here's what I'd use. Notice that I stripped out information that isn't germane to the question, to reduce the XML to the minimum needed:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<Group_add>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
<Group org_pac_id="0000000002">
<org_legal_name>NAME OF ANOTHER GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX301">
<other_tags>xx</other_tags>
</Location>
</Group>
</Group_add>
</xml>
EOT
data = doc.search('Group').map do |group|
  [
    group['org_pac_id'],
    group.at('org_legal_name').text,
    group.at('Location')['adrs_id']
  ]
end
Which results in:
data # => [["0000000001", "NAME OF GROUP", "OR974772594SP2280XRDXX300"], ["0000000002", "NAME OF ANOTHER GROUP", "OR974772594SP2280XRDXX301"]]
Think of the group variable being passed into the block as a placeholder. From that node it's easy to look downward into the DOM, and grab things that apply to only that particular node.
Note that I'm using CSS instead of XPath selectors. They're easier to read and usually work fine. Sometimes we need the added functionality of XPath, and sometimes Nokogiri's support for jQuery-style CSS extensions gives us things that are useful.

Target text without tags using Nokogiri

I have some very bare HTML that I'm trying to parse using Nokogiri (on Ruby):
<span>Address</span><br />
123 Main Street<br />
Sometown<br />
<span>Telephone</span><br />
212-555-555<br />
<span>Hours</span><br />
M-F: 8:00-21:00<br />
Sat-Sun: 8:00-21:00<br />
<hr />
The only tag I have is a surrounding <div> for the page content. Each of the things I want is preceded by a <span>Address</span> type tag. It can be followed by another span or a hr at the end.
I'd like to end up with the address ("123 Main Street\nSometown"), phone number ("212-555-555") and opening hours as separate fields.
Is there a way to get the information out using Nokogiri, or would it be easier to do this with regular expressions?
Using Nokogiri and XPath you could do something like this:
def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").reduce({}) do |memo, span|
    text = ''
    node = span.next_sibling
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end
extract_span_data(html_string)
# {
# "Address" => "123 Main Street\nSometown",
# "Telephone" => "212-555-555",
# "Hours" => "M-F: 8:00-21:00\n Sat-Sun: 8:00-21:00"
# }
Using a proper parser is easier and more robust than using regular expressions (which is a well-documented bad idea™).
I was thinking (rather learning) about xpath:
d.xpath("span[2]/preceding-sibling::text()").each {|i| puts i}
# 123 Main Street
# Sometown
d.xpath("a/text()").text
# "212-555-555"
d.xpath("span[3]/following::text()").text.strip
# "M-F: 8:00-21:00 Sat-Sun: 8:00-21:00"
The first expression starts at the second span and selects the text() nodes that precede it.
You can also approach it from the other direction: start at the first span, select the text() nodes that follow it, and add a predicate that keeps only the ones that still have a span after them.
d.xpath("span[1]/following::text()[following-sibling::span]").each {|i| puts i}
# 123 Main Street
# Sometown
If the document has more spans, you can start with the right ones:
span[x] could be substituted by span[contains(.,'text-in-span')]
span[3] == span[contains(.,'Hours')]
Correct me if anything here is wrong.

How to delete an XML element according to value of one of its children?

I have an xml element looking something like this:
<Description>
<ID>1234</ID>
<SubDescription>
<subID>4501</subID>
</SubDescription>
<SubDescription>
<subID>4502</subID>
</SubDescription>
</Description>
How can I delete the entire "Description" element according to the value of its "ID" child?
You can use the following xpath to select Description nodes that contain an ID node with value 1234:
//Description[./ID[text()='1234']]
So to remove the node, you can do:
doc.xpath("//Description[./ID[text()='1234']]").remove
Example:
require 'nokogiri'
str = %q{
<root>
<Description>
<ID>2222</ID>
<SubDescription>
<subID>4501</subID>
</SubDescription>
<SubDescription>
<subID>4502</subID>
</SubDescription>
</Description>
<Description>
<ID>1234</ID>
<SubDescription>
<subID>4501</subID>
</SubDescription>
<SubDescription>
<subID>4502</subID>
</SubDescription>
</Description>
</root>
}
doc = Nokogiri::XML(str)
doc.xpath("//Description[./ID[text()='1234']]").remove
puts doc
#=> <root>
#=> <Description>
#=> <ID>2222</ID>
#=> <SubDescription>
#=> <subID>4501</subID>
#=> </SubDescription>
#=> <SubDescription>
#=> <subID>4502</subID>
#=> </SubDescription>
#=> </Description>
#=></root>
As you can see, the desired description node is removed.
I personally would use the solution by @JustinKo, albeit with the simpler XPath:
doc.xpath("//Description[ID='1234']").remove
However, if crafting XPath isn't your idea of fun, and writing Ruby is, you can lean on Ruby harder (if slightly less efficiently):
doc.css('ID').select{ |el| el.text=="1234" }.map(&:parent).each(&:remove)
That says:
Find all the elements named <ID>
But pare that down to just the ones whose text is "1234"
Map this to be the <Description> nodes (the result of calling .parent on each)
And then call .remove on each of those.
If you know that there's only ever going to be one match, you can make it simpler with:
doc.css('ID').find{ |el| el.text=="1234" }.parent.remove
To find the ID do:
id = doc.xpath("//ID").text
where doc is the Nokogiri object created from loading the xml document
To check if the element id is what you want try:
if id == "1234"
From your xml file this should return true
Finally to remove the entire Description use:
doc.xpath("//Description").remove
What you're looking for is this:
doc = Nokogiri::XML(File.open("test.xml")) #create Nokogiri object from "test.xml"
id = doc.xpath("//ID").text #this will be a string with the id
doc.xpath("//Description").remove if id == "1234" #true for your xml document, so the entire Description element is removed

Why does Nokogiri's to_xhtml create new `id` attributes from `name`?

Consider the following code:
require 'nokogiri' # v1.5.2
doc = Nokogiri.XML('<body><a name="foo">ick</a></body>')
puts doc.to_html
#=> <body><a name="foo">ick</a></body>
puts doc.to_xml
#=> <?xml version="1.0"?>
#=> <body>
#=> <a name="foo">ick</a>
#=> </body>
puts doc.to_xhtml
#=> <body>
#=> <a name="foo" id="foo">ick</a>
#=> </body>
Notice the new id attribute that has been created.
Who is responsible for this, Nokogiri or libxml2?
Why does this occur? (Is this enforcing a standard?)
The closest I can find is this spec describing how you may put both an id and name attribute with the same value.
Is there any way to avoid this, given the desire to use the to_xhtml method on input that may have <a name="foo">?
This problem arises because I have some input I am parsing with an id attribute on one element and a separate element with a name attribute that happens to conflict.
Apparently it's a feature of libxml2. In http://www.w3.org/TR/xhtml1/#h-4.10 we find:
In XML, fragment identifiers are of type ID, and there can only be a single attribute of type ID per element. Therefore, in XHTML 1.0 the id attribute is defined to be of type ID. In order to ensure that XHTML 1.0 documents are well-structured XML documents, XHTML 1.0 documents MUST use the id attribute when defining fragment identifiers on the elements listed above.
[...]
Note that in XHTML 1.0, the name attribute of these elements is formally deprecated, and will be removed in a subsequent version of XHTML.
The best 'workaround' I've come up with is:
# Destroy all <a name="..."> elements, replacing with children
# if another element with a conflicting id already exists in the document
doc.xpath('//a[@name][not(@id)][not(@href)]').each do |a|
  a.replace(a.children) if doc.at_css("##{a['name']}")
end
Perhaps you could add some other id value to these elements to prevent libxml adding its own.
doc.xpath('//a[@name and not(@id)]').each do |n|
  n['id'] = n['name'] + 'some_suffix'
end
(Obviously you’ll need to determine how to create a unique id value for your document).
