XML parsing elements and element attributes into array - ruby

I am trying to parse some XML into an array. Here is a chunk of the XML I am parsing:
<Group_add>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<par_status>Y</par_status>
<Quality>
<GPRO_status>N</GPRO_status>
<ERX_status>N</ERX_status>
</Quality>
<Profile_Spec_list>
<Spec>08</Spec>
</Profile_Spec_list>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
<Group org_pac_id="0000000002">
...
</Group>
</Group_add>
I am currently able to get the attribute of "Group" and the text within "org_legal_name" and have them added to an array with the code below.
def parse(input_file, output_array)
puts "Parsing #{input_file} data. Please wait..."
doc = Nokogiri::XML(File.read(input_file))
doc.xpath("//Group").each do |group|
["org_legal_name"].each do |name|
output_array << [group["org_pac_id"], group.at(name).inner_html]
end
end
end
I would like to add the location "adrs_id" to the output_array as well, but can't seem to figure that part out.
Example output:
["0000000001", "NAME OF GROUP", "OR974772594SP2280XRDXX300"]
["0000000002", "NAME OF GROUP 2", "OR974772594SP2280XRDXX301"]

Starting with:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
</xml>
EOT
Based on your XML I'd use:
array = []
array << doc.at('org_legal_name').text
array << doc.at('Location')['adrs_id']
array # => ["NAME OF GROUP", "OR974772594SP2280XRDXX300"]
If the XML is more complex, which I suspect it is, then we need an accurate, minimal example of it.
Based on the updated XML (which still looks suspicious), here's what I'd use. Notice that I stripped out information that isn't germane to the question, reducing the XML to the minimum needed:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<Group_add>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
<Group org_pac_id="0000000002">
<org_legal_name>NAME OF ANOTHER GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX301">
<other_tags>xx</other_tags>
</Location>
</Group>
</Group_add>
</xml>
EOT
data = doc.search('Group').map do |group|
[
group['org_pac_id'],
group.at('org_legal_name').text,
group.at('Location')['adrs_id']
]
end
Which results in:
data # => [["0000000001", "NAME OF GROUP", "OR974772594SP2280XRDXX300"], ["0000000002", "NAME OF ANOTHER GROUP", "OR974772594SP2280XRDXX301"]]
Think of the group variable being passed into the block as a placeholder. From that node it's easy to look downward into the DOM, and grab things that apply to only that particular node.
Note that I'm using CSS instead of XPath selectors. They're easier to read and usually work fine. Sometimes we need the added functionality of XPath, and sometimes Nokogiri's use of jQuery's CSS accessors gives us things that are useful.

Related

How to parse an XML file using Nokogiri and Ruby

I have a XML file:
<root>
<person name="brother">Abhijeet</person>
<person name="sister">pratiksha</person>
</root>
I want to parse it using Nokogiri. I tried using CSS and XPath, but they return nil or only the first element's value. How do I retrieve the other values?
I tried:
doc = Nokogiri::XML(xmlFile)
doc.elements.each do |f|
f.each do |y|
p y
end
end
and:
doc.xpath("//person/sister")
doc.at_xpath("//person/sister")
This is the basic way to search for a node with a given attribute and value using CSS:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<root>
<person name="brother">Abhijeet</person>
<person name="sister">pratiksha</person>
</root>
EOT
doc.at('person[name="sister"]').to_html # => "<person name=\"sister\">pratiksha</person>"
You need to research CSS and XPath and how their syntax works. In XPath, //person/sister means "search everywhere for <sister> nodes inside <person> nodes", matching something like:
<root>
<person>
<sister />
</person>
<person>
<sister />
</person>
</root>
Where it would find all the <sister /> nodes. It doesn't look at the attributes of a node.
Don't do:
doc.elements.each do |f|
f.each do |y|
p y
end
end
You're going to waste a lot of CPU walking through every element. Instead, learn how selectors work so you can take advantage of the power of libxml2.

Nokogiri compare field and puts

I am using Nokogiri to parse a XML document and want to output a list of locations where the product name matches a string.
I'm able to output a list of all product names or a list of all locations but I'm not able to compare the two. Removing the if portion of the statement correctly outputs all the locations. What am I doing wrong with my regex?
#doc = Nokogiri::HTML::DocumentFragment.parse <<-EOXML
<?xml version="1.0"?>
<root>
<product>
<name>cool_fish</name>
<product_details>
<location>ocean</location>
<costs>
<msrp>9.99</msrp>
<margin>5.00</margin>
</costs>
</product_details>
</product>
<product>
<name>veggies</name>
<product_details>
<location>field</location>
<costs>
<msrp>2.99</msrp>
<margin>1.00</margin>
</costs>
</product_details>
</product>
</root>
EOXML
doc.xpath("//product").each do |x|
puts x.xpath("location") if x.xpath("name") =~ /cool_fish/
end
A few things going on here:
As others have pointed out, you should be parsing as XML not HTML, although that wouldn’t actually make much difference to the results you get.
You are parsing as a DocumentFragment; you should parse as a complete document. There are some issues involved in querying document fragments; in particular, queries starting with // don’t work right.
The location element is actually at the position product_details/location relative to the product node in your XML, so you need to update your query to take that into account.
You are trying to use the =~ operator on the result of the xpath method which is a Nokogiri::XML::NodeSet. NodeSet doesn’t define a =~ method, so it uses the default one on Object that just returns nil, so it will never match. You should use at_xpath to only get the first result, and then call text on it to get the string that you can match using =~.
(Also you use #doc and doc, but I’m assuming that’s just a typo.)
So combining those four points your code will look like:
#parse using XML, and not a fragment
doc = Nokogiri::XML <<-EOXML
# ... XML elided for space
EOXML
doc.xpath("//product").each do |x|
# correct query, use at_xpath and call text method
puts x.at_xpath("product_details/location") if x.at_xpath("name").text =~ /cool_fish/
end
However in this case you could do it all in a single XPath query, using the contains function:
# parse doc as XML document as above
puts doc.xpath("//product[contains(name, 'cool_fish')]/product_details/location")
This works because you have a fairly simple regex that only checks against a literal string. XPath 1.0 doesn’t have support for regex, so if your real use case involves a more complex one you may need to do it the “hard way”. (You could write a custom XPath function in that case, but that’s another story.)
Write your code as below:
require 'nokogiri'
@doc = Nokogiri::XML <<-EOXML
<?xml version="1.0"?>
<root>
<product>
<name>cool_fish</name>
<product_details>
<location>ocean</location>
<costs>
<msrp>9.99</msrp>
<margin>5.00</margin>
</costs>
</product_details>
</product>
<product>
<name>veggies</name>
<product_details>
<location>field</location>
<costs>
<msrp>2.99</msrp>
<margin>1.00</margin>
</costs>
</product_details>
</product>
</root>
EOXML
@doc.xpath("//product").each do |x|
  # .// searches the descendants of the current product node only
  puts x.at_xpath(".//location").text if x.at_xpath(".//name").text =~ /cool_fish/
end
# >> ocean
You are parsing XML, so you should use Nokogiri::XML. Your XPath expressions were also incorrect: a bare name such as location passed to xpath only matches direct children of the product node, and location is nested deeper. A bare name like that works as a descendant search with methods like css or search. I used at_xpath because you are only interested in a single matching node inside the each block.
You can also use at in place of at_xpath and search in place of xpath.
Remember that search and at both understand CSS as well as XPath expressions. search, xpath and css all return a NodeSet, whereas at, at_css and at_xpath return a single Node. Once you have a Nokogiri node in hand, call its text method to get the content of that node.
I would suggest using Nokogiri::XML instead:
@doc = Nokogiri::XML::Document.parse <<-EOXML
<?xml version="1.0"?>
<root>
<product>
<name>cool_fish</name>
<product_details>
<location>ocean</location>
<costs>
<msrp>9.99</msrp>
<margin>5.00</margin>
</costs>
</product_details>
</product>
<product>
<name>veggies</name>
<product_details>
<location>field</location>
<costs>
<msrp>2.99</msrp>
<margin>1.00</margin>
</costs>
</product_details>
</product>
</root>
EOXML
and then the Nokogiri::XML::Node#search and Nokogiri::XML::Node#at methods:
@doc.search("product").each do |x|
  puts x.at("location").content if x.at("name").content =~ /cool_fish/
end

Get all attributes for elements in XML file

I'm trying to parse a file and get all of the attributes for each <row> tag in the file. The file looks generally like this:
<?xml version="1.0" standalone="yes"?>
<report>
<table>
<columns>
<column name="month"/>
<column name="campaign"/>
<!-- many columns -->
</columns>
<rows>
<row month="December 2009" campaign="Campaign #1"
adgroup="Python" preview="Not available"
headline="We Write Apps in Python"
and="many more attributes here" />
<row month="December 2009" campaign="Campaign #1"
adgroup="Ruby" preview="Not available"
headline="We Write Apps in Ruby"
and="many more attributes here" />
<!-- many such rows -->
</rows></table></report>
Here is the full file: http://pastie.org/7268456#2.
I've looked at every tutorial and answer I can find on various help boards, but they all assume the same thing: that I'm searching for one or two specific tags and just need one or two values for those tags. I actually have 18 attributes for each <row> tag, and I have a MySQL table with a column for each of the 18 attributes. I need to put the information into an object/hash/array that I can use to insert into the table with ActiveRecord/Ruby.
I started out using Hpricot; you can see the code (which is not relevant) in the edit history of this question.
require 'nokogiri'
doc = Nokogiri.XML(my_xml_string)
doc.css('row').each do |row|
# row is a Nokogiri::XML::Element
row.attributes.each do |name,attr|
# name is a string
# attr is a Nokogiri::XML::Attr
p name => attr.value
end
end
#=> {"month"=>"December 2009"}
#=> {"campaign"=>"Campaign #1"}
#=> {"adgroup"=>"Python"}
#=> {"preview"=>"Not available"}
#=> {"headline"=>"We Write Apps in Python"}
#=> etc.
Alternatively, if you just want an array of hashes mapping attribute names to string values:
rows = doc.css('row').map{ |row| Hash[ row.attributes.map{|n,a| [n,a.value]} ] }
#=> [
#=>   {"month"=>"December 2009", "campaign"=>"Campaign #1", "adgroup"=>"Python", … },
#=>   {"month"=>"December 2009", "campaign"=>"Campaign #1", "adgroup"=>"Ruby", … },
#=>   …
#=> ]
The Nokogiri.XML method is the simplest way to parse an XML string and get a Nokogiri::XML::Document back.
The css method is the simplest way to find all the elements with a given name (ignoring their containment hierarchy and any XML namespaces). It returns a Nokogiri::XML::NodeSet, which is very similar to an array.
Each Nokogiri::XML::Element has an attributes method that returns a Hash mapping the name of each attribute to a Nokogiri::XML::Attr object containing all the information about the attribute (name, value, namespace, parent element, etc.).

How do I validate specific attributes in XML using Ruby's REXML?

I'm trying to read some XML I've retrieved from a web service, and validate a specific attribute within the XML.
This is the XML up to the tag that I need to validate:
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<QueryResponse xmlns="http://tempuri.org/">
<QueryResult xmlns:a="http://schemas.datacontract.org/2004/07/Entity"
xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:Navigation i:nil="true" />
<a:SearchResult>
<a:EntityList>
<a:BaseEntity i:type="a:Product">
<a:ExtractDateTime>1290398428</a:ExtractDateTime>
<a:ExtractDateTimeFormatted>11/22/2010
04:00:28</a:ExtractDateTimeFormatted>
Here's the code I have thus far using REXML in Ruby:
require 'xmlsimple'
require 'rexml/document'
require 'rexml/streamlistener'
include REXML
class Listener
include StreamListener
xmlfile = File.new("rbxml_CS_Query.xml")
xmldoc = Document.new(xmlfile)
# Now get the root element
root = xmldoc.root
puts root.attributes["a:EntityList"]
# This will output the date/time of the query response
xmldoc.elements.each("a:BaseEntity"){
|e| puts e.attributes["a:ExtractDateTimeFormatted"]
}
end
I need to validate that ExtractDateTimeFormatted is there and has a valid value for that attribute. Any help is greatly appreciated. :)
Reading from local xml file.
File.open('temp.xml', 'w') { |f|
f.puts request
f.close
}
xml = File.read('temp.xml')
doc = Nokogiri::XML::Reader(xml)
extract_date_time_formatted = doc.at(
'//a:ExtractDateTimeFormatted',
'a' => 'http://schemas.datacontract.org/2004/07/Entity'
).inner_text
show = DateTime.strptime(extract_date_time_formatted, '%m/%d/%Y')
puts show
When I run this code I get an error: "undefined method 'at' for #" on line 21.
Are you tied to REXML or can you switch to Nokogiri? I highly recommend Nokogiri over the other Ruby XML parsers.
I had to add enough XML tags to make the sample validate.
require 'date'
require 'nokogiri'
xml = %q{<?xml version="1.0"?>
<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:Body>
<QueryResponse xmlns="http://tempuri.org/">
<QueryResult xmlns:a="http://schemas.datacontract.org/2004/07/Entity" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:Navigation i:nil="true"/>
<a:SearchResult>
<a:EntityList>
<a:BaseEntity i:type="a:Product">
<a:ExtractDateTime>1290398428</a:ExtractDateTime>
<a:ExtractDateTimeFormatted>11/22/2010</a:ExtractDateTimeFormatted>
</a:BaseEntity>
</a:EntityList>
</a:SearchResult>
</QueryResult>
</QueryResponse>
</s:Body>
</s:Envelope>
}
doc = Nokogiri::XML(xml)
extract_date_time_formatted = doc.at(
'//a:ExtractDateTimeFormatted',
'a' => 'http://schemas.datacontract.org/2004/07/Entity'
).inner_text
puts DateTime.strptime(extract_date_time_formatted, '%m/%d/%Y')
# >> 2010-11-22T00:00:00+00:00
There are a couple of things going on that could make this harder to handle than a simple XML file.
The XML is using namespaces. They are useful, but you have to tell the parser how to handle them. That is why I had to add the second parameter to the at() accessor.
The date value is in a format that is often ambiguous. For many days of the year it is hard to tell whether it is mm/dd/yyyy or dd/mm/yyyy. Here in the U.S. we assume the first, while most of Europe uses the second. The DateTime parser tries to figure it out but often gets it wrong, especially when it thinks it's supposed to be dealing with a month 22. So, rather than let it guess, I told it to use the mm/dd/yyyy format. If a date doesn't match that format, or its values are out of range, Ruby will raise an exception, so you'll need to code for that.
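A minimal sketch of "coding for that" exception, using only Ruby's standard date library. The helper name parse_us_date is made up for this example, and it assumes the value is a date-only string in mm/dd/yyyy form, as in the sample above.

```ruby
require 'date'

# Hypothetical helper: returns a Date for a valid mm/dd/yyyy string, nil otherwise.
def parse_us_date(value)
  text = value.to_s.strip
  return nil if text.empty?

  Date.strptime(text, '%m/%d/%Y')
rescue ArgumentError
  # Raised for strings that don't match the format and for out-of-range values
  nil
end

p parse_us_date('11/22/2010')
p parse_us_date('22/11/2010') # month 22 is out of range
p parse_us_date(nil)
```

Returning nil keeps the caller simple; raising a more descriptive error of your own would be a reasonable alternative if a missing or malformed date should stop processing.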
This is an example of how to retrieve and parse XML on the fly:
require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; a bare open() no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::XML(URI.open('http://java.sun.com/developer/earlyAccess/xml/examples/samples/book-order.xml'))
puts doc.class
puts doc.to_xml
And an example of how to read a local XML file and parse it:
require 'nokogiri'
doc = Nokogiri::XML(File.read('test.xml'))
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <root xmlns:foo="bar">
# >> <bar xmlns:hello="world"/>
# >> </root>

Nokogiri and XML Formatting When Inserting Tags

I'd like to use Nokogiri to insert nodes into an XML document. Nokogiri uses the Nokogiri::XML::Builder class to insert or create new XML.
If I create XML using the new method, I'm able to create nice, formatted XML:
builder = Nokogiri::XML::Builder.new do |xml|
xml.product {
xml.test "hi"
}
end
puts builder
outputs the following:
<?xml version="1.0"?>
<product>
<test>hi</test>
</product>
That's great, but what I want to do is add the above XML to an existing document, not create a new document. According to the Nokogiri documentation, this can be done by using the Builder's with method, like so:
builder = Nokogiri::XML::Builder.with(document.at('products')) do |xml|
xml.product {
xml.test "hi"
}
end
puts builder
When I do this, however, the XML all gets put into a single line with no indentation. It looks like this:
<products><product><test>hi</test></product></products>
Am I missing something to get it to format correctly?
Found the answer in the Nokogiri mailing list:
In XML, whitespace can be considered meaningful. If you parse a document that contains whitespace nodes, libxml2 will assume that whitespace nodes are meaningful and will not insert them for you.
You can tell libxml2 that whitespace is not meaningful by passing the "noblanks" flag to the parser. To demonstrate, here is an example that reproduces your error, then does what you want:
require 'nokogiri'
def build_from node
builder = Nokogiri::XML::Builder.with(node) do|xml|
xml.hello do
xml.world
end
end
end
xml = DATA.read
doc = Nokogiri::XML(xml)
puts build_from(doc.at('bar')).to_xml
doc = Nokogiri::XML(xml) { |x| x.noblanks }
puts build_from(doc.at('bar')).to_xml
Output:
<root>
<foo>
<bar>
<baz />
</bar>
</foo>
</root>

Resources