I have an XML file:
<root>
<person name="brother">Abhijeet</person>
<person name="sister">pratiksha</person>
</root>
I want to parse it using Nokogiri. I tried using CSS and XPath, but they return nil or only the first element's value. How do I retrieve the other values?
I tried:
doc = Nokogiri::XML(xmlFile)
doc.elements.each do |f|
f.each do |y|
p y
end
end
and:
doc.xpath("//person/sister")
doc.at_xpath("//person/sister")
This is the basic way to search for a node with a given attribute and value using CSS:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<root>
<person name="brother">Abhijeet</person>
<person name="sister">pratiksha</person>
</root>
EOT
doc.at('person[name="sister"]').to_html # => "<person name=\"sister\">pratiksha</person>"
You need to research CSS and XPath and how their syntax works. In XPath, //person/sister means "search everywhere for <sister> nodes inside <person> nodes", matching something like:
<root>
<person>
<sister />
</person>
<person>
<sister />
</person>
</root>
There it would find all the <sister /> nodes. It doesn't look at the attributes of a node.
Don't do:
doc.elements.each do |f|
f.each do |y|
p y
end
end
You're going to waste a lot of CPU walking through every element. Instead, learn how selectors work so you can take advantage of the power of libxml2.
I'm parsing some XML that I get from various feeds. Apparently some of the XML has an occasional tag that is all upper case. I'd like to normalize the XML to be all lower case tags to make searching, etc. easier.
What I want to do is something like:
parsed = Nokogiri::XML.parse(xml_content)
node = parsed.css("title") # => should return a Nokogiri node for the title tag
However, some of the XML documents have "TITLE" for that tag.
What are my options for getting that node whether its tag is "title", "TITLE", or even "Title"?
Thanks!
If you want to transform your XML document by downcasing all tag names, here's one way to do it:
parsed = Nokogiri::XML.parse(xml_content)
parsed.traverse do |node|
node.name = node.name.downcase if node.kind_of?(Nokogiri::XML::Element)
end
As a general approach you could transform all element (tag) names to lower case (e.g. by using XSLT or another solution) and then do all of your XPath/CSS queries using lower case only.
An XSLT solution based on the lower-case(...) function should work in principle; however, my versions of Ruby (2.0.0p481) and Nokogiri (1.5.6) complain about it. The likely reason is that lower-case() is an XPath 2.0 function, while libxslt, which Nokogiri uses for XSLT, implements only XSLT 1.0.
Here's a solution that seems to work:
require 'nokogiri'
xslt = Nokogiri::XSLT(File.read('lower.xslt'))
# <?xml version="1.0" encoding="UTF-8"?>
# <xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
# <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
# <xsl:template match="*">
# <xsl:element name="{translate(local-name(), $uppercase, $lowercase)}">
# <xsl:apply-templates />
# </xsl:element>
# </xsl:template>
# </xsl:transform>
doc = Nokogiri::XML(File.read('doc.xml'))
# <?xml version="1.0" encoding="UTF-8"?>
# <FOO>
# <BAR>Bar</BAR>
# <GAH>Gah</GAH>
# <ZIP><DOO><DAH/></DOO></ZIP>
# </FOO>
puts xslt.transform(doc)
# <?xml version="1.0"?>
# <foo>
# <bar>Bar</bar>
# <gah>Gah</gah>
# <zip><doo><dah/></doo></zip>
# </foo>
I am trying to parse some XML into an array. Here is a chunk of the XML I am parsing:
<Group_add>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<par_status>Y</par_status>
<Quality>
<GPRO_status>N</GPRO_status>
<ERX_status>N</ERX_status>
</Quality>
<Profile_Spec_list>
<Spec>08</Spec>
</Profile_Spec_list>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
<Group org_pac_id="0000000002">
...
</Group>
</Group_add>
I am currently able to get the attribute of "Group" and the text within "org_legal_name" and have them added to an array with the code below.
def parse(input_file, output_array)
puts "Parsing #{input_file} data. Please wait..."
doc = Nokogiri::XML(File.read(input_file))
doc.xpath("//Group").each do |group|
["org_legal_name"].each do |name|
output_array << [group["org_pac_id"], group.at(name).inner_html]
end
end
end
I would like to add the location "adrs_id" to the output_array as well, but can't seem to figure that part out.
Example output:
["0000000001", "NAME OF GROUP", "OR974772594SP2280XRDXX300"]
["0000000002", "NAME OF GROUP 2", "OR974772594SP2280XRDXX301"]
Starting with:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
</xml>
EOT
Based on your XML I'd use:
array = []
array << doc.at('org_legal_name').text
array << doc.at('Location')['adrs_id']
array # => ["NAME OF GROUP", "OR974772594SP2280XRDXX300"]
If the XML is more complex, which I suspect it is, then we need an accurate, minimal example of it.
Based on the updated XML (which is still suspicious), here's what I'd use. Notice that I stripped out information that isn't germane to the question to reduce the XML to the minimum needed:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<xml>
<Group_add>
<Group org_pac_id="0000000001">
<org_legal_name>NAME OF GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX300">
<other_tags>xx</other_tags>
</Location>
</Group>
<Group org_pac_id="0000000002">
<org_legal_name>NAME OF ANOTHER GROUP</org_legal_name>
<Location adrs_id="OR974772594SP2280XRDXX301">
<other_tags>xx</other_tags>
</Location>
</Group>
</Group_add>
</xml>
EOT
data = doc.search('Group').map do |group|
[
group['org_pac_id'],
group.at('org_legal_name').text,
group.at('Location')['adrs_id']
]
end
Which results in:
data # => [["0000000001", "NAME OF GROUP", "OR974772594SP2280XRDXX300"], ["0000000002", "NAME OF ANOTHER GROUP", "OR974772594SP2280XRDXX301"]]
Think of the group variable being passed into the block as a placeholder. From that node it's easy to look downward into the DOM, and grab things that apply to only that particular node.
Note that I'm using CSS instead of XPath selectors. They're easier to read and usually work fine. Sometimes we need the added functionality of XPath, and sometimes Nokogiri's support for jQuery-style CSS extensions gives us things that are useful.
I am using Nokogiri to parse an XML document and want to output a list of locations where the product name matches a string.
I'm able to output a list of all product names or a list of all locations but I'm not able to compare the two. Removing the if portion of the statement correctly outputs all the locations. What am I doing wrong with my regex?
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOXML
<?xml version="1.0"?>
<root>
<product>
<name>cool_fish</name>
<product_details>
<location>ocean</location>
<costs>
<msrp>9.99</msrp>
<margin>5.00</margin>
</costs>
</product_details>
</product>
<product>
<name>veggies</name>
<product_details>
<location>field</location>
<costs>
<msrp>2.99</msrp>
<margin>1.00</margin>
</costs>
</product_details>
</product>
</root>
EOXML
doc.xpath("//product").each do |x|
puts x.xpath("location") if x.xpath("name") =~ /cool_fish/
end
A few things going on here:
As others have pointed out, you should be parsing as XML, not HTML, although that wouldn’t actually make much difference to the results you get.
You are parsing as a DocumentFragment; you should parse as a complete document. There are some issues involved in querying document fragments; in particular, queries starting with // don’t work right.
The location element is actually at product_details/location relative to the product node in your XML, so you need to update your query to take that into account.
You are trying to use the =~ operator on the result of the xpath method, which is a Nokogiri::XML::NodeSet. NodeSet doesn’t define a =~ method, so it uses the default one on Object that just returns nil, so it will never match. (That default =~ was removed in Ruby 3.2, where the call raises NoMethodError instead.) You should use at_xpath to get only the first result, and then call text on it to get a string that you can match using =~.
(Also, you use @doc and doc, but I’m assuming that’s just a typo.)
So, combining those four points, your code will look like:
#parse using XML, and not a fragment
doc = Nokogiri::XML <<-EOXML
# ... XML elided for space
EOXML
doc.xpath("//product").each do |x|
# correct query, use at_xpath and call text method
puts x.at_xpath("product_details/location") if x.at_xpath("name").text =~ /cool_fish/
end
However in this case you could do it all in a single XPath query, using the contains function:
# parse doc as XML document as above
puts doc.xpath("//product[contains(name, 'cool_fish')]/product_details/location")
This works because you have a fairly simple regex that only checks against a literal string. XPath 1.0 doesn’t have support for regex, so if your real use case involves a more complex one you may need to do it the “hard way”. (You could write a custom XPath function in that case, but that’s another story.)
Write your code as below:
require 'nokogiri'
@doc = Nokogiri::XML <<-EOXML
<?xml version="1.0"?>
<root>
  <product>
    <name>cool_fish</name>
    <product_details>
      <location>ocean</location>
      <costs>
        <msrp>9.99</msrp>
        <margin>5.00</margin>
      </costs>
    </product_details>
  </product>
  <product>
    <name>veggies</name>
    <product_details>
      <location>field</location>
      <costs>
        <msrp>2.99</msrp>
        <margin>1.00</margin>
      </costs>
    </product_details>
  </product>
</root>
EOXML
@doc.xpath("//product").each do |x|
  puts x.at_xpath(".//location").text if x.at_xpath(".//name").text =~ /cool_fish/
end
# >> ocean
You are parsing XML, so you should use Nokogiri::XML. Your XPath expressions were also the problem. You called the #xpath method with the bare names location and name; as XPath, a bare name matches only direct children of the current node, and location is nested inside product_details. The same strings passed to css or search are treated as CSS selectors, which match descendants. I used the at_xpath method with .// because you were interested in a single matching node inside the #each block.
You can also use at in place of #at_xpath and search in place of xpath.
Remember that search and at both understand CSS as well as XPath expressions. search, xpath, and css all return a NodeSet, whereas at, at_css, and at_xpath return a single Node. Once you have a Nokogiri node in hand, use the text method to get the content of that node.
I would suggest using Nokogiri::XML instead:
@doc = Nokogiri::XML::Document.parse <<-EOXML
<?xml version="1.0"?>
<root>
  <product>
    <name>cool_fish</name>
    <product_details>
      <location>ocean</location>
      <costs>
        <msrp>9.99</msrp>
        <margin>5.00</margin>
      </costs>
    </product_details>
  </product>
  <product>
    <name>veggies</name>
    <product_details>
      <location>field</location>
      <costs>
        <msrp>2.99</msrp>
        <margin>1.00</margin>
      </costs>
    </product_details>
  </product>
</root>
EOXML
and then the Nokogiri::XML::Node#search and Nokogiri::XML::Node#at methods:
@doc.search("product").each do |x|
  puts x.at("location").content if x.at("name").content =~ /cool_fish/
end
I am using Nokogiri (1.5.9 - java) in JRuby (1.6.7.2) to copy an XML template and edit it. I'm having problems finding elements in the cloned document.
lblock = doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first
lblock.children = new_children # kind of NodeSet or Node
copy_doc = doc.dup( 1 ) # or dup(0)
lblock = copy_doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first # nil
When I print the copy with to_s or to_xml, the lblock is there, with the new children.
Where is my mistake?
I can't duplicate the problem:
require 'nokogiri'
new_children = Nokogiri::XML::DocumentFragment.parse('<foo>bar</foo>')
doc = Nokogiri::XML(<<EOF)
<xml>
<lblock blockName="WINDOW_LIST" />
</xml>
EOF
lblock = doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first
lblock.children = new_children # kind of NodeSet or Node
copy_doc = doc.dup(1) # or dup(0)
lblock = copy_doc.xpath(".//lblock[@blockName='WINDOW_LIST']").first # nil in the question, found here
puts lblock.to_xml
puts
puts doc.to_xml
Running that outputs:
<lblock blockName="WINDOW_LIST">
<foo>bar</foo>
</lblock>
<?xml version="1.0"?>
<xml>
<lblock blockName="WINDOW_LIST"><foo>bar</foo></lblock>
</xml>
That said, here's code that is cleaned up to show you some simpler ways to write it:
require 'nokogiri'
new_children = '<foo>bar</foo>'
doc = Nokogiri::XML(<<EOF)
<xml>
<lblock blockName="WINDOW_LIST" />
</xml>
EOF
lblock = doc.at_xpath('//lblock')
lblock.children = new_children
copy_doc = doc.dup(1)
lblock = copy_doc.at_css('lblock')
puts lblock.to_xml
puts
puts doc.to_xml
Running it produces the same output:
<lblock blockName="WINDOW_LIST">
<foo>bar</foo>
</lblock>
<?xml version="1.0"?>
<xml>
<lblock blockName="WINDOW_LIST"><foo>bar</foo></lblock>
</xml>
Dissecting the code:
lblock = doc.at_xpath('//lblock')
lblock = copy_doc.at_css('lblock')
These use two different ways of finding the same thing. Because the sample XML is simple, I used the at_* methods, which return the first matching node. at_xpath and at_css work with XPath and CSS selectors respectively. The generic at tries to figure out whether the string is CSS or XPath, and normally gets it right, though I have seen it fooled.
lblock.children = new_children
In this case, new_children is a String. Nokogiri is smart enough to know it should convert the string into an XML fragment before using it. This makes it very easy to modify XML or HTML documents with strings, instead of having to create DocumentFragments.
Given the following XML, which has been parsed into @response using Nokogiri:
<?xml version="1.0" encoding="UTF-8"?>
<foos type="array">
  <foo>
    <id type="integer">1</id>
    <name>bar</name>
  </foo>
</foos>
Does an XPath exist such that @response.xpath(xpath) returns array?
Assume that this xpath must be reused across multiple documents where the naming of foo is inconsistent.
If an xpath is not the correct tool to solve this problem, does Nokogiri provide a method that is?
This xml is automatically generated by the rails framework, and the answer to this question is intended to be used to create an XML equivalent to this Cucumber feature for JSON responses.
If you want to select the root node when its type attribute is array (regardless of the root element's name), then use this:
/*[@type='array']
For its children, use:
/*[@type='array']/*
Simply:
if doc.root['type']=='array'
Here's a test case:
@response = <<ENDXML
<?xml version="1.0" encoding="UTF-8"?>
<foos type="array">
  <foo>
    <id type="integer">1</id>
    <name>bar</name>
  </foo>
</foos>
ENDXML
require 'nokogiri'
doc = Nokogiri.XML(@response)
if doc.root['type']=='array'
puts "It is!"
else
puts "Nope"
end
Depending on your needs, you might want to:
case doc.root['type']
when 'array'
#...
when 'string'
#...
else
#...
end