I have a string:
<products type="array">
<product><brand>Rho2</brand>
<created-at type="datetime">2011-11-03T21:29:46Z</created-at><id type="integer">78013</id><name>Test2</name>
<price nil="true"/>
<quantity nil="true"/>
<sku nil="true"/>
<updated-at type="datetime">2011-11-03T21:29:46Z</updated-at>
</product>
<product>
<brand>Apple</brand>
<created-at type="datetime">2011-10-26T21:26:59Z</created-at>
<id type="integer">77678</id>
<name>iPhone</name>
<price>$199.99</price>
<quantity>5</quantity>
<sku>1234</sku>
<updated-at type="datetime">2011-10-26T21:27:00Z</updated-at>
</product>
I want to get the text between <brand> and </brand>.
I am trying to parse this XML, collecting data between tags.
XmlSimple should be easy.
require 'xmlsimple'
products = XmlSimple.xml_in('<YOUR WHOLE XML>', { 'KeyAttr' => 'product' })
You should use any XML parser available on your platform. Then you can use a simple XPath expression:
//brand
It selects all brand elements in the document.
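As a rough illustration of that expression in Ruby, here is a minimal sketch that uses REXML (chosen here only because it ships with Ruby; any XPath-capable parser would do). The XML is a trimmed-down version of the string from the question.

```ruby
require 'rexml/document'

# Shortened sample of the XML from the question.
xml = <<~XML
  <products type="array">
    <product><brand>Rho2</brand><name>Test2</name></product>
    <product><brand>Apple</brand><name>iPhone</name></product>
  </products>
XML

doc = REXML::Document.new(xml)

# //brand matches every brand element anywhere in the document.
brands = REXML::XPath.match(doc, '//brand').map(&:text)
puts brands
```

This should print each brand on its own line.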
The de facto standard for parsing XML and HTML in Ruby these days is Nokogiri:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<products type="array">
<product>
<brand>Rho2</brand>
<created-at type="datetime">2011-11-03T21:29:46Z</created-at>
</product>
<product>
<brand>Apple</brand>
<created-at type="datetime">2011-10-26T21:26:59Z</created-at>
</product>
</products>
EOT
puts doc.search('brand').map(&:text)
Which outputs:
Rho2
Apple
function getStringBetween(str, fromStr, toStr) {
// Start right after fromStr, or at the beginning of str if fromStr is absent.
var fromIndex = str.indexOf(fromStr);
var start = fromIndex === -1 ? 0 : fromIndex + fromStr.length;
// Stop at the first toStr after start, or at the end of str if toStr is absent.
var toIndex = str.indexOf(toStr, start);
var end = toIndex === -1 ? str.length : toIndex;
return str.substring(start, end);
}
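Since the question is about Ruby, a plain-string version of the same idea could look like the sketch below. It assumes the tags carry no attributes and are never nested, which is exactly where string matching on XML breaks down; a real parser is the safer choice.

```ruby
# Shortened sample of the XML from the question, as a plain string.
xml = '<product><brand>Rho2</brand></product><product><brand>Apple</brand></product>'

# Non-greedy match so each capture stops at the nearest closing tag.
brand_pattern = %r{<brand>(.*?)</brand>}m

first_brand = xml[brand_pattern, 1]             # first capture only
all_brands  = xml.scan(brand_pattern).flatten   # every capture

puts first_brand
puts all_brands.inspect
```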
Based on the example XML file employees.xml below, and using the Ruby Nokogiri gem, I want to open this file, change the building number to 320 and the room number to 99 for Sandra Defoe, and save the changes. What is the recommended way to do it?
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be129">
<firstname>Jane</firstname>
<lastname>Doe</lastname>
<building>327</building>
<room>19</room>
</employee>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>327</building>
<room>22</room>
</employee>
<employee id="be133">
<firstname>Steve</firstname>
<lastname>Casey</lastname>
<building>327</building>
<room>24</room>
</employee>
</employees>
I'd use this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>327</building>
<room>22</room>
</employee>
</employees>
EOT
first_name = 'Sandra'
last_name = 'Defoe'
node = doc.at("//employee[firstname/text()='%s' and lastname/text()='%s']" % [first_name, last_name])
node.at('building').content = '320'
node.at('room').content = '99'
Which results in:
doc.to_xml
# => "\uFEFF<?xml version=\"1.0\" encoding=\"utf-16\"?>\n" +
# "<employees>\n" +
# " <employee id=\"be130\">\n" +
# " <firstname>William</firstname>\n" +
# " <lastname>Defoe</lastname>\n" +
# " <building>326</building>\n" +
# " <room>14a</room>\n" +
# " </employee>\n" +
# " <employee id=\"be132\">\n" +
# " <firstname>Sandra</firstname>\n" +
# " <lastname>Defoe</lastname>\n" +
# " <building>320</building>\n" +
# " <room>99</room>\n" +
# " </employee>\n" +
# "</employees>\n"
Normally I recommend using CSS selectors because they tend to result in less visual noise. However, CSS doesn't let us peek into the text of nodes, and working around that, while possible, results in even more noise. XPath, on the other hand, can be very noisy, but for this sort of task it's more usable.
XPath is very well documented and figuring out what this is doing should be pretty easy.
The Ruby side of it is using a "format string":
"//employee[firstname/text()='%s' and lastname/text()='%s']" % [first_name, last_name]
similar to
"%s %s" % [first_name, last_name] # => "Sandra Defoe"
"//employee[firstname/text()='%s' and lastname/text()='%s']" % [first_name, last_name]
# => "//employee[firstname/text()='Sandra' and lastname/text()='Defoe']"
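One caveat with interpolating names into single-quoted XPath literals: a name containing an apostrophe would break the expression. The sketch below shows one possible workaround. The helper xpath_literal and the name O'Brien are hypothetical and not part of the answer; the helper relies on the fact that XPath 1.0 has no escape character, so mixed quotes have to be stitched together with concat().

```ruby
# Hypothetical helper: render a Ruby string as an XPath 1.0 string literal.
def xpath_literal(value)
  return "'#{value}'" unless value.include?("'")
  return "\"#{value}\"" unless value.include?('"')

  # Both quote kinds present: single-quote each piece and rejoin the
  # pieces with a double-quoted apostrophe inside concat().
  parts = value.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q(, "'", ))})"
end

first_name = 'Sandra'
last_name  = "O'Brien" # hypothetical name, chosen to contain an apostrophe

xpath = "//employee[firstname/text()=%s and lastname/text()=%s]" %
        [xpath_literal(first_name), xpath_literal(last_name)]
puts xpath
```

The resulting string can be passed to doc.at in the same way as the original format string.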
Just for thoroughness, here's what I'd do if I wanted to use CSS exclusively:
node = doc.search('employee').find { |node|
node.at('firstname').text == first_name && node.at('lastname').text == last_name
}
This gets ugly though, because search tells Nokogiri to retrieve all employee nodes from libXML, then Ruby has to walk through them all telling Nokogiri to tell libXML to look in the child firstname and lastname nodes and return their text. That's slow, especially if there are many employee nodes and the one you want is at the bottom of the file.
The XPath selector tells Nokogiri to pass the search to libXML which parses it, finds the employee node with the child nodes containing the first and last names and returns only that node. It's much faster.
Note that at('employee') is equivalent to search('employee').first.
# File 'lib/nokogiri/xml/searchable.rb', line 70
def at(*args)
search(*args).first
end
Finally, meditate on the difference between NodeSet#text and Node#text, as the first will lead to insanity.
Assume your content is a string:
xml=%q(
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be129">
<firstname>Jane</firstname>
<lastname>Doe</lastname>
<building>327</building>
<room>19</room>
</employee>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>327</building>
<room>22</room>
</employee>
<employee id="be133">
<firstname>Steve</firstname>
<lastname>Casey</lastname>
<building>327</building>
<room>24</room>
</employee>
</employees>)
doc = Nokogiri.parse(xml)
This will work, but it assumes the combination of first and last name is unique; otherwise it modifies only the first matching employee.
target = doc.css('employee').find do |node|
node.search('firstname').text == 'Sandra' &&
node.search('lastname').text == 'Defoe'
end
target.at_css('building').content = '320'
target.at_css('room').content = '99'
doc # outputs the updated xml
=> <?xml version="1.0"?>
<?xml version="1.0" encoding="utf-16"?>
<employees>
<employee id="be129">
<firstname>Jane</firstname>
<lastname>Doe</lastname>
<building>327</building>
<room>19</room>
</employee>
<employee id="be130">
<firstname>William</firstname>
<lastname>Defoe</lastname>
<building>326</building>
<room>14a</room>
</employee>
<employee id="be132">
<firstname>Sandra</firstname>
<lastname>Defoe</lastname>
<building>320</building>
<room>99</room>
</employee>
<employee id="be133">
<firstname>Steve</firstname>
<lastname>Casey</lastname>
<building>327</building>
<room>24</room>
</employee>
</employees>
I have the following XML, where each line holds two pieces of data: the ID and the product description (for example, ID=18863 for paper A4, ID=18858 for TV, and so on).
<products>
<product id="18863">paper A4 </product>
<product id="18858">TV Smart 12 </product>
<product id="18857">KitKat </product>
<product id="8816">Pen </product>
</products>
How do I take the ID and the description (paper A4, TV Smart 12...)?
@doc = Nokogiri::XML(open("http://url/file.xml"))
@doc = @doc.xpath(".//products/product")
Thank you
Return a Hash of Content and Attributes
There's more than one way to do this, but the method I find most intuitive is to return a hash of each node's ID and contents. For example:
require 'nokogiri'
@doc = Nokogiri::XML <<'EOF'
<products>
<product id="18863">paper A4 </product>
<product id="18858">TV Smart 12 </product>
<product id="18857">KitKat </product>
<product id="8816">Pen </product>
</products>
EOF
@doc.xpath('//products/product').
map { |p| [p.attribute('id').value, p.content] }.to_h
This will return a hash, where each ID is the key and the product name is the value. For example, the code above returns:
{"18863"=>"paper A4 ",
"18858"=>"TV Smart 12 ",
"18857"=>"KitKat ",
"8816"=>"Pen "}
You may want to use p.content.strip to remove the trailing whitespace from each product, too, but that's outside the scope of your original question.
Note: The above works fine with Ruby 2.1.0 and the IRB console. Your mileage may vary with other Ruby versions, or with Pry.
@doc = Nokogiri::HTML(open('/test.html'))
@doc.xpath('//products/product').each do |p|
puts "#{p['id']} #{p.content}"
end
Result:
18863 paper A4
18858 TV Smart 12
18857 KitKat
8816 Pen
More examples here - http://nokogiri.org/Nokogiri/XML/Node.html
<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
I want to remove the "pencil" node and print the remaining XML using REXML (Ruby). Can anybody tell me how to do that?
Use one of the delete methods: http://rubydoc.info/stdlib/rexml/
require "rexml/document"
string = <<EOF
<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
EOF
doc = REXML::Document.new(string)
doc.delete_element('//pencil')
puts doc
There is also a nice tutorial to get you started: http://www.germane-software.com/software/rexml/docs/tutorial.html
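Note that delete_element with an XPath string removes only the first matching element. If the document could contain several pencil nodes, a sketch like the one below removes them all. The second pencil with id 444 is made up for illustration and is not in the original data.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <product>
    <pen><id>222</id><name>yyy</name></pen>
    <pencil><id>333</id><name>zzz</name></pencil>
    <pencil><id>444</id><name>zzz</name></pencil>
  </product>
XML

# Collect every match first, then detach each one from its parent.
REXML::XPath.match(doc, '//pencil').each(&:remove)

puts doc
```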
I have the following XML, I am trying to get the unique nodes based on the name child node.
Original XML:
<products>
<product>
<name>White Socks</name>
<price>2.00</price>
</product>
<product>
<name>White Socks/name>
<price>2.00</price>
</product>
<product>
<name>Blue Socks</name>
<price>3.00</price>
</product>
</products>
What I'm trying to get:
<products>
<product>
<name>White Socks</name>
<price>2.00</price>
</product>
<product>
<name>Blue Socks</name>
<price>3.00</price>
</product>
</products>
I've tried various things that aren't worth listing here. The closest I got was using XPath, but that returned only the names, as shown below. That's not what I want: I need the full XML as above, not just the node values.
White Socks
Blue Socks
I'm using Ruby and trying to iterate over the nodes like so:
@doc.xpath("//product").each do |node|
Obviously the above currently gets ALL product nodes, whereas I want all unique product nodes (using the child node "name" as the unique identifier)
This transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="kProdByName" match="product"
use="name"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match=
"product
[not(generate-id()
=
generate-id(key('kProdByName',name)[1])
)
]"/>
</xsl:stylesheet>
when applied on the provided XML document (corrected to be well-formed):
<products>
<product>
<name>White Socks</name>
<price>2.00</price>
</product>
<product>
<name>White Socks</name>
<price>2.00</price>
</product>
<product>
<name>Blue Socks</name>
<price>3.00</price>
</product>
</products>
produces the wanted, correct result:
<products>
<product>
<name>White Socks</name>
<price>2.00</price>
</product>
<product>
<name>Blue Socks</name>
<price>3.00</price>
</product>
</products>
Do note:
The identity rule copies every node "as-is".
The Muenchian method for grouping is used.
There is a single overriding template that excludes any product element that is not the first in its group.
XPath-one-liner (Note this is O(N^2) -- will be very slow on many product elements):
/*/product[not(name = following-sibling::product/name)]
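Because the question iterates over the nodes in Ruby, another option is to deduplicate on the Ruby side in a single pass. This is a sketch of that alternative, not part of the answer above. It uses REXML so that it runs without extra gems, and unlike the one-liner it keeps the first product for each name rather than the last.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <products>
    <product><name>White Socks</name><price>2.00</price></product>
    <product><name>White Socks</name><price>2.00</price></product>
    <product><name>Blue Socks</name><price>3.00</price></product>
  </products>
XML

# Remember each name the first time it appears; drop later duplicates.
seen = {}
REXML::XPath.match(doc, '//product').each do |product|
  name = product.elements['name'].text
  if seen[name]
    product.remove
  else
    seen[name] = true
  end
end

names = REXML::XPath.match(doc, '//product/name').map(&:text)
puts doc
```

The hash lookup keeps this linear in the number of product elements.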
With XSLT you can use Muenchian grouping to eliminate duplicates as follows:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:key name="prod-by-name" match="product" use="name"/>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="product[not(generate-id() = generate-id(key('prod-by-name', name)[1]))]"/>
</xsl:stylesheet>
My XML:
<root>
<cars>
<makes>
<honda year="1995">
<model />
<!-- ... -->
</honda>
<honda year="2000">
<!-- ... -->
</honda>
</makes>
</cars>
</root>
I need a XPath that will get me all models for <honda> with year 1995.
so:
/root/cars/makes/honda
But how to reference an attribute?
"I need a XPath that will get me all models for <honda> with year 1995."
That would be:
/root/cars/makes/honda[@year = '1995']/model
Try /root/cars/makes/honda/@year
UPDATE: reading your question again:
/root/cars/makes/honda[@year = '1995']
Bottom line: use the @ character to reference XML attributes.
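A quick way to try an attribute predicate like this is a short Ruby script. The sketch below uses REXML (an arbitrary choice, since the question does not name a language or library) and a condensed copy of the XML from the question.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <root>
    <cars>
      <makes>
        <honda year="1995"><model /></honda>
        <honda year="2000"></honda>
      </makes>
    </cars>
  </root>
XML

# The predicate in square brackets keeps only honda elements whose
# year attribute equals '1995'.
hondas = REXML::XPath.match(doc, "/root/cars/makes/honda[@year = '1995']")
models = REXML::XPath.match(doc, "/root/cars/makes/honda[@year = '1995']/model")

puts hondas.first.attributes['year']
puts models.size
```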