Using Nokogiri with XML files in Ruby - ruby

I have this XML:
<ExperimentCollection>
<Experiment>
<mzData version="1.05" accessionNumber="1635">
<description>
<admin>
<sampleName>Fas-induced and control Jurkat T-lymphocytes</sampleName>
<sampleDescription>
<cvParam cvLabel="MeSH" accession="D017209" name="apoptosis" />
<cvParam cvLabel="UNITY" accession="D2135" name="Jurkat cells" />
<cvParam cvLabel="MeSH" accession="D019014" name="Antigens, CD95" />
</sampleDescription>
</admin>
</description>
</mzData>
</Experiment>
</ExperimentCollection>
I also have the following code:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("my.xml"))
sampleName = doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleName" ).text
sampleDescription = doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleDescription/MeSH/@accession" ).text
puts sampleName + " " + sampleDescription
foo = sampleName + " " + sampleDescription
f = File.new("my.txt","w")
f.write(foo)
f.close()
The code grabs the sampleName just fine, but not the accession letters/numbers. I only want to grab all the letters/numbers after MeSH -> accession (D017209 and D019014). What do I have to change in the doc.xpath command to make this work?

doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleDescription/MeSH/@accession" )
returns nothing because there is no tag MeSH. You need to replace MeSH with cvParam[@cvLabel=\"MeSH\"] (read: a cvParam tag which has an attribute cvLabel with the value MeSH).
Once you have fixed that, xpath will return a collection of Nokogiri::XML::Attr objects. Calling text on that collection gives you a single string with the values of all its elements run together. Since you want each value separately, you should instead use map(&:text) (or map {|n| n.text} in Ruby 1.8.6), which returns an array containing the string value of each accession attribute (i.e. ["D017209", "D019014"] for the example XML file).
Since you seem to be confused, here's a clarification:
@Bobby: When I said "xpath will return a collection of Nokogiri::XML::Attr objects", I meant just that. You call xpath and then xpath creates and returns a collection of Attr objects. In no way did I mean that you should manually create any Attr objects yourself.
And when I said you should use map, I just meant you should call map on the collection returned by xpath (though instead of using map you can just call puts with the collection as an argument).
So what you need to do is:
1. Fix your xpath as I described.
2. Use xpath with the fixed expression to get a collection.
3. Use puts to print it.
In other words:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("my.xml"))
common_prefix = "/ExperimentCollection/Experiment/mzData/description/admin"
sample_name = doc.xpath( common_prefix+"/sampleName" ).text
accessions = doc.xpath( common_prefix+
"/sampleDescription/cvParam[@cvLabel=\"MeSH\"]/@accession" )
puts sample_name
puts accessions

Here is a simple way to do it, although this is probably too clever, because you'll probably want to do other things as well:
File.open("my.txt","w") do |f|
doc.xpath('//cvParam[@cvLabel="MeSH"]').each {|n| f << "#{n['name']} #{n['accession']}\n"}
end
You may need a more selective xpath statement.

Related

How to search for prop elements in TMX with Nokogiri

I have a TMX translation memory file that I need to parse to be able to import it into a new DB. I'm using Ruby + Nokogiri. This is the TMX (xml) structure:
<body>
<tu creationdate="20181001T113609Z" creationid="some_user">
<prop type="Att::Attribute1">Value1</prop>
<prop type="Txt::Attribute2">Value2</prop>
<prop type="Txt::Attribute3">Value3</prop>
<prop type="Txt::Attribute4">Value4</prop>
<tuv xml:lang="EN-US">
<seg>Testing</seg>
</tuv>
<tuv xml:lang="SL">
<seg>Testiranje</seg>
</tuv>
</tu>
</body>
I've only included 1 TU node here for simplicity.
This is my current script:
require 'nokogiri'
doc = File.open("test_for_import.xml") { |f| Nokogiri::XML(f) }
doc.xpath('//tu').each do |x|
puts "Creation date: " + x.attributes["creationdate"]
puts "User: " + x.attributes["creationid"]
x.children.each do |y|
puts y.children
end
end
This yields the following:
Creation date: 20181001T113609Z
User: some_user
Value1
Value2
Value3
Value4
<seg>Testing</seg>
<seg>Testiranje</seg>
What I need to do is search for Attribute1 and its corresponding value and assign it to a variable. These will then be used as attributes when creating translation records in the new DB. I need the same for seg to get the source and the translation. I don't want to rely on the sequence, even though it should be (and is) always the same.
What is the best way to continue? All the elements are of class Nokogiri::XML::NodeSet. Even after looking at the docs for this I'm still stuck.
Can someone help?
Best, Sebastjan
The easiest way to traverse a node tree like this is using XPath. You've already used XPath for getting your top-level tu element, but you can extend XPath queries much further to get specific elements like you're looking for.
Here on DevHints is a handy cheat-sheet for what you can do with XPath.
Relative to your x variable which points to the tu element, here are the XPaths you'll want to use:
prop[@type="Att::Attribute1"] for finding your prop for Attribute 1
.//seg or tuv/seg for finding the seg elements (a bare //seg would search the whole document, not just the current tu)
Here's a complete code example using those XPaths. The at_xpath method returns one result, whereas the xpath method returns all results.
require 'nokogiri'
doc = File.open("test_for_import.xml") { |f| Nokogiri::XML(f) }
doc.xpath('//tu').each do |x|
puts "Creation date: " + x.attributes["creationdate"]
puts "User: " + x.attributes["creationid"]
# Get Attribute 1
# There should only be one result for this, so using `at_xpath`
attr1 = x.at_xpath('prop[@type="Att::Attribute1"]')
puts "Attribute 1: " + attr1.text
# Get each seg
# There will be many results, so using `xpath`
segs = x.xpath('.//seg')
segs.each do |seg|
puts "Seg: " + seg.text
end
end
This outputs:
Creation date: 20181001T113609Z
User: some_user
Attribute 1: Value1
Seg: Testing
Seg: Testiranje

How to iterate through nested xml elements using Nokogiri

I have an xml file which includes the nested elements below:
<SourceDetails>
<Origin>Origin</Origin>
<Identifier>Identifier</Identifier>
<Version>0</Version>
</SourceDetails>
I have already used the function at_xpath to extract the above xml snippet from an xml file which has been stored in a variable. Is it possible to iterate through this variable and store the contents of nested xml elements using Ruby Nokogiri? If so, how is this done?
I would like to append each element within SourceDetails to another variable followed by a forward slash. For the above example, I would like to get the content in the format Origin/Identifier/0
There is an easy way
require "nokogiri"
xmlFileData = Nokogiri::XML(File.open('./xmlFile.xml'))
dataArr = xmlFileData.at_xpath("//SourceDetails").text.split("\n")
dataArr.delete_at(0)
puts dataArr.join("/").gsub(/(\s+)/, '')
Here's a quick and dirty one. Since I'm not sure how you're storing the variable containing the XML, I read the XML data from a file to be sure I'm working with the actual XML, which gives us:
require 'nokogiri'
xml = File.open('source_of_xml.xml') { |f| Nokogiri::XML(f) }
values = []
xml.xpath('SourceDetails').each do |elem|
values << elem.text.gsub(/\n/, "").split
end
p values.first.join("/") # assign this to the variable you want
# => "Origin/Identifier/0"
Does this help or guide you in any way?

Parsing XML with Ruby

I'm way new to working with XML but just had a need dropped in my lap. I have been given an unusual (to me) XML format. There are colons within the tags.
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name</PART1:Name>
</THING1:things>
It is a large file and there is much more to it than this but I hope this format will be familiar to someone. Does anyone know a way to approach an XML document of this sort?
I'd rather not just write a brute-force way of parsing the text but I can't seem to make any headway with REXML or Hpricot and I suspect it is due to these unusual tags.
my ruby code:
require 'hpricot'
xml = File.open( "myfile.xml" )
doc = Hpricot::XML( xml )
(doc/:things).each do |thg|
[ 'Id', 'Name' ].each do |el|
puts "#{el}: #{thg.at(el).innerHTML}"
end
end
...which is just lifted from: http://railstips.org/blog/archives/2006/12/09/parsing-xml-with-hpricot/
And I figured I would be able to work some things out from here, but this code returns nothing. It doesn't error. It just returns.
As @pguardiario mentioned, Nokogiri is the de facto XML and HTML parsing library. If you wanted to print out the Id and Name values in your example, here is how you would do it:
require 'nokogiri'
xml_str = <<EOF
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name</PART1:Name>
</THING1:things>
EOF
doc = Nokogiri::XML(xml_str)
thing = doc.at_xpath('//things')
puts "ID = " + thing.at_xpath('//Id').content
puts "Name = " + thing.at_xpath('//Name').content
A few notes:
at_xpath is for matching one thing. If you know you have multiple items, you want to use xpath instead.
Depending on your document, namespaces can be problematic, so calling doc.remove_namespaces! can help (see this answer for a brief discussion).
You can use the css methods instead of xpath if you're more comfortable with those.
Definitely play around with this in irb or pry to investigate methods.
Resources
Parsing an HTML/XML document
Getting started with Nokogiri
Update
To handle multiple items, you need a root element, and you need to remove the // in the xpath query.
require 'nokogiri'
xml_str = <<EOF
<root>
<THING1:things type="Container">
<PART1:Id type="Property">1234</PART1:Id>
<PART1:Name type="Property">The Name1</PART1:Name>
</THING1:things>
<THING2:things type="Container">
<PART2:Id type="Property">2234</PART2:Id>
<PART2:Name type="Property">The Name2</PART2:Name>
</THING2:things>
</root>
EOF
doc = Nokogiri::XML(xml_str)
doc.xpath('//things').each do |thing|
puts "ID = " + thing.at_xpath('Id').content
puts "Name = " + thing.at_xpath('Name').content
end
This will give you:
ID = 1234
Name = The Name1
ID = 2234
Name = The Name2
If you are more familiar with CSS selectors, you can use this nearly identical bit of code:
doc.css('things').each do |thing|
puts "ID = " + thing.at_css('Id').content
puts "Name = " + thing.at_css('Name').content
end
If you are in a Rails environment, the Hash class is extended and you can take advantage of the from_xml method:
xml = File.open("myfile.xml")
data = Hash.from_xml(xml)

How to parse XML to CSV where data is in attributes only

The XML file I am trying to parse has all the data contained in attributes. I found how to build the string to insert into the text file.
I have this XML file:
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
And I want to parse it into a text file like this with the class ref duplicated for each property:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
This is the code I have so far:
require 'nokogiri'
doc = Nokogiri::XML(File.open("file.xml"), 'UTF-8') do |config|
config.strict
end
content = doc.xpath("//ig:prescribed_item/@class_ref").map {|i|
i.search("//ig:prescribed_item/ig:prescribed_property/@property_ref").map { |d| d.text }
}
puts content.inspect
content.each do |c|
puts c.join('|')
end
I'd simplify it a bit using CSS accessors:
xml = <<EOT
<ig:prescribed_item class_ref="0161-1#01-765557#1">
<ig:prescribed_property property_ref="0161-1#02-016058#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
<ig:prescribed_property property_ref="0161-1#02-016059#1" is_required="false" combination_allowed="false" one_of_allowed="false">
<dt:measure_number_type representation_ref="0161-1#04-000005#1">
<dt:real_type>
<dt:real_format pattern="\d(1,)\.\d(1,)"/>
</dt:real_type>
<dt:prescribed_unit_of_measure UOM_ref="0161-1#05-003260#1"/>
</dt:measure_number_type>
</ig:prescribed_property>
</ig:prescribed_item>
</ig:identification_guide>
EOT
require 'nokogiri'
doc = Nokogiri::XML(xml)
data = [ %w[ class_ref property_ref is_required UOM_ref] ]
doc.css('|prescribed_item').each do |pi|
pi.css('|prescribed_property').each do |pp|
data << [
pi['class_ref'],
pp['property_ref'],
pp['is_required'],
pp.at_css('|prescribed_unit_of_measure')['UOM_ref']
]
end
end
puts data.map{ |row| row.join('|') }
Which outputs:
class_ref|property_ref|is_required|UOM_ref
0161-1#01-765557#1|0161-1#02-016058#1|false|0161-1#05-003260#1
0161-1#01-765557#1|0161-1#02-016059#1|false|0161-1#05-003260#1
Could you explain this line in greater detail: "pp.at_css('|prescribed_unit_of_measure')['UOM_ref']"?
In Nokogiri, there are two types of "find a node" methods: The "search" methods return all nodes that match a particular accessor as a NodeSet, and the "at" methods return the first Node of the NodeSet which will be the first encountered Node that matched the accessor.
The "search" methods are things like search, css, xpath and /. The "at" methods are things like at, at_css, at_xpath and %. Both search and at accept either XPath or CSS accessors.
Back to pp.at_css('|prescribed_unit_of_measure')['UOM_ref']: At that point in the code pp is a local variable containing a "prescribed_property" Node. So, I'm telling the code to find the first node under pp that matches the CSS |prescribed_unit_of_measure accessor, in other words the first <dt:prescribed_unit_of_measure> tag contained by the pp node. When Nokogiri finds that node, it returns the value of the UOM_ref attribute of the node.
As an FYI, the / and % operators are aliased to search and at respectively in Nokogiri. They're part of its "Hpricot" compatibility; we used to use them a lot when Hpricot was the XML/HTML parser of choice, but they're not idiomatic for most Nokogiri developers. I suspect it's to avoid confusion with the regular use of the operators; at least it is in my case.
Also, Nokogiri's CSS accessors have some extra-special juiciness; They support namespaces, like the XPath accessors do, only they use |. Nokogiri will let us ignore the namespaces, which is what I did. You'll want to nose around in the Nokogiri docs for CSS and namespaces for more information.
There are definitely ways of parsing based on attributes.
The Engine Yard article "Getting started with Nokogiri" has a full description.
But quickly, the examples they give are:
To match "h3" tags that have a class attribute, we write:
h3[@class]
To match "h3" tags whose class attribute is equal to the string "r", we write:
h3[@class = "r"]
Using the attribute matching construct, we can modify our previous query to:
//h3[@class = "r"]/a[@class = "l"]

How to replace every occurrence of a pattern in a string using Ruby?

I have an XML file which is too big. To make it smaller, I want to replace all tags and attribute names with shorter versions of the same thing.
So, I implemented this:
string.gsub!(/<(\w+) /) do |match|
case match
when 'Image' then 'Img'
when 'Text' then 'Txt'
end
end
puts string
which deletes all opening tags but does not do much else.
What am I doing wrong here?
Here's another way:
class String
def minimize_tags!
{"image" => "img", "text" => "txt"}.each do |from,to|
gsub!(/<#{from}\b/i,"<#{to}")
gsub!(/<\/#{from}>/i,"<\/#{to}>")
end
self
end
end
This will probably be a little easier to maintain, since the replacement patterns are all in one place. And on strings of any significant size, it may be a lot faster than Kevin's way. I did a quick speed test of these two methods using the HTML source of this stackoverflow page itself as the test string, and my way was about 6x faster...
Here's the beauty of using a parser such as Nokogiri:
This lets you manipulate selected tags (nodes) and their attributes:
require 'nokogiri'
xml = <<EOT
<xml>
<Image ImagePath="path/to/image">image comment</Image>
<Text TextFont="courier" TextSize="9">this is the text</Text>
</xml>
EOT
doc = Nokogiri::XML(xml)
doc.search('Image').each do |n|
n.name = 'img'
n.attributes['ImagePath'].name = 'path'
end
doc.search('Text').each do |n|
n.name = 'txt'
n.attributes['TextFont'].name = 'font'
n.attributes['TextSize'].name = 'size'
end
print doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <img path="path/to/image">image comment</img>
# >> <txt font="courier" size="9">this is the text</txt>
# >> </xml>
If you need to iterate through every node, maybe to do a universal transformation on the tag-name, you can use doc.search('*').each. That would be slower than searching for individual tags, but might result in less code if you need to change every tag.
The nice thing about using a parser is it'll work even if the layout of the XML changes since it doesn't care about whitespace, and will work even if attribute order changes, making your code more robust.
Try this:
string.gsub!(/(<\/?)(\w+)/) do |match|
tag_mark = $1
case $2
when /^image$/i
"#{tag_mark}Img"
when /^text$/i
"#{tag_mark}Txt"
else
match
end
end
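For context on why the attempt in the question deleted the tags: the gsub! block receives the whole match (for example "<Image ", including the angle bracket and trailing space), so no when branch matches, the case expression returns nil, and the match is replaced with an empty string. Below is a minimal sketch of a lookup-table variant of the fix; the `SHORT_NAMES` table and `shorten_tags` method are hypothetical names introduced here.

```ruby
# Hypothetical lookup table: full tag name => shortened tag name.
SHORT_NAMES = { 'Image' => 'Img', 'Text' => 'Txt' }.freeze

# Replace the name in every opening or closing tag; unknown names are kept as-is.
def shorten_tags(xml)
  xml.gsub(%r{(</?)(\w+)}) { "#{$1}#{SHORT_NAMES.fetch($2, $2)}" }
end

# The original approach: the block sees "<Image ", not "Image", so it returns nil.
broken = '<Image x="1">'.gsub(/<(\w+) /) do |match|
  case match
  when 'Image' then 'Img'
  when 'Text' then 'Txt'
  end
end

fixed = shorten_tags('<Image src="a.png">pic</Image><Other/>')

puts broken
puts fixed
```

Like the other regex answers, this only renames tags, not attributes, and it is case-sensitive; a parser-based approach remains more robust for anything beyond simple renames.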
