Parsing with Ruby

I am new to Ruby and I have a school project where I am parsing an XML file and need to get the data after certain tags. I can only use core Ruby, no gems.
pFile = File.open("myfile.mzML", "r")
regmsLvl = "ms level\" value=\""
pFile.each_line { |line|
scn = line.scan(/#{regmsLvl}(\d)/)
#what I want to do but doesn't work
if scn == 1
puts("Got it!")
end
#what I have to do to compare if == 1
if scn != nil
scn.each do |val|
if val[0].to_i == 1
puts("Got it!")
end
end
end
}
# a sample line that I am parsing is:
<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />
This seems silly.
The output of line.scan makes scn a 2D array. How can I make it a plain string that gets overwritten on each pass? Or how should I change this whole thing? Any suggestions are appreciated.
puts(scn) prints out the 1, but if I do scn == 1 or scn.to_i == 1 it never gets into the if. I have tried scn.pop and scn.pop.pop.
I have added a section to show what I am trying to do now.
I need to check the ms level: if 1 then get scan start time and then the binary. This is the code that I am now working with.
xmlfile = File.new("afile.mzML")
xmldoc = Document.new(xmlfile)
root = xmldoc.root
puts "Root element : " + root.attributes["xmlns"]
xmldoc.elements.each("mzML/run/spectrumList/spectrum/cvParam"){
|e| if e.attributes["value"].to_i ==1
# Now I need to get start time: #
["mzML/run/spectrumList/spectrum/cvParam/scanList/scan/value"]
# and then
["mzML/run/spectrumList/spectrum/cvParam/binaryDataArrayList/binaryDataArray/binary"]
end
}
<run id="ru_0" defaultInstrumentConfigurationRef="ic_0" sampleRef="sa_0" defaultSourceFileRef="sf_ru_0">
<spectrumList count="3310" defaultDataProcessingRef="dp_sp_0">
<spectrum id="scan=8839" index="0" defaultArrayLength="171" dataProcessingRef="dp_sp_0">
<cvParam cvRef="MS" accession="MS:1000525" name="spectrum representation" />
<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />
<cvParam cvRef="MS" accession="MS:1000294" name="mass spectrum" />
<cvParam cvRef="MS" accession="MS:1000130" name="positive scan" />
<scanList count="1">
<cvParam cvRef="MS" accession="MS:1000795" name="no combination" />
<scan>
<cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="5429.47" unitAccession="UO:0000010" unitName="second" unitCvRef="UO" />
</scan>
</scanList>
<binaryDataArrayList count="2">
<binaryDataArray encodedLength="1824">
<cvParam cvRef="MS" accession="MS:1000514" name="m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<binary>AAAAQBCdgkAAAACAP6KCQAAAAAA8pIJAAAAAYAWlgkAAAABgQ6aCQAAAAGCzp4JAAAAAQEaogkAAAACgDKqCQAAAAEAgqoJAAAAAwEOqgkAAAABAWKqCQAAAAGBErIJAAAAAIOetgkAAAABAMLCCQAAAAGDlsYJAAAAA4DeygkAAAACAw7SCQAAAACBauIJAAAAAwFC6gkAAAACAYb6CQAAAAIDnwYJAAAAAwDjHgkAAAAAATMyCQAAAAADnzIJAAAAAAArOgkAAAACgTc6CQAAAAKBqzoJAAAAAQJLPgkAAAACAVNCCQAAAAAAK0oJAAAAAIF7SgkAAAADABNSCQAAAAKAx1YJAAAAAYHXXgkAAAAAg3teCQAAAAOAf2oJAAAAAICbcgkAAAAAAx92CQAAAAKA03oJAAAAAIBXigkAAAABAO+KCQAAAAKCr5YJAAAAAYMnlgkAAAADgK+aCQAAAAKDq6YJAAAAAAC3qgkAAAACgNe6CQAAAAMCA74JAAAAAANL0gkAAAAAAUfiCQAAAAOCt+YJAAAAA4O75gkAAAACAPPqCQAAAAGBq/oJAAAAAwEQCg0AAAABAKAqDQAAAAAAoDoNAAAAA4G0Og0AAAADAZhKDQAAAACCBEoNAAAAAwIQWg0AAAABAjheDQAAAAMA+GoNAAAAAQIYag0AAAAAA7RyDQAAAAEB9HYNAAAAAwIseg0AAAADgbyKDQAAAAAAPJINAAAAAgEUlg0AAAACgYCaDQAAAAOBfKoNAAAAA4DAug0AAAADAZi+DQAAAAAA0MINAAAAAoFMwg0AAAAAgMjKDQAAAACA2NINAAAAAgDk2g0AAAAAg+DyDQAAAAOAfPoNAAAAAAKU/g0AAAAAgQUKDQAAAAKBVQoNAAAAAYNRHg0AAAAAgf0qDQAAAAICZSoNAAAAAIDFQg0AAAAAgM1KDQAAAAEBjUoNAAAAAoGNUg0AAAAAAZ1aDQAAAAABqWINAAAAAYHhZg0AAAACAfl2DQAAAAEAcXoNAAAAAICpfg0AAAADgw2GDQAAAAACmZ4NAAAAAQDRog0AAAABAiWqDQAAAAAAibYNAAAAAQHpug0AAAABAEnKDQAAAAABCcoNAAAAAoHxyg0AAAACgGXaDQAAAAMBDdoNAAAAAgJR2g0AAAAAgHHqDQAAAAEBGeoNAAAAAIHh6g0AAAABAl3qDQAAAAKCkfYNAAAAAYE5+g0AAAAAAm36DQAAAAEDigYNAAAAAQGWCg0AAAABAjYKDQAAAACClgoNAAAAA4ESGg0AAAABgYIaDQAAAAMDSh4NAAAAAYCqIg0AAAADAT4qDQAAAAACCioNAAAAAwJmOg0AAAABAnZKDQAAAAKDJlINAAAAAgHGWg0AAAABgl5eDQAAAAEB4mINAAAAA4B2eg0AAAADgKKCDQAAAAGAvooNAAAAAwJakg0AAAABAUaiDQAAAAGBgqoNAAAAAIBatg0AAAADAxa6DQAAAAKCosoNAAAAAICy6g0AAAAAAbrqDQAAAAACRuoNAAAAAAMa/g0AAAACgOsCDQAAAAABzwoNAAAAAIOTCg0AAAACADcWDQAAAAGB4xoNAAAAAQOfGg0AAAAAAvceDQAAAAEBZyoNAAAAA4OnKg0AAAAAgMs6DQAAAAOC/z4NAAAAAYInUg0AAAABgftaDQAAAAODC1oNAAAAAwJXXg0AAAAAAgdiDQAAAAKA/2oNAAAAAoILag0AAAABghtyDQAAAAGCm3INAAAAAAO7cg0AAAACgr9+DQAAAAGCY4oNAAAAAgDbkg0AAAABAN+WDQAAAAKBU5oNA</binary>
</binaryDataArray>

I think you were pretty close. Assuming you can use the REXML library (which looks like it ships with Ruby's standard library), you should be able to do this:
require 'rexml/document'
xmlfile = File.new("afile.mzML")
xmldoc = REXML::Document.new(xmlfile)
root = xmldoc.root
start_time = nil
binary = nil
# get the ms level
ms_level = root.elements["spectrumList/spectrum/cvParam[@name='ms level']"].attributes["value"].to_i
if ms_level == 1
# get the scan start time
start_time = root.elements["spectrumList/spectrum/scanList/scan/cvParam[@name='scan start time']"].attributes["value"]
# get the binary
binary = root.elements["spectrumList/spectrum/binaryDataArrayList/binaryDataArray/binary"].text
end
p start_time # => "5429.47"
p binary # => that crazy long binary
This REXML tutorial is helpful: http://www.germane-software.com/software/rexml/docs/tutorial.html
Note, I made a few assumptions, like the elements would always exist, the ms level was always an int, the file structure is always the same. Those assumptions may not be true in your situation but this should be a start.
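If the file holds many spectra and some of them lack these elements, the same lookups can be run once per spectrum with nil checks. This is only a sketch under assumptions: the inline XML is a trimmed-down, made-up sample rather than a real mzML file, and the results array and its keys are names I picked for illustration.

```ruby
require 'rexml/document'

# Hypothetical, trimmed-down sample; a real mzML file has more wrapping elements.
xml = <<~XML
  <run id="ru_0">
    <spectrumList count="2">
      <spectrum id="scan=1" index="0">
        <cvParam name="ms level" value="1" />
        <scanList count="1">
          <scan>
            <cvParam name="scan start time" value="5429.47" />
          </scan>
        </scanList>
        <binaryDataArrayList count="1">
          <binaryDataArray>
            <binary>AAAA</binary>
          </binaryDataArray>
        </binaryDataArrayList>
      </spectrum>
      <spectrum id="scan=2" index="1">
        <cvParam name="ms level" value="2" />
      </spectrum>
    </spectrumList>
  </run>
XML

doc = REXML::Document.new(xml)
results = []

# "//spectrum" finds spectrum elements at any depth below the document root.
REXML::XPath.each(doc, "//spectrum") do |spectrum|
  # Paths are now relative to the current spectrum element.
  level = spectrum.elements["cvParam[@name='ms level']"]
  next unless level && level.attributes["value"].to_i == 1

  time   = spectrum.elements["scanList/scan/cvParam[@name='scan start time']"]
  binary = spectrum.elements["binaryDataArrayList/binaryDataArray/binary"]

  results << {
    id:         spectrum.attributes["id"],
    start_time: time && time.attributes["value"],  # nil when the element is missing
    binary:     binary && binary.text
  }
end

p results
```

Working relative to each spectrum keeps the start time and binary tied to the spectrum whose ms level was checked, instead of always reading the first match in the file.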

Related

How to filter XML elements by date range in Ruby

I typically use Nokogiri as my XML parser.
I have the following XML:
<albums>
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<classix_nouveaux album="Night People"/>
<release_date value="19820501"/>
</classix_nouveaux>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
</albums>
I want to get all albums that were released between 1/1/1980 and 4/15/1982:
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
How do I filter/query the XML by a release_date range?
Your XML is malformed. After parsing, here's what Nokogiri has to say about it:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: albums line 1 and classix_nouveaux>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]
That's because:
<classix_nouveaux album="Night People"/>
and
<engligh_beat album="I Just Can't Stop It"/>
are self-closing, so the later closing tags have nothing to close. Instead they should be:
<classix_nouveaux album="Night People">
and
<engligh_beat album="I Just Can't Stop It">
You can use CSS or XPath selectors to find exact matches, or even substring matches, but neither CSS nor XPath understands "ranges" of dates, nor do they have any idea of what a Date is. You'd have to extract all the nodes, convert each date value into a Date object (or an integer, in this case), then compare it to the range:
date_range = 19800501..19820401
selected_albums = doc.search('//release_date').select { |rd| date_range.include?(rd['value'].to_i) }.map { |rd| rd.parent }
selected_albums.map(&:to_xml)
# => ["<aldo_nova album=\"aldo nova\">\n" +
# " <release_date value=\"19820401\"/>\n" +
# "</aldo_nova>",
# "<engligh_beat album=\"I Just Can't Stop It\">\n" +
# " <release_date value=\"19800501\"/>\n" +
# "</engligh_beat>"]
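The integer comparison works because YYYYMMDD values sort in date order. If you would rather compare real Date objects, which also rejects impossible dates, something like the following should do. It is a sketch that uses only the standard library: the release_dates hash is a hypothetical stand-in for values already pulled out of the release_date nodes, and the bounds are the ones from the question (1/1/1980 to 4/15/1982) rather than the narrower integer range above.

```ruby
require 'date'

# Hypothetical values, as they would come out of the release_date "value" attributes.
release_dates = {
  "aldo_nova"        => "19820401",
  "classix_nouveaux" => "19820501",
  "engligh_beat"     => "19800501"
}

# Bounds taken from the question: 1 Jan 1980 through 15 Apr 1982.
wanted = Date.new(1980, 1, 1)..Date.new(1982, 4, 15)

# Date.strptime raises on a malformed or impossible date instead of silently comparing it.
selected = release_dates.select do |_name, raw|
  wanted.cover?(Date.strptime(raw, "%Y%m%d"))
end

p selected.keys
```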
I think your XML is poorly designed because you have varying tag names for what should be an album. <album> should be a child of <albums>. I'd recommend something like this:
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
Once the XML is in a standard form, then it becomes easier to navigate and search:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
EOT
doc.search('album').last['title'] # => "I Just Can't Stop It"
band = 'aldo nova'
doc.search("//album[@band='#{band}']").map { |a| a['title'] } # => ["aldo nova"]
and searching for dates becomes more straightforward because it's not necessary to find the parent of the node:
date_range = 19800501..19820401
selected_albums = doc.search('album').select { |a| date_range.include?(a['release_date'].to_i) }
selected_albums.map(&:to_xml)
# => ["<album band=\"aldo nova\" title=\"aldo nova\" release_date=\"19820401\"/>",
# "<album band=\"english beat\" title=\"I Just Can't Stop It\" release_date=\"19800501\"/>"]
I'd recommend reading some tutorials on XML itself as it's easy to paint ourselves into corners if the data isn't represented logically and correctly.

Ruby Hash parsed_response error

BACKGROUND
I am using HTTParty to parse an XML hash response. Unfortunately, when the hash response only has one entry(?), the resulting hash is not indexable. I have confirmed the resulting XML syntax is the same for single and multiple entry(?). I have also confirmed my code works when there are always multiple entries(?) in the hash.
QUESTION
How do I accommodate the single hash entry case and/or is there an easier way to accomplish what I am trying to do?
CODE
require 'httparty'
class Rest
include HTTParty
format :xml
end
def test_redeye
# rooms and devices
roomID = Hash.new
deviceID = Hash.new { |h,k| h[k] = Hash.new }
rooms = Rest.get(@reIp["theater"] + "/redeye/rooms/").parsed_response["rooms"]
puts "rooms #{rooms}"
rooms["room"].each do |room|
puts "room #{room}"
roomID[room["name"].downcase.strip] = "/redeye/rooms/" + room["roomId"]
puts "roomid #{roomID}"
devices = Rest.get(@reIp["theater"] + roomID[room["name"].downcase.strip] + "/devices/").parsed_response["devices"]
puts "devices #{devices}"
devices["device"].each do |device|
puts "device #{device}"
deviceID[room["name"].downcase.strip][device["displayName"].downcase.strip] = "/devices/" + device["deviceId"]
puts "deviceid #{deviceID}"
end
end
say "Done"
end
XML - SINGLE ENTRY
<?xml version="1.0" encoding="UTF-8" ?>
<devices>
<device manufacturerName="Philips" description="" portType="infrared" deviceType="0" modelName="" displayName="TV" deviceId="82" />
</devices>
XML - MULTIPLE ENTRY
<?xml version="1.0" encoding="UTF-8" ?>
<devices>
<device manufacturerName="Denon" description="" portType="infrared" deviceType="6" modelName="Avr-3311ci" displayName="AVR" deviceId="77" />
<device manufacturerName="Philips" description="" portType="infrared" deviceType="0" modelName="" displayName="TV" deviceId="82" />
</devices>
RESULTING ERROR
[Info - Plugin Manager] Matches, executing block
rooms {"room"=>[{"name"=>"Home Theater", "currentActivityId"=>"78", "roomId"=>"-1", "description"=>""}, {"name"=>"Living", "currentActivityId"=>"-1", "roomId"=>"81", "description"=>"2nd Floor"}, {"name"=>"Theater", "currentActivityId"=>"-1", "roomId"=>"80", "description"=>"1st Floor"}]}
room {"name"=>"Home Theater", "currentActivityId"=>"78", "roomId"=>"-1", "description"=>""}
roomid {"home theater"=>"/redeye/rooms/-1"}
devices {"device"=>[{"manufacturerName"=>"Denon", "description"=>"", "portType"=>"infrared", "deviceType"=>"6", "modelName"=>"Avr-3311ci", "displayName"=>"AVR", "deviceId"=>"77"}, {"manufacturerName"=>"Philips", "description"=>"", "portType"=>"infrared", "deviceType"=>"0", "modelName"=>"", "displayName"=>"TV", "deviceId"=>"82"}]}
device {"manufacturerName"=>"Denon", "description"=>"", "portType"=>"infrared", "deviceType"=>"6", "modelName"=>"Avr-3311ci", "displayName"=>"AVR", "deviceId"=>"77"}
deviceid {"home theater"=>{"avr"=>"/devices/77"}}
device {"manufacturerName"=>"Philips", "description"=>"", "portType"=>"infrared", "deviceType"=>"0", "modelName"=>"", "displayName"=>"TV", "deviceId"=>"82"}
deviceid {"home theater"=>{"avr"=>"/devices/77", "tv"=>"/devices/82"}}
room {"name"=>"Living", "currentActivityId"=>"-1", "roomId"=>"81", "description"=>"2nd Floor"}
roomid {"home theater"=>"/redeye/rooms/-1", "living"=>"/redeye/rooms/81"}
devices {"device"=>{"manufacturerName"=>"Philips", "description"=>"", "portType"=>"infrared", "deviceType"=>"0", "modelName"=>"", "displayName"=>"TV", "deviceId"=>"82"}}
device ["manufacturerName", "Philips"]
/usr/local/rvm/gems/ruby-1.9.3-p374@SiriProxy/gems/siriproxy-0.3.2/plugins/siriproxy-redeye/lib/siriproxy-redeye.rb:145:in `[]': can't convert String into Integer (TypeError)
There are a couple of options I see. If you control the endpoint, you could modify the XML being sent to accommodate HTTParty's underlying XML parser, Crack, by putting a type="array" attribute on the devices XML element.
Otherwise, you could check to see what class the device is before indexing into it:
case devices["device"]
when Array
# act on the collection
else
# act on the single element
end
It's much less than ideal whenever you have to do type-checking in a dynamic language, so if you find yourself doing this more than once it may be worth introducing polymorphism or at the very least extracting a method to do this.
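To make the "extract a method" idea concrete, here is one possible shape for it. The failure in the log comes from calling each on a single Hash, which yields [key, value] pairs such as ["manufacturerName", "Philips"] rather than device hashes. The helper name as_list and the sample hashes are my own, for illustration only. Note that Kernel#Array is not a shortcut here, since Array(hash) also turns the hash into pairs.

```ruby
# Hypothetical helper: always returns an Array, whatever the XML parser produced.
def as_list(value)
  case value
  when nil   then []        # no <device> elements at all
  when Array then value     # several elements: already a list
  else            [value]   # one element: wrap the lone Hash
  end
end

# Made-up data in the two shapes shown in the log output.
single   = { "device" => { "displayName" => "TV", "deviceId" => "82" } }
multiple = { "device" => [{ "displayName" => "AVR", "deviceId" => "77" },
                          { "displayName" => "TV",  "deviceId" => "82" }] }

[single, multiple].each do |devices|
  as_list(devices["device"]).each do |device|
    # device is a Hash in both cases, so string keys work.
    puts "#{device["displayName"].downcase} => /devices/#{device["deviceId"]}"
  end
end
```

With this in place, the loop in the question would iterate over as_list(devices["device"]) instead of devices["device"], and the same wrapper would cover rooms["room"].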

Should Nokogiri::XML.parse be creating separate Text nodes for linefeeds?

I have an XML document created by an outside tool:
<?xml version="1.0" encoding="UTF-8"?>
<suite>
<id>S1</id>
<name>First Suite</name>
<description></description>
<sections>
<section>
<name>section 1</name>
<cases>
<case>
<id>C1</id>
<title>Test 1.1</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
<case>
<id>C2</id>
<title>Test 1.2</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
</cases>
</section>
</sections>
</suite>
From irb, I do the following: (Output suppressed until final command)
> require('nokogiri')
> doc = Nokogiri::XML.parse(open('./test.xml'))
> test_case = doc.search('case').first
=> #<Nokogiri::XML::Element:0x3ff75851bc44 name="case" children=[#<Nokogiri::XML::Text:0x3ff75851b8fc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b7bc name="id" children=[#<Nokogiri::XML::Text:0x3ff75851b474 "C1">]>, #<Nokogiri::XML::Text:0x3ff75851b1cc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b078 name="title" children=[#<Nokogiri::XML::Text:0x3ff75851ad58 "Test 1.1">]>, #<Nokogiri::XML::Text:0x3ff75851aa9c "\n ">, #<Nokogiri::XML::Element:0x3ff75851a970 name="type" children=[#<Nokogiri::XML::Text:0x3ff75851a6c8 "Other">]>, #<Nokogiri::XML::Text:0x3ff7585191d8 "\n ">, #<Nokogiri::XML::Element:0x3ff7585190d4 name="priority" children=[#<Nokogiri::XML::Text:0x3ff758518d64 "4 - Must Test">]>, #<Nokogiri::XML::Text:0x3ff758518ad0 "\n ">, #<Nokogiri::XML::Element:0x3ff7585189a4 name="estimate">, #<Nokogiri::XML::Text:0x3ff758518670 "\n ">, #<Nokogiri::XML::Element:0x3ff758518558 name="milestone">, #<Nokogiri::XML::Text:0x3ff7585182b0 "\n ">, #<Nokogiri::XML::Element:0x3ff758518184 name="references">, #<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">]>
This results in a number of children that look like the following:
#<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">
I want to iterate through these XML nodes without having to do something like:
> real_nodes = test_case.children.reject{|n| n.node_name == 'text' && n.content.strip.empty?}
I couldn't find a parse parameter in the Nokogiri docs to suppress the treating of newlines as separate nodes. Is there a way to do this during the parse instead of after?
Check the documentation. You can just do this:
doc = Nokogiri::XML.parse(open('./test.xml')) do |config|
config.noblanks
end
That will load the file without any empty nodes.
The text nodes are the result of pretty-printing the XML. The spec doesn't require whitespace between tags, and, for efficiency, a huge XML file could be stripped of inter-tag whitespace to save space and reduce transfer time, without sacrificing the data content.
This might show what's happening:
require 'nokogiri'
xml = '<foo></foo>'
Nokogiri::XML(xml).at('foo').child
=> nil
With no whitespace between the tags there's no text node either.
xml = '<foo>
</foo>'
Nokogiri::XML(xml).at('foo').child
=> #<Nokogiri::XML::Text:0x3fcee9436ff0 "\n">
Nokogiri::XML(xml).at('foo').child.class
=> Nokogiri::XML::Text
With whitespace for pretty-printing, the XML has a text node following the foo tag.

REXML parsing an XML in ruby

Folks,
I am using REXML for a sample XML file:
<Accounts title="This is the test title">
<Account name="frenchcustomer">
<username name = "frencu"/>
<password pw = "hello34"/>
<accountdn dn = "https://frenchcu.com/"/>
<exporttest name="basic">
<exportname name = "basicexport"/>
<exportterm term = "oldschool"/>
</exporttest>
</Account>
<Account name="britishcustomer">
<username name = "britishcu"/>
<password pw = "mellow34"/>
<accountdn dn = "https://britishcu.com/"/>
<exporttest name="existingsearch">
<exportname name = "largexpo"/>
<exportterm term = "greatschool"/>
</exporttest>
</Account>
</Accounts>
I am reading the XML like this:
@data = (REXML::Document.new file).root
@dataarr = @@testdata.elements.to_a("//Account")
Now I want to get the username of the frenchcustomer, so I tried this:
@dataarr[@name=fenchcustomer].elements["username"].attributes["name"]
This fails. I do not want to use the array index; for example,
@dataarr[1].elements["username"].attributes["name"]
will work, but I don't want to do that. Is there something that I am missing here? I want to use the array and get the username of the French user using the Account name.
Thanks a lot.
I recommend using XPath.
For the first match, you can use the first method; for an array of all matches, use match.
The code below returns the username for the Account "frenchcustomer":
REXML::XPath.first(yourREXMLDocument, "//Account[@name='frenchcustomer']/username/@name").value
If you really want to use the array created with @@testdata.elements.to_a("//Account"), you could use the find method:
french_cust_elt = the_array.find { |elt| elt.attributes['name'].eql?('frenchcustomer') }
french_username = french_cust_elt.elements["username"].attributes["name"]
puts @data.elements["//Account[@name='frenchcustomer']"]
.elements["username"]
.attributes["name"]
If you want to iterate over multiple identical names:
@data.elements.each("//Account[@name='frenchcustomer']") do |fc|
puts fc.elements["username"].attributes["name"]
end
I don't know what your @@testdata is, so I tried with the following test code:
require "rexml/document"
@data = (REXML::Document.new DATA).root
@dataarr = @data.elements.to_a("//Account")
# Works
p @dataarr[1].elements["username"].attributes["name"]
# Does not work
#~ p @dataarr[@name='fenchcustomer'].elements["username"].attributes["name"]
# @dataarr is an array
puts "===Array#each"
@dataarr.each{|acc|
next unless acc.attributes['name'] =='frenchcustomer'
p acc.elements["username"].attributes["name"]
}
puts "===XPATH"
@data.elements.to_a("//Account[@name='frenchcustomer']").each{|acc|
p acc.elements["username"].attributes["name"]
}
__END__
<Accounts title="This is the test title">
<Account name="frenchcustomer">
<username name = "frencu"/>
<password pw = "hello34"/>
<accountdn dn = "https://frenchcu.com/"/>
<exporttest name="basic">
<exportname name = "basicexport"/>
<exportterm term = "oldschool"/>
</exporttest>
</Account>
<Account name="britishcustomer">
<username name = "britishcu"/>
<password pw = "mellow34"/>
<accountdn dn = "https://britishcu.com/"/>
<exporttest name="existingsearch">
<exportname name = "largexpo"/>
<exportterm term = "greatschool"/>
</exporttest>
</Account>
</Accounts>
I'm not very familiar with REXML, so I expect there is a better solution. But perhaps somebody can take my code and build a better one.

Trying to parse a XML using Nokogiri with Ruby

I am new to programming so bear with me. I have an XML document that looks like this:
File name: PRIDE1542.xml
<ExperimentCollection version="2.1">
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>**Protein complexes in Saccharomyces cerevisiae (GPM06600002310)**</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>**None**</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>**GPM06600002310**</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
</ExperimentCollection>
I want to take out the text between <Title>, <ProtocolName>, and <sampleName> and put it into a text file (I tried bolding them to make them easier to see). I have the following code so far (based on posts I saw on this site), but it does not seem to work:
>> require 'rubygems'
>> require 'nokogiri'
>> doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml"))
>> @ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }
Can someone help me?
Try to access them using xpath expressions. You can enter the path through the parse tree using slashes.
puts doc.xpath( "/ExperimentCollection/Experiment/Title" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/Protocol/ProtocolName" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleName" ).text
