Handling an XML file with Ruby and Nokogiri

I am new to programming so bear with me. I have many XML documents that look like this:
File name: PRIDE_Exp_Complete_Ac_10094.xml.gz
<ExperimentCollection version="2.1">
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>None</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>GPM06600002310</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
I want to extract the text inside the "Title", "ProtocolName", and "sampleName" elements and save it to a text file that has the same name as the .xml.gz file. I have the following code so far (based on posts I saw on this site), but it does not seem to work:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml.gz"))
#ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }
Can someone help me?
Thanks

If you are happy with REXML, and there is only one <Experiment> per file, then something like the following should help. (By the way, the XML above is invalid because it has no closing </ExperimentCollection> tag.)
require "rexml/document"
include REXML
xml=<<EOD
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>Protein complexes in Saccharomyces cerevisiae (GPM06600002310)</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>None</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>GPM06600002310</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
EOD
doc = Document.new xml
# Print the text of each element of interest
puts doc.elements["Experiment/Title"].text
puts doc.elements["Experiment/Protocol/ProtocolName"].text
puts doc.elements["Experiment/mzData/description/admin/sampleName"].text
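A fuller sketch of the whole task from the question (decompress the .xml.gz, pull out the three fields, write a text file with the same base name) might look like the following. It assumes each real file is well-formed XML; the method name extract_fields and the one-value-per-line output format are made up for illustration and are not from the original posts. Note that the question's code passes the gzipped file straight to the parser, so decompressing first with Zlib is probably the missing step.

```ruby
require "rexml/document"
require "zlib"

# Hypothetical helper: reads one gzipped XML file, extracts three fields
# and writes them (one per line) to a .txt file next to the input.
def extract_fields(gz_path)
  # Decompress first; an XML parser cannot read gzip data directly.
  xml = Zlib::GzipReader.open(gz_path) { |gz| gz.read }
  doc = REXML::Document.new(xml)

  xpaths = ["//Experiment/Title", "//Protocol/ProtocolName", "//admin/sampleName"]
  values = xpaths.map do |xpath|
    node = REXML::XPath.first(doc, xpath)
    node ? node.text : "" # empty string when the element is missing
  end

  # PRIDE_Exp_Complete_Ac_10094.xml.gz -> PRIDE_Exp_Complete_Ac_10094.txt
  out_path = gz_path.sub(/\.xml\.gz\z/, ".txt")
  File.write(out_path, values.join("\n") + "\n")
  out_path
end
```

To process many files, something like Dir.glob("*.xml.gz").each { |f| extract_fields(f) } should work from the directory that holds them.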

Related

Can't index data in alphabetical order for the Spanish alphabet before selecting it in a query

I have a set of assets which have a property "name".
I want to get a dynamic number of those assets, sorted alphabetically by that "name" property.
I use this query:
type=dam:Asset
path=/content/dam/en/foobar/contacts/
orderby=@jcr:content/data/master/@name
orderby.sort=asc
p.limit=3
This works, so for a set of names:
[Paloma, Abel, José, Eduardo]
it retrieves:
Abel, Eduardo, José.
The problem is with the Spanish alphabet, in which Á is the same letter as A.
So in a set of:
[Paloma, Abel, José, Álvaro, Eduardo]
it retrieves:
Abel, Eduardo, José.
Álvaro is excluded because it is not among the first 3 elements after ordering, when it should be the second. It should retrieve:
Abel, Álvaro, Eduardo.
So, to fix that, I've created a custom Oak Lucene index like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:oak="http://jackrabbit.apache.org/oak/ns/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:rep="internal"
jcr:mixinTypes="[rep:AccessControllable]"
jcr:primaryType="nt:unstructured">
<socialLucene/>
<workflowDataLucene/>
<slingeventJob/>
<jcrLanguage/>
<versionStoreIndex/>
<repMembers/>
<cqReportsLucene/>
<commerceLucene/>
<counter/>
<authorizables/>
<enablementResourceName/>
<externalPrincipalNames/>
<cmLucene/>
<foobarCFIndexFilter
jcr:primaryType="oak:QueryIndexDefinition"
async="[async,nrt]"
evaluatePathRestrictions="{Boolean}true"
includedPaths="[/content/dam/es/foobar,/content/dam/en/foobar]"
queryPaths="[/content/dam/es/foobar,/content/dam/en/foobar]"
reindex="{Boolean}false"
reindexCount="{Long}24"
seed="{Long}3850652403740003290"
type="lucene">
<analyzers jcr:primaryType="nt:unstructured">
<default jcr:primaryType="nt:unstructured">
<filters jcr:primaryType="nt:unstructured">
<Synonym
jcr:primaryType="nt:unstructured"
format="solr"
synonyms="synonyms.txt">
<synonyms.txt/>
</Synonym>
</filters>
<tokenizer
jcr:primaryType="nt:unstructured"
name="Classic"/>
</default>
</analyzers>
<indexRules jcr:primaryType="nt:unstructured">
<nt:base jcr:primaryType="nt:unstructured">
<properties jcr:primaryType="nt:unstructured">
<title
jcr:primaryType="nt:unstructured"
analyzed="{Boolean}true"
isRegexp="{Boolean}false"
name="jcr:content/data/master/title"
nodeScopeIndex="{Boolean}true"
ordered="{Boolean}true"
propertyIndex="{Boolean}true"
type="String"/>
<date
jcr:primaryType="nt:unstructured"
name="jcr:content/data/master/date"
ordered="{Boolean}true"
propertyIndex="{Boolean}true"/>
<sectors
jcr:primaryType="nt:unstructured"
name="jcr:content/data/master/sectors"
propertyIndex="{Boolean}true"/>
<contentFragment
jcr:primaryType="nt:unstructured"
name="jcr:content/contentFragment"
propertyIndex="{Boolean}true"/>
<model
jcr:primaryType="nt:unstructured"
name="cq:model"
propertyIndex="{Boolean}true"/>
<name
jcr:primaryType="nt:unstructured"
analyzed="{Boolean}true"
isRegexp="{Boolean}false"
name="jcr:content/data/master/name"
nodeScopeIndex="{Boolean}true"
ordered="{Boolean}true"
propertyIndex="{Boolean}true"
type="String"/>
</properties>
</nt:base>
</indexRules>
</foobarCFIndexFilter>
<cqProjectLucene/>
<ntFolderDamLucene/>
<acPrincipalName/>
<uuid/>
<damAssetLucene/>
<rep:policy/>
<cqPayloadPath/>
<nodetypeLucene/>
<nodetype/>
<ntBaseLucene/>
<reference/>
<principalName/>
<cqTagLucene/>
<lucene/>
<repTokenIndex/>
<externalId/>
<authorizableId/>
<cqPageLucene/>
</jcr:root>
where synonyms.txt contains:
á, a
Á, A
and so on.
I also tried a charFilter with a mapping of equivalent characters.
I have verified with the Query Performance Diagnosis tool that my query uses my custom Oak index.
But nothing works; after reindexing, the query results are the same.
How can I solve this?
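To make the ordering problem concrete outside the repository, here is a small Ruby sketch (not AEM or Oak code) that reproduces both result sets from the question. It assumes that the index sorts by raw code point, which would explain why "Álvaro" ends up after every unaccented name; the helper name sort_key is made up for illustration. Folding diacritics this way also turns "ñ" into "n", which is not correct for Spanish collation, so treat it only as an illustration of the expected order.

```ruby
# Hypothetical helper: decompose accented letters (NFD), drop the combining
# marks, and lowercase, so that "Á" compares like "a".
def sort_key(name)
  name.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

names = ["Paloma", "Abel", "José", "Álvaro", "Eduardo"]

# Plain code point order: "Á" (U+00C1) sorts after all ASCII letters.
codepoint_order = names.sort.first(3)

# Accent-folded order: "Álvaro" sorts between "Abel" and "Eduardo".
folded_order = names.sort_by { |n| sort_key(n) }.first(3)

puts codepoint_order.join(", ") # Abel, Eduardo, José
puts folded_order.join(", ")    # Abel, Álvaro, Eduardo
```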

How to use multiple conditions in XPath?

I'm new to XPath. I was trying to use the XML Task in SSIS to load some values, using Microsoft's sample XML inventory shown below.
How can I load the first-name value in bookstore/book where style is novel and award = 'Pulitzer'?
//book[@style='novel' and ./author/award/text()='Pulitzer'] is what I am trying. It returns the whole element. What should I modify to get just the first-name value?
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="myfile.xsl" ?>
<bookstore specialty="novel">
<book style="autobiography">
<author>
<first-name>Joe</first-name>
<last-name>Bob</last-name>
<award>Trenton Literary Review Honorable Mention</award>
</author>
<price>12</price>
</book>
<book style="textbook">
<author>
<first-name>Mary</first-name>
<last-name>Bob</last-name>
<publication>Selected Short Stories of
<first-name>Mary</first-name>
<last-name>Bob</last-name>
</publication>
</author>
<editor>
<first-name>Britney</first-name>
<last-name>Bob</last-name>
</editor>
<price>55</price>
</book>
<magazine style="glossy" frequency="monthly">
<price>2.50</price>
<subscription price="24" per="year"/>
</magazine>
<book style="novel" id="myfave">
<author>
<first-name>Toni</first-name>
<last-name>Bob</last-name>
<degree from="Trenton U">B.A.</degree>
<degree from="Harvard">Ph.D.</degree>
<award>P</award>
<publication>Still in Trenton</publication>
<publication>Trenton Forever</publication>
</author>
<price intl="Canada" exchange="0.7">6.50</price>
<excerpt>
<p>It was a dark and stormy night.</p>
<p>But then all nights in Trenton seem dark and
stormy to someone who has gone through what
<emph>I</emph> have.</p>
<definition-list>
<term>Trenton</term>
<definition>misery</definition>
</definition-list>
</excerpt>
</book>
<my:book xmlns:my="uri:mynamespace" style="leather" price="29.50">
<my:title>Who's Who in Trenton</my:title>
<my:author>Robert Bob</my:author>
</my:book>
</bookstore>
I got an answer:
//book[@style='novel' and ./author/award/text()='Pulitzer']//first-name
Use:
/*/book[@style='novel']/author[award = 'Pulitzer']/first-name
This selects any first-name element whose author parent has an award child with the string value 'Pulitzer', where that author's parent is a book whose style attribute has the value "novel" and whose parent is the top element of the XML document.
A similar question in the same context: how can I do the reverse? Suppose I want to find the id of all books whose price is greater than 20. I know I am being a nudge, but I really want to clear up my understanding.
Here is the needed XPath (it selects the matching book elements, from which the id attribute can be read):
//book/price[text() > 20]/..
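The step-wise expression from the answer can be tried outside SSIS. The sketch below uses Ruby's REXML as a stand-in XPath engine (the SSIS XML Task is not involved), and the XML is a trimmed, hypothetical version of the inventory in which the novel's award is set to 'Pulitzer' so that the predicate has something to match.

```ruby
require "rexml/document"

# Trimmed, hypothetical sample; not the full inventory from the question.
xml = <<~XML
  <bookstore specialty="novel">
    <book style="textbook">
      <author>
        <first-name>Mary</first-name>
        <last-name>Bob</last-name>
      </author>
      <price>55</price>
    </book>
    <book style="novel" id="myfave">
      <author>
        <first-name>Toni</first-name>
        <last-name>Bob</last-name>
        <award>Pulitzer</award>
      </author>
      <price>6.50</price>
    </book>
  </bookstore>
XML

doc = REXML::Document.new(xml)

# Filter book by attribute, then author by child value, then step to first-name.
path = "//book[@style='novel']/author[award='Pulitzer']/first-name"
names = REXML::XPath.match(doc, path).map(&:text)

puts names.inspect # ["Toni"]
```

The key point is that the final location step decides what is returned: ending the path at book returns the whole element, while ending it at first-name returns only that child.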

Exchange Server changing daily recurrence pattern to weekly?

I have registered an appointment in Outlook 2003 SP3 with recurrence pattern Daily, every workday, no end date.
The data are stored in MS Exchange Server 2010.
When I query Exchange Web Services for that event (requesting some detail properties), it returns a weekly recurrence for Monday through Friday:
<Recurrence>
<WeeklyRecurrence>
<Interval>1</Interval>
<DaysOfWeek>Monday Tuesday Wednesday Thursday Friday</DaysOfWeek>
</WeeklyRecurrence>
<NoEndRecurrence>
<StartDate>2012-12-03+01:00</StartDate>
</NoEndRecurrence>
</Recurrence>
Technically, these are the same days, but I'm storing this in another system and would like an Outlook daily appointment to show up there as a daily appointment too ;-)
Is this a known issue?
Can anything be done to prevent this?
[I can't myself convert "Weekly Mon-Fri" back to "Daily every workday" because that would modify a 'real' "Weekly Mon-Fri" appointment]
Thanks
Jan
Full request:
<soapenv:Envelope
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:typ="http://schemas.microsoft.com/exchange/services/2006/types"
xmlns:mes="http://schemas.microsoft.com/exchange/services/2006/messages">
<soapenv:Header>
<typ:RequestServerVersion Version="Exchange2007_SP1"/>
<typ:MailboxCulture>en-US</typ:MailboxCulture>
</soapenv:Header>
<soapenv:Body>
<mes:GetItem>
<mes:ItemShape>
<typ:BaseShape>IdOnly</typ:BaseShape>
<typ:BodyType>Text</typ:BodyType>
<typ:AdditionalProperties>
<typ:FieldURI FieldURI="item:Subject" />
<typ:FieldURI FieldURI="item:ReminderIsSet" />
<typ:FieldURI FieldURI="item:ReminderMinutesBeforeStart" />
<typ:FieldURI FieldURI="calendar:Location" />
<typ:FieldURI FieldURI="calendar:IsAllDayEvent" />
<typ:FieldURI FieldURI="calendar:LegacyFreeBusyStatus" />
<typ:FieldURI FieldURI="calendar:Recurrence" />
<typ:FieldURI FieldURI="item:Body"/>
</typ:AdditionalProperties>
</mes:ItemShape>
<mes:ItemIds>
<typ:ItemId Id="AQMkAD[snip]2HQAAAA=="/>
</mes:ItemIds>
</mes:GetItem>
</soapenv:Body>
</soapenv:Envelope>
Full response:
<Envelope>
<Header>
<ServerVersionInfo MajorVersion="14" MinorVersion="0" MajorBuildNumber="722" MinorBuildNumber="0" Version="Exchange2010"/>
</Header>
<Body>
<GetItemResponse>
<ResponseMessages>
<GetItemResponseMessage ResponseClass="Success">
<ResponseCode>NoError</ResponseCode>
<Items>
<CalendarItem>
<ItemId Id="AQMkAD[snip]2HQAAAA==" ChangeKey="DwAAA[snip]ns8Yn"/>
<Subject>Elke werkdag, geen einddatum</Subject>
<Body BodyType="Text"/>
<ReminderIsSet>false</ReminderIsSet>
<ReminderMinutesBeforeStart>15</ReminderMinutesBeforeStart>
<IsAllDayEvent>false</IsAllDayEvent>
<LegacyFreeBusyStatus>Busy</LegacyFreeBusyStatus>
<Location/>
<Recurrence>
<WeeklyRecurrence>
<Interval>1</Interval>
<DaysOfWeek>Monday Tuesday Wednesday Thursday Friday</DaysOfWeek>
</WeeklyRecurrence>
<NoEndRecurrence>
<StartDate>2012-12-03+01:00</StartDate>
</NoEndRecurrence>
</Recurrence>
</CalendarItem>
</Items>
</GetItemResponseMessage>
</ResponseMessages>
</GetItemResponse>
</Body>
</Envelope>
After another hour of digging I found "Daily and Weekly recurrence pattern trouble" on a Microsoft forum stating that it is not possible:
"The only way to define a recurrence pattern for "Every weekday" in EWS is to use WeeklyRecurrencePatternType. DailyPatternType can only be used to define a recurrence where each occurrence happens N day after the previous one.
In other words, there is no way to distinguish the two in EWS."
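Since the two patterns cannot be told apart, the most a consumer can do is detect the ambiguous case and handle it explicitly (for example, flag it for review). A minimal Ruby sketch of that detection follows. It assumes the Recurrence fragment has had its namespace prefixes stripped, as in the response shown above; the method name recurrence_kind and the returned symbols are made up for illustration.

```ruby
require "rexml/document"

WORKDAYS = %w[Monday Tuesday Wednesday Thursday Friday].freeze

# Hypothetical helper: classifies a <Recurrence> fragment.
# Returns :daily, :weekly_every_workday (ambiguous with Outlook's
# "every workday" daily pattern), :weekly, or :other.
def recurrence_kind(recurrence_xml)
  doc = REXML::Document.new(recurrence_xml)
  weekly = doc.elements["Recurrence/WeeklyRecurrence"]
  unless weekly
    return doc.elements["Recurrence/DailyRecurrence"] ? :daily : :other
  end

  interval = weekly.elements["Interval"].text.to_i
  days = weekly.elements["DaysOfWeek"].text.split

  # Every week, on exactly Monday..Friday: could have been created either way.
  if interval == 1 && days.sort == WORKDAYS.sort
    :weekly_every_workday
  else
    :weekly
  end
end
```

A result of :weekly_every_workday only says the appointment might have been entered as "Daily, every workday"; it does not prove it.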

I want to reprint the modified xml after deleting entire child node

<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
I want to remove the "pencil" node and print the remaining XML using REXML (Ruby). Can anybody tell me how to do that?
Use one of the delete methods (see http://rubydoc.info/stdlib/rexml/):
require "rexml/document"
string = <<EOF
<product>
<book>
<id>111</id>
<name>xxx</name>
</book>
<pen>
<id>222</id>
<name>yyy</name>
</pen>
<pencil>
<id>333</id>
<name>zzz</name>
</pencil>
</product>
EOF
doc = REXML::Document.new(string)
doc.delete_element('//pencil')
puts doc
There is also a nice tutorial to get you started: http://www.germane-software.com/software/rexml/docs/tutorial.html
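As a variation on the answer above, here is a sketch that removes every matching child rather than just the first one and then prints the result with indentation. The input is a shortened, hypothetical version of the product document with two pencil nodes; the indentation width of 2 is an arbitrary choice.

```ruby
require "rexml/document"
require "rexml/formatters/pretty"

# Shortened, hypothetical input with two <pencil> children.
source = "<product><book><id>111</id></book>" \
         "<pencil><id>333</id></pencil><pencil><id>444</id></pencil></product>"

doc = REXML::Document.new(source)

# delete_all removes all child elements of the root that match the XPath
# and returns the removed elements.
removed = doc.root.elements.delete_all("pencil")

# Pretty-print the remaining document into a string.
formatter = REXML::Formatters::Pretty.new(2)
formatter.compact = true # keep short text on the same line as its tags
output = +""
formatter.write(doc, output)

puts output
```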

traversing ruby map issues

I'm pulling the following XML from the MediaWiki API:
<?xml version="1.0"?>
<api>
<query>
<pages>
<page pageid="309311" ns="0" title="Chenonetta jubata">
<images>
<im ns="6" title="File:Australian Wood Duck.jpg" />
<im ns="6" title="File:Australian Wood Duck Female.JPG" />
<im ns="6" title="File:Australian Wood Duck Male.JPG" />
...
</images>
</page>
</pages>
</query>
</api>
and reading it into a Ruby hash using XmlSimple. What I'm really trying to get is the image names from the images section, but when I attempt to go past the query level with
x= result['query']['pages']
puts x
I'm getting the following error:
in `[]': can't convert String into Integer (TypeError)
What am I doing wrong?
Thanks,
m
I used Nokogiri in the end, which allows XPath notation to traverse the XML tree, e.g.
licenseinfo = results3.xpath("//api/query/pages/page/categories/cl/@title")
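The TypeError in the question is consistent with XmlSimple wrapping child elements in arrays by default, so that result['query'] would be an Array and indexing it with a String would fail. That is an assumption, since the parsing call is not shown. For the image titles themselves, the same XPath approach works with REXML from the standard distribution. The sketch below uses the XML from the question with the elided "..." line left out.

```ruby
require "rexml/document"

xml = <<~XML
  <?xml version="1.0"?>
  <api>
    <query>
      <pages>
        <page pageid="309311" ns="0" title="Chenonetta jubata">
          <images>
            <im ns="6" title="File:Australian Wood Duck.jpg" />
            <im ns="6" title="File:Australian Wood Duck Female.JPG" />
            <im ns="6" title="File:Australian Wood Duck Male.JPG" />
          </images>
        </page>
      </pages>
    </query>
  </api>
XML

doc = REXML::Document.new(xml)

# Select every <im> element under images and read its title attribute.
image_titles = REXML::XPath.match(doc, "/api/query/pages/page/images/im")
                           .map { |im| im.attributes["title"] }

puts image_titles
```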
