Trying to parse an XML file using Nokogiri with Ruby - ruby

I am new to programming so bear with me. I have an XML document that looks like this:
File name: PRIDE1542.xml
<ExperimentCollection version="2.1">
<Experiment>
<ExperimentAccession>1015</ExperimentAccession>
<Title>**Protein complexes in Saccharomyces cerevisiae (GPM06600002310)**</Title>
<ShortLabel>GPM06600002310</ShortLabel>
<Protocol>
<ProtocolName>**None**</ProtocolName>
</Protocol>
<mzData version="1.05" accessionNumber="1015">
<cvLookup cvLabel="RESID" fullName="RESID Database of Protein Modifications" version="0.0" address="http://www.ebi.ac.uk/RESID/" />
<cvLookup cvLabel="UNIMOD" fullName="UNIMOD Protein Modifications for Mass Spectrometry" version="0.0" address="http://www.unimod.org/" />
<description>
<admin>
<sampleName>**GPM06600002310**</sampleName>
<sampleDescription comment="Ho, Y., et al., Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002 Jan 10;415(6868):180-3.">
<cvParam cvLabel="NEWT" accession="4932" name="Saccharomyces cerevisiae (Baker's yeast)" value="Saccharomyces cerevisiae" />
</sampleDescription>
</admin>
</description>
<spectrumList count="0" />
</mzData>
</Experiment>
</ExperimentCollection>
I want to extract the text between <Title>, <ProtocolName>, and <sampleName> and write it to a text file (I tried bolding them to make them easier to see). I have the following code so far (based on posts I saw on this site), but it does not seem to work:
>> require 'rubygems'
>> require 'nokogiri'
>> doc = Nokogiri::XML(File.open("PRIDE_Exp_Complete_Ac_10094.xml"))
>> #ExperimentCollection = doc.css("ExperimentCollection Title").map {|node| node.children.text }
Can someone help me?

Try accessing them with XPath expressions. You write the path down through the parse tree, separating the element names with slashes.
puts doc.xpath( "/ExperimentCollection/Experiment/Title" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/Protocol/ProtocolName" ).text
puts doc.xpath( "/ExperimentCollection/Experiment/mzData/description/admin/sampleName" ).text

Related

How to combine two XML files with Nokogiri

I am trying to combine two separate, but related, files with Nokogiri. I want to combine the "product" and "product pricing" if "ItemNumber" is the same.
I loaded the documents, but I have no idea how to combine the two.
Product File:
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ProductTypeId>0</ProductTypeId>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ActiveFlag>Y</ActiveFlag>
<ImageFile>100024.jpg</ImageFile>
<ItemNumber>100024</ItemNumber>
<ProductVariants>
<ProductVariant>
<Sku>100024</Sku>
<ColorName></ColorName>
<SizeName></SizeName>
<SequenceNo>0</SequenceNo>
<BackOrderableFlag>N</BackOrderableFlag>
<InventoryLevel>0</InventoryLevel>
<ColorCode></ColorCode>
<SizeCode></SizeCode>
<TaxableFlag>Y</TaxableFlag>
<VariantPromoGroupCode></VariantPromoGroupCode>
<PricingGroupCode></PricingGroupCode>
<StartDate xsi:nil="true"></StartDate>
<EndDate xsi:nil="true"></EndDate>
<ActiveFlag>Y</ActiveFlag>
</ProductVariant>
</ProductVariants>
</Product>
</Products>
Product Pricing Fields:
<ProductPricing>
<ItemNumber>100024</ItemNumber>
<AcquisitionCost>8.52</AcquisitionCost>
<MemberCost>10.7</MemberCost>
<Price>14.99</Price>
<SalePrice xsi:nil="true"></SalePrice>
<SaleCode>0</SaleCode>
</ProductPricing>
I am looking to generate a file like this:
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ProductTypeId>0</ProductTypeId>
<Description>This single bit curved grip axe handle is made for 3 to 5 pound axes. A good quality replacement handle made of American hickory with a natural wax finish. Hardwood handles do not conduct electricity and American Hickory is known for its strength, elasticity and ability to absorb shock. These handles provide exceptional value and economy for homeowners and other occasional use applications. Each Link handle comes with the required wedges, rivets, or epoxy needed for proper application of the tool head.</Description>
<ActiveFlag>Y</ActiveFlag>
<ImageFile>100024.jpg</ImageFile>
<ItemNumber>100024</ItemNumber>
<ProductVariants>
<ProductVariant>
<Sku>100024</Sku>
<ColorName></ColorName>
<SizeName></SizeName>
<SequenceNo>0</SequenceNo>
<BackOrderableFlag>N</BackOrderableFlag>
<InventoryLevel>0</InventoryLevel>
<ColorCode></ColorCode>
<SizeCode></SizeCode>
<TaxableFlag>Y</TaxableFlag>
<VariantPromoGroupCode></VariantPromoGroupCode>
<PricingGroupCode></PricingGroupCode>
<StartDate xsi:nil="true"></StartDate>
<EndDate xsi:nil="true"></EndDate>
<ActiveFlag>Y</ActiveFlag>
</ProductVariant>
</ProductVariants>
</Product>
<ProductPricing>
<ItemNumber>100024</ItemNumber>
<AcquisitionCost>8.52</AcquisitionCost>
<MemberCost>10.7</MemberCost>
<Price>14.99</Price>
<SalePrice xsi:nil="true"></SalePrice>
<SaleCode>0</SaleCode>
</ProductPricing>
</Products>
Here is the code I have so far:
require 'csv'
require 'nokogiri'
xml = File.read('lateApril-product-pricing.xml')
xml2 = File.read('lateApril-master-date')
doc = Nokogiri::XML(xml)
doc2 = Nokogiri::XML(xml2)
pricing_data = []
item_number = []
doc.xpath('//ProductsPricing/ProductPricing').each do |file|
itemNumber = file.xpath('./ItemNumber').first.text
variant_Price = file.xpath('./Price').first.text
pricing_data << [ itemNumber, variant_Price ]
item_number << [ itemNumber ]
end
puts item_number ## This prints all the item number but i have no idea how to loop through them and combine them with Product XML
doc2.xpath('//Products/Product').each do |file|
itemNumber = file.xpath('./ItemNumber').first.text #not sure how to write the conditions here since i don't have pricing fields available in this method
end
Try this on:
require 'nokogiri'
doc1 = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
</Product>
</Products>
EOT
doc2 = Nokogiri::XML(<<EOT)
<ProductPricing>
<ItemNumber>100024</ItemNumber>
</ProductPricing>
EOT
doc1.at('Product').add_next_sibling(doc2.at('ProductPricing'))
Which results in:
puts doc1.to_xml
# >> <?xml version="1.0"?>
# >> <Products>
# >> <Product>
# >> <Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
# >> </Product><ProductPricing>
# >> <ItemNumber>100024</ItemNumber>
# >> </ProductPricing>
# >> </Products>
Please, when you ask, strip the example input and expected resulting output to the absolute, bare, minimum. Anything beyond that wastes space, eye-time and brain CPU.
This is untested code, but is where I'd start if I was going to merge two files containing multiple <ItemNumber> nodes:
require 'nokogiri'
doc1 = Nokogiri::XML(<<EOT)
<Products>
<Product>
<Name>36-In. Homeowner Bent Single-Bit Axe Handle</Name>
<ItemNumber>100024</ItemNumber>
</Product>
</Products>
EOT
doc2 = Nokogiri::XML(<<EOT)
<ProductPricing>
<ItemNumber>100024</ItemNumber>
</ProductPricing>
EOT
# build a hash containing the item numbers in doc1 for each product
doc1_products_by_item_numbers = doc1.search('Product').map { |product|
item_number = product.at('ItemNumber').text
[
item_number,
product
]
}.to_h
# build a hash containing the item numbers in doc2 for each product pricing
doc2_products_by_item_numbers = doc2.search('ProductPricing').map { |pricing|
item_number = pricing.at('ItemNumber').text
[
item_number,
pricing
]
}.to_h
# append doc2 entries to doc1 after each product based on item numbers
doc1_products_by_item_numbers.each { |item_number, product|
pricing = doc2_products_by_item_numbers[item_number]
# skip products that have no matching pricing entry
product.add_next_sibling(pricing) if pricing
}

Ruby nokogiri attribute selector in XML file

this is the xml file:
<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<ns1:putResponse
xmlns:ns1="urn:DmsManagerClient">
<result xsi:type="xsd:string">
<?xml version="1.0" encoding="ISO-8859-1"?>
<MESSAGE ID="11c73b9e-687c-4300-baba-b743c26f7c83" TYPE="CUSDMS">
<DELIVERY>
<FROM>
<SENDER>0072000</SENDER>
<SERVICE>eService</SERVICE>
<DATE>2019-03-08T12:27:25</DATE>
</FROM>
<TO>
<DEALER DEALERCODE="0072000" MARKETCODE="1000"/>
</TO>
</DELIVERY>
<CONTENT>
<dms:ComplexResponse ErrorCode="430" ErrorDescription="null : PrivacyUE Mancante" Return="false"
xmlns:dms="http://dmsmanagerservice">
<dms:Element Name="DMSVERSION">2.7</dms:Element>
</dms:ComplexResponse>
</CONTENT>
</MESSAGE>
</result>
</ns1:putResponse>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
I am coding in Ruby and used Nokogiri's xpath method to extract the "CONTENT" element of the file.
This is the code:
def extrapolate_error(xml)
doc = Nokogiri::XML(File.open(xml))
doc.xpath('//CONTENT')
end
and this is the result:
[#<Nokogiri::XML::Element:0x1c5ba78 name="CONTENT" children=[
#<Nokogiri::XML::Text:0x1c5b940 "\n">,
#<Nokogiri::XML::Element:0x1c5b8bc name="ComplexResponse" namespace=#<Nokogiri::XML::Namespace:0x1c5b88c prefix="dms" href="http://dmsmanagerservice">
attributes=[
#<Nokogiri::XML::Attr:0x1c5b874 name="ErrorCode" value="430">,
#<Nokogiri::XML::Attr:0x1c5b868 name="ErrorDescription" value="null : PrivacyUE Mancante">,
#<Nokogiri::XML::Attr:0x1c5b85c name="Return" value="false">]
children=[#<Nokogiri::XML::Text:0x1c5b118 "\n">,
#<Nokogiri::XML::Element:0x1c5b094 name="Element" namespace=#<Nokogiri::XML::Namespace:0x1c5b88c prefix="dms" href="http://dmsmanagerservice">
attributes=[#<Nokogiri::XML::Attr:0x1c5b058 name="Name" value="DMSVERSION">]
children=[#<Nokogiri::XML::Text:0x1c5abe4 "2.7">]>,
#<Nokogiri::XML::Text:0x1c5aaac "\n">]>,
#<Nokogiri::XML::Text:0x1c5a974 "\n">]>]
Now I need to drill into it and select some attributes.
Specifically, I need these:
name="ErrorCode" value="430"
name="ErrorDescription" value="null : PrivacyUE Mancante"
I do not know how to proceed. Can you help me?
The following should work for you, assuming the dms namespace is always the same:
doc.xpath('//CONTENT/dms:ComplexResponse', dms: 'http://dmsmanagerservice')
.xpath('@ErrorCode | @ErrorDescription')
.each_with_object({}) do |e,obj|
obj[e.name] = e.text
end
#=> {"ErrorCode"=>"430", "ErrorDescription"=>"null : PrivacyUE Mancante"}
You already understand how you got to //CONTENT, so from there we use dms:ComplexResponse to navigate deeper. Since this element is namespaced, we have to provide the namespace reference, e.g. dms: 'http://dmsmanagerservice'.
Then we select the attributes we are interested in, @ErrorCode and @ErrorDescription.
In XPath the pipe | is the union operator, so the result contains the nodes matched by either expression; here that means both attributes.
Then we are just building a Hash using the name as the key and the text as the value.
XPath Cheatsheet - Useful resource if you need additional reference
Update
You asked about conditionals so this is what I would propose
ndoc = Nokogiri::XML(doc)
namespaces = ndoc.collect_namespaces
response = ndoc.xpath("//CONTENT/dms:ComplexResponse", namespaces)
if response.xpath("self::node()[@ErrorCode != '' and @ErrorDescription != '']").any?
response.xpath("@ErrorCode | @ErrorDescription")
.each_with_object({}) do |e,obj|
obj[e.name] = e.text
end
else
response.xpath('dms:Element/@Name | dms:Element/text()',namespaces)
.each_slice(2)
.map {|s| s.map(&:text)}.to_h
end
This checks whether there is a non-empty ErrorCode and ErrorDescription. If so, it builds the Hash as originally proposed. If not, it returns all the dms:Elements as a Hash, so {"DMSVERSION"=>"2.7"} in this case.

Ruby Nested Hashes - Determining separate events

I have many unique events in XML that I have converted into a large Hash of around 300 keys. The values of most of these keys are Hashes and again, the value of some of those keys are Hashes again. I do not know how deep the hash nesting will go.
I would like to write each Hash of the original 300 and all of its keys & values (no matter how many it may have) to a schema-less database.
I have managed to write a (messy) method that outputs the values of each Hash, no matter how many Hashes its values may contain.
The problem that I am now faced with is that I am unable to determine where one Hash starts, and one Hash ends. Therefore I am unable to write separate events to the database as I am just left with the output of all my Hashes.
How can I determine which are separate events?
Here is my code:
require 'crack'
require 'awesome_print'
def printingOutHash(inputHash)
#ap inputHash
if inputHash.kind_of?(Array)
puts "array"
inputHash.each do |x|
printingOutHash(x)
end
end
if inputHash.kind_of?(Hash)
inputHash.each do |k, v|
if v.kind_of?(Hash)
printingOutHash(v)
else
ap "#{k}: #{v}"
end
end
end
end
h = Crack::XML.parse("<Events><Event><System><Provider Name='Service Control Manager' Guid='{555908d1-a6d7-4695-8e1e-26931d2012f4}' EventSourceName='Service Control Manager'/><EventID Qualifiers='16384'>7036</EventID><Version>0</Version><Level>4</Level><Task>0</Task><Opcode>0</Opcode><Keywords>0x8080000000000000</Keywords><TimeCreated SystemTime='2013-03-25T05:00:38.021800000Z'/><EventRecordID>17629</EventRecordID><Correlation/><Execution ProcessID='476' ThreadID='3028'/><Channel>System</Channel><Computer>AMAZONA-ONIST5V</Computer><Security/></System><EventData><Data Name='param1'>Windows Modules Installer</Data><Data Name='param2'>stopped</Data><Binary>540072007500730074006500640049006E007300740061006C006C00650072002F0031000000</Binary></EventData></Event><Event><System><Provider Name='Service Control Manager' Guid='{555908d1-a6d7-4695-8e1e-26931d2012f4}' EventSourceName='Service Control Manager'/><EventID Qualifiers='16384'>7040</EventID><Version>0</Version><Level>4</Level><Task>0</Task><Opcode>0</Opcode><Keywords>0x8080000000000000</Keywords><TimeCreated SystemTime='2013-03-25T05:00:37.741000000Z'/><EventRecordID>17628</EventRecordID><Correlation/><Execution ProcessID='476' ThreadID='3028'/><Channel>System</Channel><Computer>AMAZONA-ONIST5V</Computer><Security UserID='S-1-5-18'/></System><EventData><Data Name='param1'>Windows Modules Installer</Data><Data Name='param2'>auto start</Data><Data Name='param3'>demand start</Data><Data Name='param4'>TrustedInstaller</Data></EventData></Event></Events>")
printingOutHash(h['Events']['Event'])
Converting XML to hashes is not a good idea when you're dealing with complex or large data files because walking a hash isn't very convenient in comparison to searching the XML. Parsing XML is really simple using the right tools like Nokogiri.
Based on your XML:
require 'nokogiri'
xml = "
<Events>
<Event>
<System>
<Provider Name='Service Control Manager' Guid='{555908d1-a6d7-4695-8e1e-26931d2012f4}' EventSourceName='Service Control Manager'/>
<EventID Qualifiers='16384'>7036</EventID>
<Version>0</Version>
<Level>4</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8080000000000000</Keywords>
<TimeCreated SystemTime='2013-03-25T05:00:38.021800000Z'/>
<EventRecordID>17629</EventRecordID>
<Correlation/>
<Execution ProcessID='476' ThreadID='3028'/>
<Channel>System</Channel>
<Computer>AMAZONA-ONIST5V</Computer>
<Security/>
</System>
<EventData>
<Data Name='param1'>Windows Modules Installer</Data>
<Data Name='param2'>stopped</Data>
<Binary>540072007500730074006500640049006E007300740061006C006C00650072002F0031000000</Binary>
</EventData>
</Event>
<Event>
<System>
<Provider Name='Service Control Manager' Guid='{555908d1-a6d7-4695-8e1e-26931d2012f4}' EventSourceName='Service Control Manager'/>
<EventID Qualifiers='16384'>7040</EventID>
<Version>0</Version>
<Level>4</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8080000000000000</Keywords>
<TimeCreated SystemTime='2013-03-25T05:00:37.741000000Z'/>
<EventRecordID>17628</EventRecordID>
<Correlation/>
<Execution ProcessID='476' ThreadID='3028'/>
<Channel>System</Channel>
<Computer>AMAZONA-ONIST5V</Computer>
<Security UserID='S-1-5-18'/>
</System>
<EventData>
<Data Name='param1'>Windows Modules Installer</Data>
<Data Name='param2'>auto start</Data>
<Data Name='param3'>demand start</Data>
<Data Name='param4'>TrustedInstaller</Data>
</EventData>
</Event>
</Events>"
Here's the start of how I'd grab the data:
doc = Nokogiri::XML(xml)
pp doc.search('Event').map { |event|
system_provider_node = event.at('System Provider')
system = {
provider: {
name: system_provider_node['Name'],
guid: system_provider_node['Guid'],
event_source_name: system_provider_node['EventSourceName']
}
}
event_data = event.search('EventData Data').map{ |data|
{
name: data['Name'],
text: data.text
}
}
{
system: system,
event_data: event_data
}
}
Which results in:
[{:system=>
{:provider=>
{:name=>"Service Control Manager",
:guid=>"{555908d1-a6d7-4695-8e1e-26931d2012f4}",
:event_source_name=>"Service Control Manager"}},
:event_data=>
[{:name=>"param1", :text=>"Windows Modules Installer"},
{:name=>"param2", :text=>"stopped"}]},
{:system=>
{:provider=>
{:name=>"Service Control Manager",
:guid=>"{555908d1-a6d7-4695-8e1e-26931d2012f4}",
:event_source_name=>"Service Control Manager"}},
:event_data=>
[{:name=>"param1", :text=>"Windows Modules Installer"},
{:name=>"param2", :text=>"auto start"},
{:name=>"param3", :text=>"demand start"},
{:name=>"param4", :text=>"TrustedInstaller"}]}]
You don't have to build an array of hashes. For each <Event> you could peel off the elements and build separate rows in various tables in a database. It's really up to what works in your head.

Should Nokogiri::XML.parse be creating separate Text nodes for linefeeds?

I have an XML document created by an outside tool:
<?xml version="1.0" encoding="UTF-8"?>
<suite>
<id>S1</id>
<name>First Suite</name>
<description></description>
<sections>
<section>
<name>section 1</name>
<cases>
<case>
<id>C1</id>
<title>Test 1.1</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
<case>
<id>C2</id>
<title>Test 1.2</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
</cases>
</section>
</sections>
</suite>
From irb, I do the following: (Output suppressed until final command)
> require('nokogiri')
> doc = Nokogiri::XML.parse(open('./test.xml'))
> test_case = doc.search('case').first
=> #<Nokogiri::XML::Element:0x3ff75851bc44 name="case" children=[#<Nokogiri::XML::Text:0x3ff75851b8fc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b7bc name="id" children=[#<Nokogiri::XML::Text:0x3ff75851b474 "C1">]>, #<Nokogiri::XML::Text:0x3ff75851b1cc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b078 name="title" children=[#<Nokogiri::XML::Text:0x3ff75851ad58 "Test 1.1">]>, #<Nokogiri::XML::Text:0x3ff75851aa9c "\n ">, #<Nokogiri::XML::Element:0x3ff75851a970 name="type" children=[#<Nokogiri::XML::Text:0x3ff75851a6c8 "Other">]>, #<Nokogiri::XML::Text:0x3ff7585191d8 "\n ">, #<Nokogiri::XML::Element:0x3ff7585190d4 name="priority" children=[#<Nokogiri::XML::Text:0x3ff758518d64 "4 - Must Test">]>, #<Nokogiri::XML::Text:0x3ff758518ad0 "\n ">, #<Nokogiri::XML::Element:0x3ff7585189a4 name="estimate">, #<Nokogiri::XML::Text:0x3ff758518670 "\n ">, #<Nokogiri::XML::Element:0x3ff758518558 name="milestone">, #<Nokogiri::XML::Text:0x3ff7585182b0 "\n ">, #<Nokogiri::XML::Element:0x3ff758518184 name="references">, #<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">]>
This results in a number of children that look like the following:
#<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">
I want to iterate through these XML nodes without having to do something like:
> real_nodes = test_case.children.reject{|n| n.node_name == 'text' && n.content.strip.empty?}
I couldn't find a parse parameter in the Nokogiri docs to suppress the treating of newlines as separate nodes. Is there a way to do this during the parse instead of after?
Check the documentation. You can just do this:
doc = Nokogiri::XML.parse(open('./test.xml')) do |config|
config.noblanks
end
That will load the file without any empty nodes.
The text nodes are the result of pretty-printing the XML. The spec doesn't require whitespace between tags, and, for efficiency, a huge XML file could be stripped of inter-tag whitespace to save space and reduce transfer time, without sacrificing the data content.
This might show what's happening:
require 'nokogiri'
xml = '<foo></foo>'
Nokogiri::XML(xml).at('foo').child
=> nil
With no whitespace between the tags there's no text node either.
xml = '<foo>
</foo>'
Nokogiri::XML(xml).at('foo').child
=> #<Nokogiri::XML::Text:0x3fcee9436ff0 "\n">
Nokogiri::XML(xml).at('foo').child.class
=> Nokogiri::XML::Text
With whitespace for pretty-printing, the XML has a text node following the foo tag.

trying to get content inside cdata tags in xml file using nokogiri

I have seen several things on this, but nothing has seemed to work so far. I am parsing an xml via a url using nokogiri on rails 3 ruby 1.9.2.
A snippet of the xml looks like this:
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
I am trying to parse this out to get the text associated with the NewsLineText
r = node.at_xpath('.//newslinetext') if node.at_xpath('.//newslinetext')
s = node.at_xpath('.//newslinetext').text if node.at_xpath('.//newslinetext')
t = node.at_xpath('.//newslinetext').content if node.at_xpath('.//newslinetext')
puts r
puts s.blank? ? 'NOTHING' : s
puts t.blank? ? 'NOTHING' : t
What I get in return is
<newslinetext></newslinetext>
NOTHING
NOTHING
So I know my tags are named/spelled correctly to get at the newslinetext data, but the cdata text never shows up.
What do I need to do with nokogiri to get this text?
You're trying to parse XML using Nokogiri's HTML parser. If node came from the XML parser, then r would be nil, since XML is case-sensitive; your r is not nil, so you're using the HTML parser, which is case-insensitive.
Use Nokogiri's XML parser and you will get things like this:
>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n \n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n \n"
and you'll be able to get at the CDATA through r.text or r.children.
Ah I see. What @mu said is correct. But to get at the cdata directly, maybe:
xml =<<EOF
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
EOF
node = Nokogiri::XML xml
cdata = node.search('NewsLineText').children.find{|e| e.cdata?}

Resources