trying to get content inside cdata tags in xml file using nokogiri - ruby

I have seen several things on this, but nothing has seemed to work so far. I am parsing an xml via a url using nokogiri on rails 3 ruby 1.9.2.
A snippet of the xml looks like this:
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
I am trying to parse this out to get the text associated with the NewsLineText
r = node.at_xpath('.//newslinetext') if node.at_xpath('.//newslinetext')
s = node.at_xpath('.//newslinetext').text if node.at_xpath('.//newslinetext')
t = node.at_xpath('.//newslinetext').content if node.at_xpath('.//newslinetext')
puts r
puts s ? if s.blank? 'NOTHING' : s
puts t ? if t.blank? 'NOTHING' : t
What I get in return is
<newslinetext></newslinetext>
NOTHING
NOTHING
So I know my tags are named/spelled correctly to get at the newslinetext data, but the cdata text never shows up.
What do I need to do with nokogiri to get this text?

You're trying to parse XML using Nokogiri's HMTL parser. If node as from the XML parser then r would be nil since XML is case sensitive; your r is not nil so you're using the HTML parser which is case insensitive.
Use Nokogiri's XML parser and you will get things like this:
>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n \n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n \n"
and you'll be able to get at the CDATA through r.text or r.children.

Ah I see. What #mu said is correct. But to get at the cdata directly, maybe:
xml =<<EOF
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
EOF
node = Nokogiri::XML xml
cdata = node.search('NewsLineText').children.find{|e| e.cdata?}

Related

Unable to get nokogiri xmlelement text value

I have the following XML content.
<info>
<meta name="alias">alias1</meta>
<meta name="score">.60</meta>
</info>
<info>
<meta name="alias">alias2</meta>
<meta name="score">.50</meta>
</info>
I need to get back for each value, but having difficulty doing so.
doc.xpath("//info").each do |info_entry|
info_entry.xpath("meta").each do |meta_entry|
if meta_entry['name'] == 'alias'
the_alias = meta_entry.xpath('text()').text
elsif meta_entry['name'] == 'score'
score = meta_entry.xpath('text()').text
end
// add struct containing alias and score to list
end
end
However, I'm not geting anything from text. I've tried many different things: inner_text, inner_html, content, value, nothing works. I've tried meta_entry.at, meta_entry.search, and so on.
Is there something I'm missing? Any advice would be appreciated.
You need to get rid of xpath('text()'). And you can get rid of the conditional and build a workable data structure as you go, like this:
def meta_contents(doc)
doc.xpath("//info").map do |info_entry|
info_entry.xpath("meta").map do |meta_entry|
[meta_entry["name"], meta_entry.text]
end.to_h
end
end
>> meta_contents(doc)
#> [{"alias"=>"alias1", "score"=>".60"}, {"alias"=>"alias2", "score"=>".50"}]

How to filter XML elements by date range in Ruby

I typically use Nokogiri as my XML parser.
I have the following XML:
<albums>
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<classix_nouveaux album="Night People"/>
<release_date value="19820501"/>
</classix_nouveaux>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
</albums>
I want to get all albums that were released between 1/1/1980 and 4/15/1982:
<aldo_nova album="aldo nova">
<release_date value="19820401"/>
</aldo_nova>
<engligh_beat album="I Just Can't Stop It"/>
<release_date value="19800501"/>
</engligh_beat>
How do I filter/query the XML by a release_date range?
Your XML is malformed. After parsing, here's what Nokogiri has to say about it:
doc.errors
# => [#<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: albums line 1 and classix_nouveaux>,
# #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>]
That's because:
<classix_nouveaux album="Night People"/>
and
<engligh_beat album="I Just Can't Stop It"/>
are terminated. Instead they should be:
<classix_nouveaux album="Night People">
and
<engligh_beat album="I Just Can't Stop It">
You can use CSS or XPath selectors to find exact matches, or even sub-string matches, but neither CSS or XPath understand "ranges" of dates, nor do they have an idea of what a Date is, so you'd have to extract all nodes, convert the date value into a Date object or integer in this case, then compare to the range:
date_range = 19800501..19820401
selected_albums = doc.search('//release_date').select { |rd| date_range.include?(rd['value'].to_i) }.map { |rd| rd.parent }
selected_albums.map(&:to_xml)
# => ["<aldo_nova album=\"aldo nova\">\n" +
# " <release_date value=\"19820401\"/>\n" +
# "</aldo_nova>",
# "<engligh_beat album=\"I Just Can't Stop It\">\n" +
# " <release_date value=\"19800501\"/>\n" +
# "</engligh_beat>"]
I think your XML is poorly designed because you have varying tag names for what should be an album. <album> should be a child of <albums>. I'd recommend something like this:
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
Once the XML is in a standard form, then it becomes easier to navigate and search:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<collection>
<albums>
<album band="aldo nova" title="aldo nova" release_date="19820401"/>
<album band="classix nouveaux" title="Night People" release_date="19820501"/>
<album band="english beat" title="I Just Can't Stop It" release_date="19800501"/>
</albums>
</collection>
EOT
doc.search('album').last['title'] # => "I Just Can't Stop It"
band = 'aldo nova'
doc.search("//album[#band='#{band}']").map { |a| a['title'] } # => ["aldo nova"]
and searching for dates becomes more straightforward because it's not necessary to find the parent of the node:
date_range = 19800501..19820401
selected_albums = doc.search('album').select { |a| date_range.include?(a['release_date'].to_i) }
selected_albums.map(&:to_xml)
# => ["<album band=\"aldo nova\" title=\"aldo nova\" release_date=\"19820401\"/>",
# "<album band=\"english beat\" title=\"I Just Can't Stop It\" release_date=\"19800501\"/>"]
I'd recommend reading some tutorials on XML itself as it's easy to paint ourselves into corners if the data isn't represented logically and correctly.

How to pull data from tags based on other tags

I have the following example document:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>Doe</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<CorrectedInd>0</CorrectedInd>
<irs:TaxYr>2015</irs:TaxYr>
<EmployeeInfoGrp>
<OtherCompletePersonName>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonMiddleNm>B</PersonMiddleNm>
<PersonLastNm>DOE</PersonLastNm>
</OtherCompletePersonName>
<PersonNameControlTxt/>
<irs:TINRequestTypeCd>INDIVIDUAL_TIN</irs:TINRequestTypeCd>
<irs:SSN>222222222</irs:SSN>
</EmployeeInfoGrp>
</Form1095CUpstreamDetail>
</n1:Form109495CTransmittalUpstream>
Using Nokogiri I want to extract the value between the <PersonFirstNm>, <PersonLastNm> and <irs:SSN> for each <Form1095CUpstreamDetail> based on the <RecordId>.
I tried removing namespaces as well. I posted a small snippet, but I have tried many iterations of working through the XML with no success. This is my first time using XML, so I realize I am likely missing something easy.
When I set my XPath:
require 'nokogiri'
submission_doc = Nokogiri::XML(open('1094C_Request.xml'))
submissions = submission_doc.remove_namespaces
nodes = submission.xpath('//Form1095CUpstreamDetail')
I do not seem to have any association between the RecordId and the tags mentioned above, and I am stuck on where to go next.
The fields are not listed as children for the RecordId, so I can't think of how to approach obtaining their values. I am including the full document as an example to make sure I am not excluding anything.
I have an array of values, and I would like to pull the three tags mentioned above if the RecordId is contained within the array of numbers.
Nokogiri makes it pretty easy to do what you want (assuming the XML is syntactically correct). I'd do something like:
require 'nokogiri'
require 'pp'
doc = Nokogiri::XML(<<EOT)
<n1:Form109495CTransmittalUpstream xmlns="urn:us:gov:treasury:irs:ext:aca:air:7.0" xmlns:irs="urn:us:gov:treasury:irs:common" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage IRS-Form1094-1095CTransmitterUpstreamMessage.xsd" xmlns:n1="urn:us:gov:treasury:irs:msg:form1094-1095Ctransmitterupstreammessage">
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>1</RecordId>
<PersonFirstNm>JOHN</PersonFirstNm>
<PersonLastNm>Doe</PersonLastNm>
<irs:SSN>123456790</irs:SSN>
</Form1095CUpstreamDetail>
<Form1095CUpstreamDetail RecordType="String" lineNum="1">
<RecordId>2</RecordId>
<PersonFirstNm>JANE</PersonFirstNm>
<PersonLastNm>DOE</PersonLastNm>
<irs:SSN>222222222</irs:SSN>
</Form1095CUpstreamDetail>
</Form109495CTransmittalUpstream>
EOT
info = doc.search('Form1095CUpstreamDetail').map{ |form|
{
record_id: form.at('RecordId').text,
person_first_nm: form.at('PersonFirstNm').text,
person_last_nm: form.at('PersonLastNm').text,
ssn: form.at('irs|SSN').text
}
}
pp info
# >> [{:record_id=>"1",
# >> :person_first_nm=>"JOHN",
# >> :person_last_nm=>"Doe",
# >> :ssn=>"123456790"},
# >> {:record_id=>"2",
# >> :person_first_nm=>"JANE",
# >> :person_last_nm=>"DOE",
# >> :ssn=>"222222222"}]
While it's possible to do this with XPath, Nokogiri's implementation of CSS selectors tends to result in more easily read selectors, which translates to easier to maintain, which is a very good thing.
You'll see the use of | in 'irs|SSN' which is Nokogiri's way of defining a namespace for CSS. This is documented in "Namespaces".
First of all the xml validator reports error
The default (no prefix) Namespace URI for XPath queries is always '' and it cannot be redefined to 'urn:us:gov:treasury:irs:ext:aca:air:7.0'.
so you must set this default xmlns to "".
You can use this code.
require 'nokogiri'
doc = Nokogiri::XML(open('1094C_Request.xml'))
doc.namespaces['xmlns'] = ''
details = doc.xpath("//:Form1095CUpstreamDetail")
elem_a = ["PersonFirstNm", "PersonLastNm", "irs:SSN"]
output = details.each_with_object({}) do |element, exp|
exp[element.xpath("./:RecordId").text] = elem_a.each_with_object({}) do |elem_n, exp_h|
exp_h[elem_n] = element.xpath(".//#{elem_n.include?(':') ? elem_n : ":#{elem_n}"}").text
end
end
output
p output
# {
# "1" => {"PersonFirstNm" => "JOHN", "PersonLastNm" => "Doe", "irs:SSN" => "123456790"},
# "2" => {"PersonFirstNm" => "JANE", "PersonLastNm" => "DOE", "irs:SSN" => "222222222"}
# }
I hope this helps

How do I parse Google image URLs using Ruby and Nokogiri?

I'm trying to make an array of all the image files on a Google images webpage.
I want a regular expression to pull everything after "imagurl=" and ending before "&amp" as seen in this HTML:
<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I can do this with a regex, but I can't find a way to search my parsed document using regex, but I'm not finding any solutions.
str = '<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is that what you're looking for?
The problem with using a regex is you assume too much knowledge about the order of parameters in the URL. If the order changes, or & disappears the regex won't work.
Instead, parse the URL, then split the values out:
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'
doc = Nokogiri::HTML.parse('<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')
img_url = doc.search('a').each do |a|
query_params = CGI::parse(URI(a['href']).query)
puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
Both URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
That will return:
{"imgurl"=>
"http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg",
"imgrefurl"=>
"http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html",
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ",
"h"=>"400",
"w"=>"400",
"sz"=>"58",
"hl"=>"en",
"start"=>"19",
"zoom"=>"1",
"tbnid"=>"ajDcsGGs0tgE9M:",
"tbnh"=>"124",
"tbnw"=>"124",
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ",
"itbs"=>"1",
"sa"=>"X",
"ved"=>"0CE4QrQMwEg"}
To get all the img urls you want do
# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML( open(url) ).css('a')
# get regex match or nil on desired img
img_urls = links.map {|a| a['href'][/imgurl=(.*?)&/, 1] }
# get rid of nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &, otherwise the greedy .* would take everything to the last & in the string.

Parsing an XML feed using Nokogiri isn't working

This is my code:
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.blank?
search.each do |data|
title=data.css("title").text
link=data.css("link").text
end
end
but I did not get the link.
Several things are wrong:
if !search.blank?
won't work because search would be a NodeSet returned by doc.css. NodeSet's don't have a blank? method. Perhaps you meant empty??
title=data.css("title").text
isn't the correct way to find the title because, like in the above problem, you're getting a NodeSet instead of a Node. Getting text from a NodeSet can return a lot of garbage you don't want. Instead do:
title=data.at("title").text
Changing the code to this:
require 'nokogiri'
require 'open-uri'
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.empty?
search.each do |data|
title=data.at("title").text
link=data.at("link").text
puts "title: #{ title } link: #{ link }"
end
end
Outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link:
title: Freedom Center Offering Free Admission Monday link:
title: Miami University Band Performing in the Inaugural Parade link:
title: Northern Kentucky Man To Present Colors At Inauguration link:
title: John Gumms Monday Forecast link:
title: President Obama VP Biden sworn in officially begin second terms link:
title: Colerain Township Pizza Hut Robbed Saturday Night link:
title: Cold Snap Coming to Tri-State link:
title: 2 Men Arrested After Police Chase in Northern Kentucky link:
The link won't work because the XML is malformed, which, in my experience, is unbelievably common on the internet because people don't take the time to check their work.
The fix is going to take massaging the XML prior to Nokogiri receiving the content, or to modify your accessors. Luckily, this particular XML is easy to work around so this should help:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
search.each do |data|
title = data.at("title").text
link = data.at("link").next_sibling.text
puts "title: #{ title } link: #{ link }"
end
end
Which outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link: http://www.cincinnatisun.com/index.php/sid/212072454/scat/90d24f4ad98a2793
title: Freedom Center Offering Free Admission Monday link: http://www.cincinnatisun.com/index.php/sid/212072914/scat/90d24f4ad98a2793
title: Miami University Band Performing in the Inaugural Parade link: http://www.cincinnatisun.com/index.php/sid/212072915/scat/90d24f4ad98a2793
title: Northern Kentucky Man To Present Colors At Inauguration link: http://www.cincinnatisun.com/index.php/sid/212072913/scat/90d24f4ad98a2793
title: John Gumms Monday Forecast link: http://www.cincinnatisun.com/index.php/sid/212070535/scat/90d24f4ad98a2793
title: President Obama VP Biden sworn in officially begin second terms link: http://www.cincinnatisun.com/index.php/sid/212060033/scat/90d24f4ad98a2793
title: Colerain Township Pizza Hut Robbed Saturday Night link: http://www.cincinnatisun.com/index.php/sid/212057132/scat/90d24f4ad98a2793
title: Cold Snap Coming to Tri-State link: http://www.cincinnatisun.com/index.php/sid/212057131/scat/90d24f4ad98a2793
title: 2 Men Arrested After Police Chase in Northern Kentucky link: http://www.cincinnatisun.com/index.php/sid/212057130/scat/90d24f4ad98a2793
All that done, you can write your code more clearly like:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
doc.css('item').each do |data|
title = data.at("title").text
link = data.at("link").next_sibling.text
puts "title: #{ title } link: #{ link }"
end
Interestingly enough, now the sample page appears to have its links fixed.
According to http://nokogiri.org/tutorials/searching_a_xml_html_document.html something like:
#doc = Nokogiri::XML(File.read("feed.xml"))
#doc.xpath('//xmlns:link')
should do the job. But be aware, that your provided xml snippet isn't a valid xml feed at all (no root element, item tag not opened - only closed etc.). The code assumes the xml feed looks i.e.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<item>
<title>Atom-Powered Robots Run Amok</title>
<link>http://example.org/2003/12/13/atom03</link>
</item>
</feed>
And extracts:
<link>http://example.org/2003/12/13/atom03</link>
as result. Please try to to look at the documentation/reference first, if you have problems like this. If you tried something and it didn't worked like you would expect, than you can consult stackoverflow with actual code - that makes it easier to understand your problem & provide help.

Resources