Parsing an XML feed using Nokogiri isn't working - ruby

This is my code:
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.blank?
search.each do |data|
title=data.css("title").text
link=data.css("link").text
end
end
but I did not get the link.

Several things are wrong:
if !search.blank?
won't work outside of Rails because search is a NodeSet returned by doc.css, and a NodeSet has no blank? method unless ActiveSupport is loaded. Perhaps you meant empty??
title=data.css("title").text
isn't the correct way to find the title because, as in the problem above, css returns a NodeSet instead of a Node. Calling text on a NodeSet concatenates the text of every matching node, which can return a lot of garbage you don't want. Instead do:
title=data.at("title").text
Changing the code to this:
require 'nokogiri'
require 'open-uri'
doc= Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search=doc.css('item')
if !search.empty?
search.each do |data|
title=data.at("title").text
link=data.at("link").text
puts "title: #{ title } link: #{ link }"
end
end
Outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link:
title: Freedom Center Offering Free Admission Monday link:
title: Miami University Band Performing in the Inaugural Parade link:
title: Northern Kentucky Man To Present Colors At Inauguration link:
title: John Gumms Monday Forecast link:
title: President Obama VP Biden sworn in officially begin second terms link:
title: Colerain Township Pizza Hut Robbed Saturday Night link:
title: Cold Snap Coming to Tri-State link:
title: 2 Men Arrested After Police Chase in Northern Kentucky link:
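The link comes back empty because of how the document is parsed: Nokogiri::HTML treats link as an empty HTML element, so the URL ends up in the text node that follows it rather than inside the element. Malformed XML, which in my experience is unbelievably common on the internet because people don't take the time to check their work, can cause similar symptoms.
The fix is either to massage the XML before Nokogiri receives the content, or to modify your accessors. Luckily, this particular case is easy to work around, so this should help: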
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
search = doc.css('item')
if !search.empty?
search.each do |data|
title = data.at("title").text
link = data.at("link").next_sibling.text
puts "title: #{ title } link: #{ link }"
end
end
Which outputs:
title: Ex-Bengals cheerleaders lawsuit trial to begin link: http://www.cincinnatisun.com/index.php/sid/212072454/scat/90d24f4ad98a2793
title: Freedom Center Offering Free Admission Monday link: http://www.cincinnatisun.com/index.php/sid/212072914/scat/90d24f4ad98a2793
title: Miami University Band Performing in the Inaugural Parade link: http://www.cincinnatisun.com/index.php/sid/212072915/scat/90d24f4ad98a2793
title: Northern Kentucky Man To Present Colors At Inauguration link: http://www.cincinnatisun.com/index.php/sid/212072913/scat/90d24f4ad98a2793
title: John Gumms Monday Forecast link: http://www.cincinnatisun.com/index.php/sid/212070535/scat/90d24f4ad98a2793
title: President Obama VP Biden sworn in officially begin second terms link: http://www.cincinnatisun.com/index.php/sid/212060033/scat/90d24f4ad98a2793
title: Colerain Township Pizza Hut Robbed Saturday Night link: http://www.cincinnatisun.com/index.php/sid/212057132/scat/90d24f4ad98a2793
title: Cold Snap Coming to Tri-State link: http://www.cincinnatisun.com/index.php/sid/212057131/scat/90d24f4ad98a2793
title: 2 Men Arrested After Police Chase in Northern Kentucky link: http://www.cincinnatisun.com/index.php/sid/212057130/scat/90d24f4ad98a2793
All that done, you can write your code more clearly like:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.cincinnatisun.com/index.php?rss/90d24f4ad98a2793", 'User-Agent' => 'ruby'))
doc.css('item').each do |data|
title = data.at("title").text
link = data.at("link").next_sibling.text
puts "title: #{ title } link: #{ link }"
end
Interestingly enough, now the sample page appears to have its links fixed.

According to http://nokogiri.org/tutorials/searching_a_xml_html_document.html something like:
doc = Nokogiri::XML(File.read("feed.xml"))
doc.xpath('//xmlns:link')
should do the job. Be aware, though, that the XML snippet you provided isn't a valid XML feed at all (no root element, an item tag that is closed but never opened, etc.). The code assumes the feed looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<item>
<title>Atom-Powered Robots Run Amok</title>
<link>http://example.org/2003/12/13/atom03</link>
</item>
</feed>
And extracts:
<link>http://example.org/2003/12/13/atom03</link>
as the result. Please try to look at the documentation/reference first if you have problems like this. If you tried something and it didn't work as you expected, then you can consult Stack Overflow with actual code; that makes it easier to understand your problem and provide help.

Related

Parsing XML into CSV using Nokogiri

I am trying to figure out how to get Make and Model out of XML returned from a URL and put them into a CSV. Here is the XML returned from the URL:
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
<Vehicles>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
</Vehicles>
<Errors/>
<InvalidVINMsg/>
</VINResult>
Here is the code I have so far:
require 'csv'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
vincarriercsv = 'vincarrier.csv'
vindetails = 'vindetails.csv'
vinurl = 'http://redacted/LookUp_VIN?key=redacted&vin='
CSV.open(vindetails, "wb") do |details|
CSV.foreach(vincarriercsv) do |row|
vinxml = Nokogiri::HTML(vinurl + row[1])
make = vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text
model = vinxml.xpath('//VINResult//Vehicles//Vehicle//Model').text
details << [ row[0], row[1], make, model ]
end
end
For some reason the URL returns the same data twice, but I only need the first result. So far my attempts to grab the Make and Model from the XML have failed. Any ideas?
Here's how to get at the make and model data. How to convert it to CSV is left to you:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
<Vehicles>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
<Vehicle>
<ID>131497</ID>
<Product>TRUCK</Product>
<Year>1993</Year>
<Make>Freightliner</Make>
<Model>FLD12064T</Model>
<Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
</Vehicle>
</Vehicles>
<Errors/>
<InvalidVINMsg/>
</VINResult>
EOT
vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
[
'make', vehicle.at('Make').content,
'model', vehicle.at('Model').content
]
}
This results in:
vehicle_make_and_models # => [["make", "Freightliner", "model", "FLD12064T"], ["make", "Freightliner", "model", "FLD12064T"]]
If you don't want the field names:
vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
[
vehicle.at('Make').content,
vehicle.at('Model').content
]
}
vehicle_make_and_models # => [["Freightliner", "FLD12064T"], ["Freightliner", "FLD12064T"]]
Note: You have XML, not HTML. Don't assume that Nokogiri treats them the same, or that the difference is insignificant. Nokogiri parses XML strictly, since XML is a strict standard.
I use CSS selectors unless I absolutely have to use XPath. CSS results in a much clearer selector most of the time, which results in easier to read code.
vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text doesn't work for a few reasons. Nokogiri::HTML(vinurl + row[1]) parses the URL string itself rather than the document at that URL, and the HTML parser lowercases element names, so VINResult would never match. Also, // means "search all descendants", so repeating it at every step makes Nokogiri search whole subtrees over and over; a leading // starts that search at the top of the document. xpath returns all matching nodes as a NodeSet, not just a particular Node, and text returns the text of all Nodes in the NodeSet as one concatenated string, which is probably not what you want.
I prefer to use search instead of xpath or css. It returns a NodeSet like the other two, but it accepts either CSS or XPath selectors. If a particular selector is ambiguous and could be interpreted as either CSS or XPath, use the explicit form. Likewise, you can use at, at_xpath, or at_css to find just the first matching node, which is equivalent to search('foo').first.
You could also do the following, which places all of the vehicles in an Array and each vehicle's fields in a Hash:
require 'nokogiri'
doc = Nokogiri::XML(open(YOUR_XML_FILE))
vehicles = doc.search("Vehicle").map do |vehicle|
Hash[
vehicle.children.map do |child|
[child.name, child.text] unless child.text.chomp.strip == ""
end.compact
]
end
#=>[{"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes Power Steering 6x4 (SBA - Set Back Axle)"}, {"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes Power Steering 6x4 (SBA - Set Back Axle)"}]
Then you can access all the attributes for an individual vehicle, e.g.:
vehicles.first["ID"]
#=> "131497"
vehicles.first["Year"]
#=> "1993"
etc.

Ruby - URL to Markdown

TOTAL rookie here.
I'm working on customizing a script made by Brett Terpstra - http://brettterpstra.com/2013/11/01/save-pocket-favorites-to-nvalt-with-ifttt-and-hazel/
Mine is a different use: I'd like to save my pinboard bookmarks with a specific tag to a file in dropbox in Markdown.
I feed it a text file such as:
Title: Yesterday is over.
URL: http://www.jonacuff.com/blog/want-to-change-the-world-get-doing/
Tags: 2md, 2wcx, 2pdf
Date: June 20, 2013 at 06:20PM
Image: notused
Excerpt: You can't start the next chapter of your life if you keep re-reading the last one.
And it outputs the markdown file.
Everything works great except when the 'excerpt' (see above) is more than one line. Sometimes it's a couple of paragraphs. When that happens, it stops working. When I hit enter from the command line, it's still waiting for more input.
Here's an example of a file that it doesn't work on:
Title: Talking ’bout my Generation.
URL: http://blog.greglaurie.com/?p=8881
Tags: 2md, 2wcx, 2pdf
Date: June 28, 2013 at 09:46PM
Image: notused
Excerpt: Contrast two men from the 19th century: Max Jukes and Jonathan Edwards.
Max Jukes lived in New York. He did not believe in Christ or in raising his children in the way of the Lord. He refused to take his children to church, even when they asked to go. Of his 1,026 descendants:
•300 were sent to prison for an average term of 13 years
•190 were prostitutes
•680 were admitted alcoholics
His family, thus far, has cost the state in excess of $420,000 and has made no contribution to society.
Jonathan Edwards also lived in New York, at the same time as Jukes. He was known to have studied 13 hours a day and, in spite of his busy schedule of writing, teaching, and pastoring, he made it a habit to come home and spend an hour each day with his children. He also saw to it that his children were in church every Sunday. Of his 929 descendants:
•430 were ministers
•86 became university professors
•13 became university presidents
•75 authored good books
•7 were elected to the United States Congress
•1 was Vice President of the United States
Edwards’ family never cost the state one cent.
We tend to think that our decisions only affect ourselves, but they have ramifications for generations to come.
Here's a screenshot of what it looks like after I run the command: https://www.dropbox.com/s/i9zg483k7nkdp6f/Screenshot%202013-11-22%2016.39.17.png
I'm hoping it's something easy. Any ideas?
#!/usr/bin/env ruby
# Works with IFTTT recipe https://ifttt.com/recipes/125999
#
# Set Hazel to watch the folder you specify in the recipe.
# Make sure nvALT is set to store its notes as individual files.
# Edit the $target_folder variable below to point to your nvALT
# notes folder.
require 'date'
require 'open-uri'
require 'net/http'
require 'fileutils'
require 'cgi'
$target_folder = "~/Dropbox/messx/urls2md"
def url_to_markdown(url)
res = Net::HTTP.post_form(URI.parse("http://heckyesmarkdown.com/go/"),{'u'=>url,'read'=>'1'})
if res.code.to_i == 200
res.body
else
false
end
end
file = ARGV[0]
begin
input = IO.read(file).force_encoding('utf-8')
headers = {}
input.each_line {|line|
key, value = line.split(/: /)
headers[key] = value.strip || ""
}
outfile = File.join(File.expand_path($target_folder), headers['Title'].gsub(/["!*?'|]/,'') + ".txt")
date = Time.now.strftime("%Y-%m-%d %H:%M")
date_added = Date.parse(headers['Date']).strftime("%Y-%m-%d %H:%M")
content = "Title: #{headers['Title']}\nDate: #{date}\nDate Added: #{date_added}\nSource: #{headers['URL']}\n"
tags = false
if headers['Tags'].length > 0
tag_arr = headers['Tags'].split(", ")
tag_arr.map! {|tag|
%Q{"#{tag.strip}"}
}
tags = tag_arr.join(" ")
content += "Keywords: #{tags}\n"
end
markdown = url_to_markdown(headers['URL']).force_encoding('utf-8')
if markdown
content += headers['Image'].length > 0 ? "\n\n> #{headers['Excerpt']}\n\n---#{markdown}\n" : "\n\n"+markdown
else
content += headers['Image'].length > 0 ? "\n\n![](#{headers['Image']})\n\n#{headers['Excerpt']}\n" : "\n\n"+headers['Excerpt']
end
File.open(outfile,'w') {|f|
f.puts content
}
if tags && File.exist?("/usr/local/bin/openmeta")
%x{/usr/local/bin/openmeta -a #{tags} -p "#{outfile}"}
end
# FileUtils.rm(file)
rescue Exception => e
puts e
end
How about this? Modify your input.each_line area accordingly:
headers = {}
key = nil
input.each_line do |line|
match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
if match
key = match[:key].strip
headers[key] = match[:value].strip
else
headers[key] += line
end
end
First, splitting on just ": " is dangerous since that sequence can also appear in the content. Instead, a regex such as /^\w+:.*/ (simplified from the code above) matches "Word: Content" lines. Since the lines after "Excerpt:" aren't prefixed, you need to hang on to the last seen key and append to it whenever a line has no key of its own. You may need to add a newline in there, depending on what you're doing with that header information, but it seems to work.
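To make the idea above easier to try in isolation, here is a small self-contained sketch. The method name parse_headers and the shortened sample text are made up for illustration; the regex is the one from the answer. Continuation lines are joined with a newline, which is one possible choice, not the only one:

```ruby
# Hypothetical helper wrapping the loop from the answer above.
def parse_headers(input)
  headers = {}
  key = nil
  input.each_line do |line|
    match = /^(?<key>\w+)\s*:\s*(?<value>.*)/.match(line)
    if match
      # A "Key: value" line starts a new header.
      key = match[:key]
      headers[key] = match[:value].strip
    elsif key
      # Any other line continues the most recent header.
      headers[key] << "\n" << line.strip
    end
  end
  headers
end

# Shortened stand-in for the failing input file.
sample = <<~TEXT
  Title: Talking about my Generation.
  URL: http://blog.greglaurie.com/?p=8881
  Tags: 2md, 2wcx, 2pdf
  Excerpt: Contrast two men from the 19th century.
  Max Jukes lived in New York.
  Jonathan Edwards also lived in New York.
TEXT

headers = parse_headers(sample)
puts headers['Excerpt']
```

One caveat: a continuation line that happens to start with a single word followed by a colon would be taken for a new header. If that matters, restricting the match to the known keys (Title, URL, Tags, Date, Image, Excerpt) would be a safer variant.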

How to parse a more complicated JSON object in Ruby on Sinatra

I'm a Java guy, new to Ruby. I've been playing with it just to see what it can do, and I'm running into an issue that I can't solve.
I decided to try out Sinatra, again, just to see what it can do, and decided to play with the ESPN API and see if I can pull the venue of a team via the API.
I'm able to make the call and get the data back, but I am having trouble parsing it:
{"sports"=>[{"name"=>"baseball", "id"=>1, "uid"=>"s:1", "leagues"=>[{"name"=>"Major League Baseball", "abbreviation"=>"mlb", "id"=>10, "uid"=>"s:1~l:10", "groupId"=>9, "shortName"=>"MLB", "teams"=>[{"id"=>17, "uid"=>"s:1~l:10~t:17", "location"=>"Cincinnati", "name"=>"Reds", "abbreviation"=>"CIN", "color"=>"D60042", "venues"=>[{"id"=>83, "name"=>"Great American Ball Park", "city"=>"Cincinnati", "state"=>"Ohio", "country"=>"", "capacity"=>0}], "links"=>{"api"=>{"teams"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17"}, "news"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news"}, "notes"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes"}}, "web"=>{"teams"=>{"href"=>"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public"}}, "mobile"=>{"teams"=>{"href"=>"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public"}}}}]}]}], "resultsOffset"=>0, "resultsLimit"=>50, "resultsCount"=>1, "timestamp"=>"2013-08-04T14:47:13Z", "status"=>"success"}
I want to pull the venues part of the object, specifically the name value. Every time I try to parse it I end up getting an error along the lines of "cannot change from nil to string" and then also I've gotten an integer to string error.
Here's what I have so far:
get '/venue/:team' do
id = ids[params[:team]]
url = 'http://api.espn.com/v1/sports/baseball/mlb/teams/' + id + '?enable=venues&apikey=' + $key
resp = Net::HTTP.get_response(URI.parse(url))
data = resp.body
parsed = JSON.parse(resp.body)
#venueData = parsed["sports"]
"Looking for the venue of the #{params[:team]}, which has id " + id + ", and here's the data returned: " + venueData.to_s
end
When I do parsed["sports"] I get:
[{"name"=>"baseball", "id"=>1, "uid"=>"s:1", "leagues"=>[{"name"=>"Major League Baseball", "abbreviation"=>"mlb", "id"=>10, "uid"=>"s:1~l:10", "groupId"=>9, "shortName"=>"MLB", "teams"=>[{"id"=>17, "uid"=>"s:1~l:10~t:17", "location"=>"Cincinnati", "name"=>"Reds", "abbreviation"=>"CIN", "color"=>"D60042", "venues"=>[{"id"=>83, "name"=>"Great American Ball Park", "city"=>"Cincinnati", "state"=>"Ohio", "country"=>"", "capacity"=>0}], "links"=>{"api"=>{"teams"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17"}, "news"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news"}, "notes"=>{"href"=>"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes"}}, "web"=>{"teams"=>{"href"=>"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public"}}, "mobile"=>{"teams"=>{"href"=>"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public"}}}}]}]}]
But nothing else parses. Please help!
Like I said, I'm not trying to do anything fancy, just figure out Ruby a little for fun, but I have been stuck on this issue for days now. Any help would be appreciated!
EDIT:
JSON straight from the API:
{"sports" :[{"name" :"baseball","id" :1,"uid" :"s:1","leagues" :[{"name" :"Major League Baseball","abbreviation" :"mlb","id" :10,"uid" :"s:1~l:10","groupId" :9,"shortName" :"MLB","teams" :[{"id" :17,"uid" :"s:1~l:10~t:17","location" :"Cincinnati","name" :"Reds","abbreviation" :"CIN","color" :"D60042","venues" :[{"id" :83,"name" :"Great American Ball Park","city" :"Cincinnati","state" :"Ohio","country" :"","capacity" :0}],"links" :{"api" :{"teams" :{"href" :"http://api.espn.com/v1/sports/baseball/mlb/teams/17"},"news" :{"href" :"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news"},"notes" :{"href" :"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes"}},"web" :{"teams" :{"href" :"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public"}},"mobile" :{"teams" :{"href" :"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public"}}}}]}]}],"resultsOffset" :0,"resultsLimit" :50,"resultsCount" :1,"timestamp" :"2013-08-05T19:44:32Z","status" :"success"}
The result of data.inspect:
"{\"sports\" :[{\"name\" :\"baseball\",\"id\" :1,\"uid\" :\"s:1\",\"leagues\" :[{\"name\" :\"Major League Baseball\",\"abbreviation\" :\"mlb\",\"id\" :10,\"uid\" :\"s:1~l:10\",\"groupId\" :9,\"shortName\" :\"MLB\",\"teams\" :[{\"id\" :17,\"uid\" :\"s:1~l:10~t:17\",\"location\" :\"Cincinnati\",\"name\" :\"Reds\",\"abbreviation\" :\"CIN\",\"color\" :\"D60042\",\"venues\" :[{\"id\" :83,\"name\" :\"Great American Ball Park\",\"city\" :\"Cincinnati\",\"state\" :\"Ohio\",\"country\" :\"\",\"capacity\" :0}],\"links\" :{\"api\" :{\"teams\" :{\"href\" :\"http://api.espn.com/v1/sports/baseball/mlb/teams/17\"},\"news\" :{\"href\" :\"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news\"},\"notes\" :{\"href\" :\"http://api.espn.com/v1/sports/baseball/mlb/teams/17/news/notes\"}},\"web\" :{\"teams\" :{\"href\" :\"http://espn.go.com/mlb/team/_/name/cin/cincinnati-reds?ex_cid=espnapi_public\"}},\"mobile\" :{\"teams\" :{\"href\" :\"http://m.espn.go.com/mlb/clubhouse?teamId=17&ex_cid=espnapi_public\"}}}}]}]}],\"resultsOffset\" :0,\"resultsLimit\" :50,\"resultsCount\" :1,\"timestamp\" :\"2013-08-05T19:44:24Z\",\"status\" :\"success\"}"
parsed["sports"] does not exist. Parse your input, then inspect or dump it.
With the data you've provided in the question, you can get to the venues information like this:
require 'json'
json = JSON.parse data
json["sports"].first["leagues"].first["teams"].first["venues"]
# => [{"id"=>83, "name"=>"Great American Ball Park", "city"=>"Cincinnati", "state"=>"Ohio", "country"=>"", "capacity"=>0}]
By replacing each of the first calls with an iterator, you can search through without knowing where the data is:
json["sports"].each{|h|
h["leagues"].each{|h|
h["teams"].each{|h|
venues = h["venues"].map{|h| h["name"]}.join(", ")
puts %Q!name: #{h["location"]} #{h["name"]} venues: #{venues}!
}
}
}
This outputs:
name: Cincinnati Reds venues: Great American Ball Park
Depending on how stable the response data is you may be able to cut out several of the iterators:
json["sports"].first["leagues"]
.first["teams"]
.each{|h|
venues = h["venues"].map{|h| h["name"] }.join(", ")
puts %Q!name: #{h["location"]} #{h["name"]} venues: #{venues}!
}
and you'll most likely want to save the data, so something like each_with_object is helpful:
team_and_venues = json["sports"].first["leagues"]
.first["teams"]
.each_with_object([]){|h,xs|
venues = h["venues"].map{|h| h["name"]}.join(", ")
xs << %Q!name: #{h["location"]} #{h["name"]} venues: #{venues}!
}
# => ["name: Cincinnati Reds venues: Great American Ball Park"]
team_and_venues
# => ["name: Cincinnati Reds venues: Great American Ball Park"]
Notice that every block reuses h as its parameter name. When a block declares a parameter, even if there is a variable with the same name outside the block, the scope of the block is respected and the block's variable remains local to it.
That's some pretty ugly code if you ask me, but it's a place to start.
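A somewhat tidier variant, offered as a sketch: Hash#dig (Ruby 2.3 and later) walks nested hashes and arrays and returns nil when a step is missing, and flat_map flattens each level without nested blocks. The JSON below is a trimmed-down copy of the response in the question:

```ruby
require 'json'

# Trimmed-down copy of the API response from the question.
data = <<~EOT
  {"sports": [{"name": "baseball", "leagues": [{"name": "Major League Baseball",
    "teams": [{"id": 17, "location": "Cincinnati", "name": "Reds",
      "venues": [{"id": 83, "name": "Great American Ball Park"}]}]}]}]}
EOT

json = JSON.parse(data)

# dig returns nil instead of raising if any step along the path is missing.
first_venue = json.dig("sports", 0, "leagues", 0, "teams", 0, "venues", 0, "name")

# Collect every venue name across all sports, leagues, and teams.
venue_names = json.fetch("sports", [])
  .flat_map { |sport| sport.fetch("leagues", []) }
  .flat_map { |league| league.fetch("teams", []) }
  .flat_map { |team| team.fetch("venues", []) }
  .map { |venue| venue["name"] }

puts first_venue
puts venue_names.inspect
```

The fetch calls with an empty-array default are a design choice to keep the chain from failing when a level is absent; plain [] lookups would raise NoMethodError on nil instead.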

Grab xml-xpath by attribute

http://www.mdr.de/export/sandmann/folgen/sandmann612-mediaRss_doca-1_zc-1a3071ad.xml returns, besides others, these lines:
(...)
<media:content url="http://x4100mp4dynonlc22033.f.o.l.lb.core-cdn.net/22033mdr/ondemand/4100mp4dynonl/FCMS-066eb3e7-81b2-4dae-898d-4963137eb4b6-e9ebd6e42ce1.mp4" type="video/mpeg" expression="full" width="512" height="288" bitrate="512" duration="398" />
<media:content url="http://x4100mp4dynonlc22033.f.o.l.lb.core-cdn.net/22033mdr/ondemand/4100mp4dynonl/FCMS-066eb3e7-81b2-4dae-898d-4963137eb4b6-c7cca1d51b4b.mp4" type="video/mpeg" expression="full" width="960" height="544" bitrate="1536" duration="398" />
(...)
How would I tell Nokogiri to extract only the line where bitrate="1536"?
I'd actually just need the URL within that XPath, so I expect (I find it rather rude to write "expect" here, but I was told to do so ;) the following string returned:
http://x4100mp4dynonlc22033.f.o.l.lb.core-cdn.net/22033mdr/ondemand/4100mp4dynonl/FCMS-066eb3e7-81b2-4dae-898d-4963137eb4b6-c7cca1d51b4b.mp4
If someone is interested, this will allow me to download the daily episode of the Sandmännchen, a German TV miniseries for little kids. :)
So far I have tried using simpleRSS with this:
(...)
rss.entries.each do |entry|
pp entry
end
But that only returns the first item of the media:group "set" of links:
{:title=>"Sandmann vom 14. Oktober 2012",
:link=>"http://www.mdr.de/export/sandmann/folgen/video78338.html",
:description=>
"Die j\xC3\xBCngste Geschichte vom Sandmann gibt es f\xC3\xBCr 24 Stunden hier auf Abruf. Heute: Molly mag keine Schuhe. Das finden die anderen Monster merkw\xC3\xBCrdig, weil Monster Schuhe lieben.",
:pubDate=>2012-09-19 14:54:43 +0200,
:guid=>
"mp4:4100mp4dynonl/FCMS-066eb3e7-81b2-4dae-898d-4963137eb4b6-8442e17c3177",
:media_content_url=>
"rtmp://x4100mp4dynonlc22033.f.o.f.lb.core-cdn.net/22033mdr/ondemand",
:media_content_type=>"fms/h264",
:media_content_height=>"272",
:media_content_width=>"480",
:media_title=>"Sandmann vom 14. Oktober 2012",
:media_thumbnail_url=>
"http://www.mdr.de/export/sandmann/folgen/sandmann864_v-standard43_zc-698fff06.jpg",
:media_thumbnail_height=>"135",
:media_thumbnail_width=>"180"}
How about this:
doc.at_xpath('//media:content[@bitrate="1536"]/@url').text
#=> "http://www.mdr.de/export/sandmann/folgen/sandmann612-mediaRss__zc-1a3071ad.xml"
The link by the way doesn't work, so I wasn't actually able to test this on the full document.
UPDATE:
Using the info from your answer below, in nokogiri:
filme = Nokogiri::XML(open('http://www.sandmann.de/static/san/app/filme.xml'))
folge = Nokogiri::XML(open(filme.xpath('//filme/folge').text))
folge.at_xpath('//media:content[@bitrate="1536"]/@url').text
#=> "http://x4100mp4dynonlc22033.f.o.l.lb.core-cdn.net/22033mdr/ondemand/4100mp4dynonl/FCMS-066eb3e7-81b2-4dae-898d-4963137eb4b6-c7cca1d51b4b.mp4"
This is what I came up with in the end - no nokogiri (which, I assume, is very powerful but has a rather steep learning curve. Plus, I simply don't understand it...) but crack instead. It seems to be more rubyish and plays along nicely with the MRSS feed I am getting:
require 'rubygems'
require 'pp'
require 'crack'
require 'asciify'
require 'open-uri'
fileurl = ""
filme = Crack::XML.parse(open('http://www.sandmann.de/static/san/app/filme.xml'))
folge = Crack::XML.parse(open(filme['filme']['folge']))
titel = folge['rss']['channel']['item']['description'].to_s.sub(/.*Die jüngste Geschichte vom Sandmann gibt es für 24 Stunden hier auf Abruf. Heute: /, '')
folge['rss']['channel']['item']['media:group']['media:content'].each do |x|
fileurl << x['url'] if x['bitrate'] == "1536"
end
filename = titel.split(".").first.asciify + ".m4v"
filename.gsub!(" ","_")
system("curl -o \"#{filename}\" \"#{fileurl}\"")
Just in case your kids want to watch, too ;)
To make it easy, simply:
doc.at('content[bitrate="1536"]')[:url]
require 'nokogiri'
require 'open-uri'
url = 'http://www.mdr.de/export/sandmann/folgen/sandmann612-mediaRss_doca-1_zc-1a3071ad.xml'
doc = Nokogiri.XML(open(url))
doc.remove_namespaces! # Just to make our life simpler
content = doc.at_css('content[bitrate="1536"]')
puts content['url']
#=> http://x4100mp4dynonlc22033.f.o.l.lb.core-cdn.net/22033mdr/ondemand/4100mp4dynonl/FCMS-fd2af820-ec90-4f34-a58e-db1b9fdcc25a-c7cca1d51b4b.mp4

trying to get content inside cdata tags in xml file using nokogiri

I have seen several things on this, but nothing has seemed to work so far. I am parsing an xml via a url using nokogiri on rails 3 ruby 1.9.2.
A snippet of the xml looks like this:
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
I am trying to parse this out to get the text associated with the NewsLineText
r = node.at_xpath('.//newslinetext') if node.at_xpath('.//newslinetext')
s = node.at_xpath('.//newslinetext').text if node.at_xpath('.//newslinetext')
t = node.at_xpath('.//newslinetext').content if node.at_xpath('.//newslinetext')
puts r
puts s.blank? ? 'NOTHING' : s
puts t.blank? ? 'NOTHING' : t
What I get in return is
<newslinetext></newslinetext>
NOTHING
NOTHING
So I know my tags are named/spelled correctly to get at the newslinetext data, but the cdata text never shows up.
What do I need to do with nokogiri to get this text?
You're trying to parse XML using Nokogiri's HTML parser. If node were from the XML parser then r would be nil, since XML is case sensitive; your r is not nil, so you're using the HTML parser, which is case insensitive.
Use Nokogiri's XML parser and you will get things like this:
>> r = doc.at_xpath('.//NewsLineText')
=> #<Nokogiri::XML::Element:0x8066ad34 name="NewsLineText" children=[#<Nokogiri::XML::Text:0x8066aac8 "\n ">, #<Nokogiri::XML::CDATA:0x8066a9c4 "\n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n ">, #<Nokogiri::XML::Text:0x8066a8d4 "\n">]>
>> r.text
=> "\n \n Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.\n \n"
and you'll be able to get at the CDATA through r.text or r.children.
Ah I see. What @mu said is correct. But to get at the cdata directly, maybe:
xml =<<EOF
<NewsLineText>
<![CDATA[
Anna Kendrick is ''obsessed'' with 'Game of Thrones' and loves to cook, particularly creme brulee.
]]>
</NewsLineText>
EOF
node = Nokogiri::XML xml
cdata = node.search('NewsLineText').children.find{|e| e.cdata?}

Resources