Unable to get certain elements from RSS feed - ruby

I'm having trouble with the default ruby rss parser, specifically getting a couple elements from the Rogan podcast feed:
http://joeroganexp.joerogan.libsynpro.com/rss
open(url) do |rss|
  feed = RSS::Parser.parse(rss)
end
Now when I look at feed, I can't find the following:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:cc="http://web.resource.org/cc/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
And strangely enough I can get all the channel data except the last one (itunes:type):
<itunes:type>episodic</itunes:type>
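One way to check whether the data is in the feed at all is to read it with a namespace-aware XPath query. This is only a minimal sketch using REXML instead of RSS::Parser; the inline feed is a trimmed, hypothetical stand-in for the real one, and the helper names are made up for illustration.

```ruby
require 'rexml/document'

# Trimmed, hypothetical stand-in for the real feed. Only the namespace
# URI and the element names are taken from the question.
FEED_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel>
      <title>Example Podcast</title>
      <itunes:type>episodic</itunes:type>
    </channel>
  </rss>
XML

ITUNES_NS = { 'itunes' => 'http://www.itunes.com/dtds/podcast-1.0.dtd' }.freeze

# Returns the text of <itunes:type> in the channel, or nil if it is absent.
def itunes_type(xml)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, '/rss/channel/itunes:type', ITUNES_NS)
  node && node.text
end

# Returns the version attribute and the namespace declarations of the <rss> root.
def rss_root_info(xml)
  root = REXML::Document.new(xml).root
  { version: root.attributes['version'], namespaces: root.namespaces }
end
```

If both helpers return what you expect on the real feed text, the data is there and the gap is in what the RSS parser exposes rather than in the feed.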

Related

Nokogiri scraping talk from AudioDharma page

Here is my code:
doc = Nokogiri::HTML(open("https://feeds.feedburner.com/audiodharma"))
talks = doc.css(".regularitem")
The CSS seems pretty straightforward, so I can't figure out why I keep getting an empty array for 'talks'. Let me know if you see something I'm not-- Nokogiri beginner here. Thanks.
If you try to use cURL or another method to fetch the content directly, you will see your XML starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2enclosuresfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
...
As you can see, it's an XML file, not an HTML file. I'm not sure why, but Nokogiri doesn't handle it properly when it opens the URI by itself; it's XML with some stylesheet instructions for browsers. It usually reads fine, though I don't know the specs in more depth.
One solution I found was to load the URL with RestClient; once the content was loaded, it parsed properly. Since the document is XML, you should also parse it with Nokogiri::XML rather than Nokogiri::HTML. Then the CSS search works without a problem:
doc = Nokogiri::XML(RestClient.get("https://feeds.feedburner.com/audiodharma"))
doc.css('.regularitem') # => has valid Nokogiri output
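To see what parsing feed text as XML gives you, here is a minimal sketch that uses REXML rather than Nokogiri, so it runs without extra gems. The sample feed and the item_titles helper are hypothetical; the sketch makes no assumption about whether the real feed carries a regularitem class in its raw XML and simply walks the item elements.

```ruby
require 'rexml/document'

# Hypothetical sample shaped like the start of the feed shown above:
# an XML declaration, a stylesheet instruction, then plain RSS 2.0.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <?xml-stylesheet type="text/css" media="screen" href="itemcontent.css"?>
  <rss version="2.0">
    <channel>
      <title>Sample talks</title>
      <item><title>First talk</title></item>
      <item><title>Second talk</title></item>
    </channel>
  </rss>
XML

# Parses the text as XML and returns the title of every <item>.
def item_titles(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//item').map do |item|
    title = item.elements['title']
    title && title.text
  end
end
```

The stylesheet instructions at the top are ordinary processing instructions, so an XML parser skips over them and still reaches the items.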

Adding Images or Thumbnails to Atom 1.0 Entries

This StackOverflow answer suggests that you should use HTML entry content and use a standard <img> tag to link to your images.
<content type="html">
<![CDATA[
<a href="http://test.lvh.me:3000/listings/341-test-pics?locale=en">
<img alt="test_pic" src="http://test.lvh.me:3000/system/images/20/medium/test_pic.jpg?1343246102" />
</a>
]]>
</content>
I have also found the Yahoo media extensions (here), which allow you to add custom elements.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
<!-- omitted -->
<entry>
<!-- omitted -->
<media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="path_to_image.jpg" />
</entry>
</feed>
Google also seems to have its own similar extensions. See here.
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:g="http://base.google.com/ns/1.0">
<!-- omitted -->
<entry>
<!-- omitted -->
<g:image_link>http://www.google.com/images/google_sm.gif</g:image_link>
</entry>
</feed>
My own intuition tells me I should simply be able to add links to images like so:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<!-- omitted -->
<entry>
<!-- omitted -->
<link rel="enclosure" type="image/png" length="1337"
href="http://example.org/image.png"/>
</entry>
</feed>
What is the correct approach for maximum compatibility?
The best practice is to do what WordPress RSS 2.0 feeds do: if you want your post image to appear in Feedly, for example, put the <p><img...></p> at the top of the content. In my Eleventy setup the post header image sits inside the article but outside the content variable that the feed uses, so I add the image back:
<item>
...
<content:encoded>
<![CDATA[<p>{% include "src/components/partials/post-hero-img.njk" %}</p>{{ post.templateContent | textDeletePresentationDivs | htmlToAbsoluteUrls(absolutePostUrl) | safe }}]]>
</content:encoded>
source in git
I checked: neither the Atom nor the RSS 2.0 feeds have post images set anywhere as standalone tags. They're simply at the top of the article's content.
With regards to your examples...
The "vanilla" Atom feed uses the namespace xmlns="http://www.w3.org/2005/Atom" and is specified in RFC 4287.
According to the RFC, a "vanilla" Atom feed can have a <logo>, an image with a 2:1 aspect ratio that identifies the feed. Sadly, it belongs at the feed level, in the root of the XML (notice atom:logo in the spec; there is no atom:entry:logo). In practice this means you can set a picture for the feed itself, but not per article. If you do put <logo> inside <entry>, the feed won't pass the validators and the post image won't appear in Feedly (I tried).
The spec also defines <icon>, loosely described as a small, square image, which is likewise placed in the root. Feedly seems to detect the website's favicon anyway, although it doesn't hurt to set this tag explicitly in the feed.
That's all there is: the Atom spec doesn't officially define a way to attach images per article.
Here's where additional namespaces come in (or RSS 2.0, a different spec with different XML). You mentioned xmlns:media="http://search.yahoo.com/mrss/" in your example. I tried it, and post images don't show in Feedly. Plus, the spec link http://search.yahoo.com/mrss/ doesn't show any specs.
The Google namespace you quoted, xmlns:g="http://base.google.com/ns/1.0", doesn't work either; post images don't show up in Feedly.
The link approach, <link rel="enclosure" type="image/png" length="1337" href="http://example.org/image.png"/>, would be promising, except that length is meant to state the file size in bytes. In Eleventy, for example, that's a problematic value to get.
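Where the image file is available on disk at build time, the byte count itself is easy to read. The following is a rough sketch in Ruby rather than Eleventy's JavaScript; enclosure_link is a hypothetical helper and the file path is whatever your build produces. RFC 4287 treats length as an optional, advisory hint, so leaving it out is also valid.

```ruby
# Builds an Atom enclosure link whose length attribute is the
# file size in bytes, read from a local copy of the image.
# String#encode(xml: :attr) escapes & < > " and adds the surrounding quotes.
def enclosure_link(href, file_path, mime_type)
  length = File.size(file_path)
  "<link rel=\"enclosure\" type=#{mime_type.encode(xml: :attr)} " \
    "length=\"#{length}\" href=#{href.encode(xml: :attr)}/>"
end
```

This only helps when the image is a local file; for remote images the size would have to come from somewhere else.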
To sum up, the best practice is to put the post header image at the top of the content, inside <content>.
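As an illustration of that summary, here is a small sketch that prepends the image to the entry's HTML and wraps the result in a CDATA section. It is written in Ruby, not in the Nunjucks template shown earlier, and entry_content_with_image is a hypothetical helper name.

```ruby
# Returns an Atom <content type="html"> element with the header image
# placed at the top of the entry's HTML.
def entry_content_with_image(image_url, alt_text, body_html)
  # encode(xml: :attr) escapes & < > " and adds the surrounding quotes.
  img = "<p><img src=#{image_url.encode(xml: :attr)} alt=#{alt_text.encode(xml: :attr)} /></p>"
  html = img + body_html
  # A CDATA section cannot contain "]]>", so split any occurrence
  # across two adjacent CDATA sections.
  safe = html.gsub(']]>', ']]]]><![CDATA[>')
  "<content type=\"html\"><![CDATA[#{safe}]]></content>"
end
```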

Nokogiri to_xhtml puts doctype before <?xml

I'm trying to use Nokogiri to parse and update some xhtml files (fixing image sizes).
The parsing and updating works well but when I save the document with:
doc.to_xhtml(:indent_text => "\t", :indent=>1, :encoding => 'UTF-8')
The first two lines change from (original):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
to (output):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<?xml version="1.0" encoding="utf-8"??>
which isn't a valid xml document (and there's also a double ? at the end of the xml tag).
Am I doing something wrong?
Edit: I've got nokogiri (1.6.0) installed, which seems to be the latest version.
This problem is an open (though very old) Nokogiri issue on GitHub, though it may in fact be a libxml issue. I was able to replicate your output.
The quick fix is to parse your document with Nokogiri::XML rather than Nokogiri::HTML, which is probably better practice anyway when dealing with XHTML files:
doc = Nokogiri::XML(open 'wherever')
doc.to_xhtml(:indent_text => "\t", :indent=>1, :encoding => 'UTF-8')
Note that this won't preserve your XML processing instruction. If you need it, use to_xml.
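If you have to keep the output of the HTML parser as it is, another option is to repair the serialized string afterwards. This is a plain string-level workaround, not something Nokogiri provides, and move_xml_declaration_first is a hypothetical helper; it assumes the output has the shape shown in the question (doctype first, then an XML declaration that may end in a doubled ?).

```ruby
# Matches an XML declaration, including one that ends in "??>".
# The \s after "xml" keeps it from matching <?xml-stylesheet ...?>.
XML_DECL = /<\?xml\s[^>]*\?+>\s*/

# Moves the XML declaration to the first line and normalises its ending.
# Returns the input unchanged when there is no declaration.
def move_xml_declaration_first(xhtml)
  decl = xhtml[XML_DECL]
  return xhtml if decl.nil?

  body = xhtml.sub(XML_DECL, '')
  clean = decl.strip.sub(/\?+>\z/, '?>')
  "#{clean}\n#{body}"
end
```

You would call it on the string returned by to_xhtml before writing the file.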

How can I add parameters to the Yahoo MRSS feed generated by Kaltura KMC?

I'm using the Kaltura KMC to generate a Yahoo! MRSS feed (per the info here).
The feed it creates looks like this:
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/" xmlns:dcterms="http://purl.org/dc/terms/">
<channel>
<title>yahoo mrss feed</title>
<link>http://xxxx.com</link>
<description></description>
<item>
<title>My Dog Clip</title>
<link>http://xxxx.com?videoid=0_udwmgjec</link>
<media:content url="http://xxxx.com/p/100/sp/10000/serveFlavor/flavorId/0_e5h0z4cf">
<media:title>My Dog Clip</media:title>
<media:description>Here is a clip of the dog playing!</media:description>
<media:keywords>dog clip</media:keywords>
<media:thumbnail url="http://xxxx.com/p/100/sp/10000/thumbnail/entry_id/0_udwmgjec/version/100002"></media:thumbnail>
<media:category scheme="http://search.yahoo.com/mrss/category_schema">Entertainment &amp; TV</media:category>
<media:player url="http://xxxx.com/kwidget/wid/_100/entry_id/0_udwmgjec/ui_conf_id/48501"></media:player>
<media:rating scheme="urn:simple"></media:rating>
</media:content>
</item>
</channel>
</rss>
This is pretty good, but I see two things that need adjusting:
On the <media:content> tag, I'd like to add the type parameter, indicating the MIME type. Is there a way to do this through the KMC interface?
I'd like to change the default size of the thumbnail it generates (and also add the image suffix, like .jpg, to the end of the URL). Is there an option for that in the KMC?
It seems like I might end up needing to use the API to build the MRSS feed myself on the fly (pulling the video data from Kaltura via the API). What do you think?
Thank you...
You can use the dynamic MRSS feed, which lets you upload your own XSD and modify the original.
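If you end up post-processing the generated feed yourself instead, adding the type attribute is a small transformation. This is only a sketch with REXML, not a Kaltura feature: add_content_type is a hypothetical helper, and the MIME type you pass in (for example video/mp4) is an assumption about what your flavors actually are.

```ruby
require 'rexml/document'

MEDIA_NS = { 'media' => 'http://search.yahoo.com/mrss/' }.freeze

# Adds a type attribute to every <media:content> element in the feed
# and returns the modified XML as a string.
def add_content_type(feed_xml, mime_type)
  doc = REXML::Document.new(feed_xml)
  REXML::XPath.each(doc, '//media:content', MEDIA_NS) do |content|
    content.add_attribute('type', mime_type)
  end
  doc.to_s
end
```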

XSLT to display news item images with consistent size

I have a RSS XML news file, which contains a list of items inclusive of a URL to an image. I also have an associated XSLT.
The problem is that the image sizes are not consistent and I want to limit the image sizes, resize them, to a nice thumbnail.
How would I modify the XSLT to accomplish that?
XML Sample:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version ="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<title>Company Name</title>
<description>Company description</description>
<link>http://www.mycompanyurl.com</link>
<item>
<title>News Item Title</title>
<link>http://www.whateverurl.com/</link>
<category>Space</category>
<pubDate>12 April 1961</pubDate>
<description>Software to reduce your job search to a half hour per day. all major job sites, job boards, classifieds. unemployment paperwork, CRM, interviews, more</description>
<image>
<url>~/App_Data/NewsControl/whatever.png</url>
<title>Whatever1</title>
<link>javascript:void(0)</link>
</image>
<g:id>1</g:id>
<g:brand>Whatever2</g:brand>
<g:condition>whatever3</g:condition>
<g:price>$whatever4</g:price>
<g:product_type>Whatever5</g:product_type>
</item>
</channel>
</rss>
Here is the associated XSLT:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<items>
<xsl:for-each select="//item">
<item Name="{position()}" HeaderText="{title}" Text="{description}" NavigateUrl="{position()}" Date="{pubDate}" ImageUrl="{image/url}"/>
</xsl:for-each>
</items>
</xsl:template>
</xsl:stylesheet>
Results of First Answer:
<items>
<xsl:for-each select="//item">
<item Name="{position()}" HeaderText="{title}" Text="{description}" NavigateUrl="{position()}" Date="{pubDate}" ImageUrl="/Tools/thumber.php?img={image/url}"/>
</xsl:for-each>
</items>
I made these changes, enabled PHP on the server (testing both from the server and locally), and saw 2 issues:
1. I get no image, merely a no-image box.
2. If I try to edit the ImageUrl and tack on a "&W=xxx&H=xxx", the Visual Studio validator complains and throws up errors on the &.
Update 2
Here is the latest line in the XSLT:
http://myserver.com/Tools/thumber.php?img=',image/url)}"/>
The corresponding image section in the XML
<image>
<url>/Products/Jobfish/Images/Boxshots/Jobfish_DVDCaseCD_ShadowOut.jpg</url>
<title>Jobfish</title>
<link>javascript:void(0)</link>
</image>
XSLT has no built-in function for resizing or thumbnailing. You will have to use an external processor for that, e.g. a PHP thumbnail generator.
Then replace the original image path with a URL pointing to your thumbnail generator, with the source being the original image.
Suppose ImageUrl = mediaserver.xyz/ourlogo.jpg
The new ImageUrl would become myserver.com/thumbnailgenerator.php?src=http://mediaserver.xyz/ourlogo.jpg
Please make sure you select a caching thumbnail library (e.g. https://code.google.com/p/phpthumbmaker/wiki/ThumberWiki), since generating thumbnails without caching will be a serious resource hog. Also take copyright issues into account when re-serving these thumbnails.
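The URL rewriting described above can be sketched as follows. This is an illustration in Ruby, not part of the XSLT: thumbnail_url is a hypothetical helper, src is the parameter name used in the example above, and w and h are assumed names for the size parameters, so check what your thumbnail script actually expects. The last helper shows why a validator rejects a bare & in a stylesheet: inside XML or XSLT the separator has to be written as &amp;.

```ruby
require 'uri'

# Builds the thumbnail generator URL with the original image URL
# percent-encoded as a query parameter.
def thumbnail_url(generator, image_url, width:, height:)
  query = URI.encode_www_form(src: image_url, w: width, h: height)
  "#{generator}?#{query}"
end

# Escapes a URL for use inside XML/XSLT text or attribute values,
# turning each & into &amp;.
def xml_escaped(url)
  url.encode(xml: :text)
end
```

For example, thumbnail_url('http://myserver.com/thumbnailgenerator.php', 'http://mediaserver.xyz/ourlogo.jpg', width: 120, height: 90) gives a URL whose parameters are joined with &, and xml_escaped turns those separators into &amp; for pasting into the stylesheet.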
