Nokogiri scraping talk from AudioDharma page - ruby

Here is my code:
doc = Nokogiri::HTML(open("https://feeds.feedburner.com/audiodharma"))
talks = doc.css(".regularitem")
The css seems pretty straightforward, so I can't figure out why I keep getting an empty array for 'talks'. Let me know if you see something I'm not seeing; Nokogiri beginner here. Thanks.

If you try to use cURL or another method to fetch the content directly, you will see your XML starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2enclosuresfull.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>
<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/" xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
...
As you can see, it's an XML file, not an HTML file. I'm not sure exactly why, but Nokogiri doesn't handle it properly when it opens the URI itself: the feed is XML with some styling instructions for browsers. It usually reads fine; I don't know the specs in more depth.
One solution I found was to fetch the URL with RestClient; once the content was loaded that way, it parsed properly. You should also use Nokogiri::XML, so that the document is parsed as what it actually is. Then searching with the css method works without a problem:
require 'nokogiri'
require 'rest-client'
doc = Nokogiri::XML(RestClient.get("https://feeds.feedburner.com/audiodharma").body)
doc.css('.regularitem') # => has valid Nokogiri output
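The point that the feed is XML, so its entries are plain <item> elements, can be checked offline. The following is a minimal sketch, not the code from the answer: it uses REXML from Ruby's standard distribution instead of Nokogiri so that it runs without extra gems, and the sample feed, titles, and URLs are made up for illustration.

```ruby
require "rexml/document"

# Made-up stand-in for the real feed; only the RSS 2.0 skeleton matters here.
sample_feed = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Sample Talks</title>
      <item>
        <title>First talk</title>
        <link>https://example.org/talks/1.mp3</link>
      </item>
      <item>
        <title>Second talk</title>
        <link>https://example.org/talks/2.mp3</link>
      </item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(sample_feed)

# Query by XML element name: each entry of an RSS feed is an <item> element.
talks = REXML::XPath.match(doc, "//item").map do |item|
  { title: item.elements["title"].text, link: item.elements["link"].text }
end

talks.each { |talk| puts "#{talk[:title]} (#{talk[:link]})" }
```

With Nokogiri, the equivalent query on a document parsed with Nokogiri::XML would be doc.css('item') or doc.xpath('//item').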

Related

Ruby: how to generate HTML from Markdown like GitHub's or BitBucket's?

On the main page of every repository in GitHub or BitBucket it shows the Readme.md in a very pretty format.
Is there a way to do the same thing with Ruby? I have already found some gems like Redcarpet, but the result never looks pretty. I've followed these instructions for Redcarpet.
Edit:
After I tried Github's markup ruby gem, the same thing is happening.
What is shown is this:
And what I want is this:
And I'm sure it's not only the CSS that is missing, because after the three backquotes (```) I write the language name, like json or bash, and in the first image that name is printed as plain text.
Edit2:
This code here:
renderer = Redcarpet::Render::HTML.new(prettify: true)
markdown = Redcarpet::Markdown.new(renderer, fenced_code_blocks: true)
html = markdown.render(source_text)
'<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>'+html
Generated this:
Github provides its own ruby gem to do so: https://github.com/github/markup.
You just need to install the right dependencies and you're good to go.
You need to enable a few nonstandard features.
Fenced code blocks
Fenced code blocks are nonstandard and are not enabled by default on most Markdown parsers (some older ones don't support them at all). According to Redcarpet's docs, you want to enable the fenced_code_blocks extension:
:fenced_code_blocks: parse fenced code blocks, PHP-Markdown style. Blocks delimited with 3 or more ~ or backticks will be considered as code, without the need to be indented. An optional language name may be added at the end of the opening fence for the code block.
Syntax Highlighting
Most Markdown parsers do not do syntax highlighting of code blocks, and those that do always offer it as an option. Even then, you will still need to provide your own CSS styles to have the code blocks styled properly. As it turns out, Redcarpet does include support for a prettify option to the HTML renderer:
:prettify: add prettyprint classes to <code> tags for google-code-prettify.
You will need to get the Javascript and CSS from the google-code-prettify project to include in your pages.
Solution
In the end you'll need something like this:
renderer = Redcarpet::Render::HTML.new(prettify: true)
markdown = Redcarpet::Markdown.new(renderer, fenced_code_blocks: true)
html = markdown.render(source_text)
As @yoones said, GitHub shares how they do it; to be more precise, they use the "commonmarker" gem for Markdown. As far as I can tell, though, it does not produce a complete HTML file, only a fragment that you insert into <body>. So you can do it like I did:
require "commonmarker"
puts <<~HEREDOC
<!DOCTYPE html>
<html>
<head>
<style>#{File.read "markdown.css"}</style>
</head>
<body class="markdown-body Box-body">
#{CommonMarker.render_html ARGF.read, %i{ DEFAULT UNSAFE }, %i{ table }}
</body>
</html>
HEREDOC
Where did I get the markdown.css? I just stole the CSS files from an arbitrary Github page with README rendered and applied UNCSS to it -- resulted in a 26kb file, you can find it in the same repo I just linked.
Why the table and UNSAFE? I need them to render an index.html for GitHub Pages, because their Markdown renderer can't render newlines within table cells, etc. So instead of asking it to render my README.md, I make the index.html myself.

Unable to get certain elements from RSS feed

I'm having trouble with the default ruby rss parser, specifically getting a couple elements from the Rogan podcast feed:
http://joeroganexp.joerogan.libsynpro.com/rss
open(url) do |rss|
  feed = RSS::Parser.parse(rss)
end
Now when I look at feed, I can't find the following:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:cc="http://web.resource.org/cc/" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
And strangely enough I can get all the channel data except the last one (itunes:type):
<itunes:type>episodic</itunes:type>
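One way to get at data that the parsed feed object does not expose is to read the raw XML with a generic parser. This is a minimal sketch under assumptions: it uses REXML rather than RSS::Parser, and the inline feed is a made-up stand-in for the real one that keeps only the itunes namespace declaration and the itunes:type element quoted above.

```ruby
require "rexml/document"

ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"

# Made-up stand-in for the podcast feed.
sample_feed = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel>
      <title>Sample Podcast</title>
      <itunes:type>episodic</itunes:type>
    </channel>
  </rss>
XML

doc = REXML::Document.new(sample_feed)

# Namespace declarations on the root <rss> element, as a prefix => URI hash.
root_namespaces = doc.root.namespaces

# Look up the namespaced element by binding the prefix to its URI explicitly.
type_node = REXML::XPath.first(doc, "/rss/channel/itunes:type", { "itunes" => ITUNES_NS })
itunes_type = type_node && type_node.text

puts "rss version: #{doc.root.attributes['version']}"
puts "itunes namespace: #{root_namespaces['itunes']}"
puts "itunes:type: #{itunes_type}"
```

Binding the prefix in the namespaces hash keeps the lookup tied to the namespace URI rather than to whatever prefix a given feed happens to use.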

Adding Images or Thumbnails to Atom 1.0 Entries

This StackOverflow answer suggests that you should use HTML entry content and use a standard <img> tag to link to your images.
<content type="html">
<![CDATA[
<a href="http://test.lvh.me:3000/listings/341-test-pics?locale=en">
<img alt="test_pic" src="http://test.lvh.me:3000/system/images/20/medium/test_pic.jpg?1343246102" />
</a>
]]>
</content>
I have also found something called the Yahoo media extensions here which allows you to add custom additional elements.
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
<!-- omitted -->
<entry>
<!-- omitted -->
<media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="path_to_image.jpg" />
</entry>
</feed>
Google also seems to have its own similar extensions. See here.
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:g="http://base.google.com/ns/1.0">
<!-- omitted -->
<entry>
<!-- omitted -->
<g:image_link>http://www.google.com/images/google_sm.gif</g:image_link>
</entry>
</feed>
My own intuition tells me I should simply be able to add links to images like so:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<!-- omitted -->
<entry>
<!-- omitted -->
<link rel="enclosure" type="image/png" length="1337"
href="http://example.org/image.png"/>
</entry>
</feed>
What is the correct approach for maximum compatibility?
The best practice is to do what WordPress RSS 2.0 feeds do: if you want your post image to appear in feedly, for example, put the <p><img...></p> at the top of the content. My Eleventy setup has the post header image inside the article but outside the content variable, whose contents are used in the feed. I solve the problem by adding the image back:
<item>
...
<content:encoded>
<![CDATA[<p>{% include "src/components/partials/post-hero-img.njk" %}</p>{{ post.templateContent | textDeletePresentationDivs | htmlToAbsoluteUrls(absolutePostUrl) | safe }}]]>
</content:encoded>
source in git
I checked, neither Atom nor RSS 2.0 feeds have post images set anywhere as standalone tags. They're simply at the top of the article's content.
With regards to your examples...
The "vanilla" Atom RSS feed has a schema xmlns="http://www.w3.org/2005/Atom" and its documentation is defined in RFC4287.
According to it, a "vanilla" Atom feed can only have <logo>, a 2:1 ratio image that serves as the logo of the feed. Sadly, it is placed at the root of the XML (notice atom:logo in the spec, it's not atom:entry:logo). In practice, this means you can set a picture for the feed itself, but not per article. If you do put <logo> inside <entry>, the feed won't pass the validators and the post image won't appear in feedly (I tried).
The spec also defines <icon>, vaguely described as a small, square image, also placed at the root. Feedly seems to detect the website's favicon anyway, although it doesn't hurt to set this tag explicitly in the feed.
That's all there is: the Atom spec doesn't officially define a way to attach images per article.
Here's where additional namespaces come in (or RSS 2.0: different spec, different XML). You mentioned xmlns:media="http://search.yahoo.com/mrss/" in your example. I tried it, and post images don't show in feedly. Plus, the spec link http://search.yahoo.com/mrss/ does not show any specs.
The Google namespace you quoted, xmlns:g="http://base.google.com/ns/1.0", also doesn't work; post images don't show up in feedly.
The link approach, <link rel="enclosure" type="image/png" length="1337" href="http://example.org/image.png"/>, would be promising, except that length is meant to state the file size in bytes. In Eleventy, for example, that value is awkward to obtain.
To sum up, the best practice is to put the post header image at the top of the content, inside <content>.
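The recommendation above can be sketched in Ruby as follows. This is an illustration only, not the Eleventy template from the answer: atom_content_with_image is a hypothetical helper, the URL and text are made up, and it assumes the post body is already valid HTML.

```ruby
require "erb"

# Hypothetical helper: builds an Atom <content> element whose HTML starts with the post image.
def atom_content_with_image(image_url, alt_text, body_html)
  src = ERB::Util.html_escape(image_url)
  alt = ERB::Util.html_escape(alt_text)
  html = %(<p><img src="#{src}" alt="#{alt}" /></p>#{body_html})
  # A literal "]]>" would end the CDATA section early, so split it across two sections.
  safe_html = html.gsub("]]>", "]]]]><![CDATA[>")
  %(<content type="html"><![CDATA[#{safe_html}]]></content>)
end

entry_content = atom_content_with_image(
  "https://example.org/images/hero.png",
  "Post hero image",
  "<p>First paragraph of the post.</p>"
)
puts entry_content
```

The image ends up as the first element of the entry's HTML, which is where the answer reports feed readers such as feedly pick it up.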

Nokogiri to_xhtml puts doctype before <?xml

I'm trying to use Nokogiri to parse and update some xhtml files (fixing image sizes).
The parsing and updating works well but when I save the document with:
doc.to_xhtml(:indent_text => "\t", :indent=>1, :encoding => 'UTF-8')
The first two lines change from (original):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
to (output):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<?xml version="1.0" encoding="utf-8"??>
which isn't a valid xml document (and there's also a double ? at the end of the xml tag).
Am I doing something wrong?
Edit: I've got nokogiri (1.6.0) installed, which seems to be the latest version.
This problem is an open (though very old) Nokogiri issue on Github, though it may in fact be a libxml issue. I was able to replicate your output.
The quick fix is to parse your document with Nokogiri::XML rather than Nokogiri::HTML, which is probably better practice anyway when dealing with XHTML files:
doc = Nokogiri::XML(open 'wherever')
doc.to_xhtml(:indent_text => "\t", :indent=>1, :encoding => 'UTF-8')
Note that this won't preserve your XML processing instruction. If you need it, use to_xml.
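If the document has to stay on the Nokogiri::HTML path, another option is to repair the serialized string afterwards. This is only a sketch of a workaround, not something taken from the Nokogiri documentation: fix_xhtml_prolog is a hypothetical helper that works on plain strings, and the sample input imitates the broken output shown in the question.

```ruby
# Hypothetical helper: moves the XML declaration to the front and repairs a doubled "??>".
def fix_xhtml_prolog(xhtml)
  declaration_pattern = /<\?xml\s[^>]*?\?+>\s*/
  declaration = xhtml[declaration_pattern]
  return xhtml unless declaration

  # Remove the declaration from wherever it is, then put a cleaned copy first.
  rest = xhtml.sub(declaration_pattern, "")
  repaired = declaration.strip.sub(/\?+>\z/, "?>")
  "#{repaired}\n#{rest}"
end

# Imitation of the broken serializer output from the question.
broken = <<~XHTML
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
  <?xml version="1.0" encoding="utf-8"??>
  <html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head><body></body></html>
XHTML

fixed = fix_xhtml_prolog(broken)
puts fixed
```

The pattern requires whitespace after "<?xml", so other processing instructions such as xml-stylesheet are left where they are.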

contentEditable on nodes in a XML/compound document?

I have an XML document that I'm displaying in a web browser, with a stylesheet attached:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet type="text/css" href="abc.css"?>
<myxml xmlns:xhtml="http://www.w3.org/1999/xhtml">
<para>I wish i was editable</para>
<xhtml:script type="text/javascript" src="abc.js"/>
</myxml>
With the xhtml namespace declaration, and xhtml:script tag, I can execute javascript.
What I'd like to do is to make arbitrary non-XHTML elements in this document content editable. (Actually, they'll be in another namespace)
Even if I explicitly add @contentEditable="true" (i.e. without resorting to JavaScript), the content is not actually editable (in Firefox 3.0.4).
Is it possible to edit it in any of the current browsers? (I had no problems with <div contentEditable="true">Edit me</div> in an XHTML 1.0 Transitional doc)
I can't even edit an xhtml:div in this document (in Firefox); if I could do that, that may offer a way forward.
In Firefox 3, @content-editable="true" only makes the relevant element editable if the content type is text/html (which is also the case when a local filename ends with .html).
It doesn't work for the content types application/xhtml+xml or text/xml (local filenames ending with .xhtml or .xml).
I've logged an enhancement for this: https://bugzilla.mozilla.org/show_bug.cgi?id=486931
contentEditable works (tested in Firefox and Chrome) on elements which are foreign to html/xhtml if I use this doctype:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
and a .html file extension (instead of .xml).
I don't have to include any html elements at all (eg head, body, div, p).
css isn't applied though (if my xml is in a namespace, which i guess makes sense, given the doctype!).
Not an elegant solution.
Firefox is one of the few browsers that strictly enforces the XHTML spec. So, to make an element editable, you must specify the contenteditable attribute as true. Note that the whole attribute name is lower case. In your example the first "E" in editable was capitalized.
Another quirk worth mentioning is that IE (6, 7, 8) acts in exactly the opposite way. To make an element editable in IE, you MUST add contentEditable="true" exactly. For whatever reason, contenteditable="true" (as well as any other variation in capitalization) does not work.
