Map RSS entries to HTML body with non-exact search - algorithm

How would you solve this problem?
You're scraping the HTML of blogs. Some of a blog's HTML is blog posts, some of it is formatting, sidebars, etc. You want to be able to tell which text in the HTML belongs to which post (identified by its permalink), if any.
I know what you're thinking: You could just look at the RSS and ignore the HTML altogether! However, RSS very often contains only very short excerpts or strips away links that you might be interested in. You want to essentially defeat the excerptedness of the RSS by using the HTML and RSS of the same page together.
An RSS entry looks like:
title
excerpt of post body
permalink
A blog post in HTML looks like:
title (surrounded by permalink, maybe)
...
permalink, maybe
...
post body
...
permalink, maybe
So the HTML page contains the same fields, but the placement of the permalink is not known in advance, and the fields will be separated by some noise text that is mostly HTML and whitespace but could also contain additional metadata such as "posted by Johnny" or the date or something like that. The text may also be represented slightly differently in HTML vs. RSS, as described below.
Additional rules/caveats:
Titles may not be unique. This happens more often than you might think. Examples I've seen: "Monday roundup", "TGIF", etc.
Titles may even be left blank.
Excerpts in RSS are also optional, but assume there must be at least either a non-blank excerpt or a non-blank title.
The RSS excerpt may contain the full post content, but more likely contains a short excerpt of the start of the post body.
Assume that permalinks must be unique and must be the same in both HTML and RSS.
The title and the excerpt and post body may be formatted slightly differently in RSS and in HTML. For example:
RSS may have the HTML inside the title or body stripped, or the HTML page may add extra markup (such as wrapping the first letter of the post body in something) or format it slightly differently.
Text may be encoded slightly differently, such as being utf8 in RSS while non-ascii characters in HTML are always encoded using ampersand encoding. However, assume that this is English text where non-ascii characters are rare.
There could be badly encoded Windows-1252 horribleness. This happens a lot for symbol characters like curly quotes. However, it is safe to assume that most of the text is ascii.
There could be case-folding in either direction, especially in the title. So, they could all-uppercase the title in the HTML page but not in RSS.
The number of entries in the RSS feed and the HTML page is not assumed to be the same. Either could have more or fewer older entries. We can only expect to resolve those posts that appear in both.
RSS could be lagged. There may be a new entry in the HTML page that does not appear in the RSS feed yet. This can happen if the RSS is syndicated through Feedburner. Again, we can only expect to resolve those posts that appear in both RSS and HTML.
The body of a post can be very short or very long.
100% accuracy is not a constraint. However, the more accurate the better.
Well, what would you do?
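For concreteness, one possible matching pipeline might look roughly like this: normalize both representations, anchor on permalinks, then fall back to fuzzy title/excerpt comparison. A sketch only; the field names and the threshold are illustrative assumptions, not part of the question:

```python
# Sketch only: field names ("title", "excerpt", "body", "permalink") and the
# 0.8 threshold are illustrative assumptions.
import re
from html import unescape
from difflib import SequenceMatcher

def normalize(text):
    """Strip tags, decode entities, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text or "")
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip().lower()

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def match_entries(rss_entries, html_posts, threshold=0.8):
    """Pair each RSS entry with the HTML post it most likely corresponds to."""
    by_link = {p["permalink"]: p for p in html_posts if p.get("permalink")}
    matches = []
    for entry in rss_entries:
        # 1. Permalinks are unique and identical on both sides: strongest anchor.
        post = by_link.get(entry.get("permalink"))
        if post is None:
            # 2. Fall back to fuzzy title match and excerpt-vs-body-prefix match.
            best, best_score = None, 0.0
            for cand in html_posts:
                score = similarity(normalize(entry.get("title", "")),
                                   normalize(cand.get("title", "")))
                excerpt = normalize(entry.get("excerpt", ""))
                if excerpt:
                    prefix = normalize(cand.get("body", ""))[:len(excerpt)]
                    score = max(score, similarity(excerpt, prefix))
                if score > best_score:
                    best, best_score = cand, score
            post = best if best_score >= threshold else None
        if post is not None:
            matches.append((entry, post))
    return matches
```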

I would create a scraper for each of the major blogging engines, starting with the main text of a single post per page.
If you're lucky then the engine will provide reasonable XHTML, so you can come up with a number of useful XPath expressions to get the node which corresponds to the article. If not, then I'm afraid it's TagSoup or Tidy to coerce it into well-formed XML.
From there, you can look for the metadata and the full text. This should safely remove the headers/footers/sidebars/widgets/ads, though it may leave embedded objects, etc.
It should also be fairly easy (TM) to segment the page into article metadata, text, comments, etc., and put it into a fairly sensible RSS/Atom item.
This would be the basis of taking an RSS feed (non-full-text) and turning it into a full-text one (by following the permalinks given in the official RSS).
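As a rough illustration of the XPath-driven scraper idea (the class names per engine below are guesses for illustration, not real selectors for any particular engine version):

```python
# Illustrative only: the XPath per engine is a guessed selector, not a real
# mapping for any specific engine version.
import lxml.html
import requests

ENGINE_XPATHS = {
    "wordpress": "//div[contains(@class, 'entry-content')]",
    "blogger":   "//div[contains(@class, 'post-body')]",
}

def scrape_full_text(permalink, engine):
    """Fetch a permalink and pull the article node with the engine's XPath."""
    html = requests.get(permalink, timeout=10).content
    doc = lxml.html.fromstring(html)
    nodes = doc.xpath(ENGINE_XPATHS[engine])
    return nodes[0].text_content().strip() if nodes else None
```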
Once you have a scraper for a blog engine, you can start looking at writing a detector - something that answers "given a page, which blog engine was it published with?".
With enough scrapers and detectors, it should be possible to point the system at a given RSS/Atom feed and convert it into a full-text feed.
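A detector can start from something as cheap as the generator meta tag that many engines emit; a minimal sketch (illustrative only, a real detector would also fingerprint URLs, markup patterns, and engine-specific comments):

```python
# Illustrative sketch: many engines advertise themselves in a generator
# meta tag, but this is only a first pass and easily absent or spoofed.
import lxml.html

KNOWN_ENGINES = ("wordpress", "blogger", "typepad", "movable type", "drupal")

def detect_engine(html_bytes):
    doc = lxml.html.fromstring(html_bytes)
    generators = " ".join(doc.xpath("//meta[@name='generator']/@content")).lower()
    for engine in KNOWN_ENGINES:
        if engine in generators:
            return engine
    return None  # unknown engine: fall back to markup fingerprinting
```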
However, this approach has a number of issues:
while you may be able to target the big five blog engines, there may be some blogs you simply have to cover that aren't built on them: there are 61 engines listed on Wikipedia, for example, and people who write their own blogging engines each need their own scraper.
each time a blog engine changes versions, you need to change your detectors and scrapers. More accurately, you need to add a new scraper and detector. The detectors have to become increasingly fussy to distinguish between one version of the same engine and the next (e.g. every time Slashcode changes, it usually changes the HTML, but different sites use different versions of Slash).
I'm trying to think of a decent fallback, but I'll edit once I have.

RSS is actually quite simple to parse with XPath or any XML parser (or regexes, but that's not recommended): you go through the <item> tags, looking for <title>, <link>, and <description>.
You can then store them as separate fields in a database, or directly merge them into HTML. If the <description> is missing, you could scrape the link (one way would be to compare multiple pages to weed out the layout parts of the HTML).
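A minimal sketch of that parsing step with the standard library (plain RSS 2.0 assumed; Atom and namespaced feeds would need extra handling):

```python
# Plain RSS 2.0 with the standard library; Atom and namespaced feeds need more care.
import xml.etree.ElementTree as ET

def parse_rss(xml_text):
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        entries.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        })
    return entries
```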

Related

Laravel string limit with HTML tags shows fewer items than the actual number of items

I am using Laravel and Blade to loop over some blog items. I want to show the blog body with its HTML tags and also limit it with the str_limit function.
When I try
{!!str_limit($blog->body, 450)!!}
It only shows 10 or 11 blogs out of 22. It should show all items.
If I use {{ str_limit($blog->body, 450) }} it shows all of them, but the HTML tags have no effect.
str_limit has no knowledge of HTML tags, so using it on a string that contains HTML will often result in an unclosed tag that breaks the rest of the page.
As an example, an excerpt that ends in <a href="http://google.co because it got lopped off there means the rest of your page is part of the <a> tag until you accidentally output a " and a > again.
A couple options are available to you:
Strip the HTML tags. I know you wanted to preserve them, but this remains the easiest way of generating an excerpt.
Output the entire body, but give it a max-height and overflow: hidden to hide the rest. This has bandwidth downsides, so if your posts are enormously long, it may not be the best approach.
Produce your own excerpts as a separate field. Manual work, but you're always in control that way.
Find or code an HTML-aware excerpt generator. I'm not aware of a good one I can recommend - it's a complicated problem. You could try generating a str_limit'd string and then running the result through Tidy, which can sort of fix invalid HTML.
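Language aside, the core of an HTML-aware excerpt generator is to count only visible text while tracking open tags, then close whatever is still open. A rough Python sketch of the idea (illustrative only; a PHP version would follow the same logic):

```python
# Rough sketch of the idea, not a drop-in replacement for str_limit.
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Excerpter(HTMLParser):
    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit, self.count = limit, 0
        self.parts, self.open_tags = [], []

    def handle_starttag(self, tag, attrs):
        if self.count >= self.limit:
            return
        attr_str = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
        self.parts.append(f"<{tag}{attr_str}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.count < self.limit and tag in self.open_tags:
            # Close the most recently opened matching tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.count >= self.limit:
            return
        take = data[: self.limit - self.count]
        self.count += len(take)
        self.parts.append(take)

def html_excerpt(html, limit=450):
    parser = Excerpter(limit)
    parser.feed(html)
    # Close anything still open so the excerpt cannot break the rest of the page.
    return "".join(parser.parts) + "".join(f"</{t}>" for t in reversed(parser.open_tags))
```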

algorithm to find 'article' in webpage?

Some browser plugins, like Readability, can extract the 'article' from a webpage. Does anyone have an idea about how to do it? What's the difference between the real article and ads or comments?
Well, it depends how you want to define "real articles"...
Taking HTML5 into consideration, a webpage is constructed of semantic tags. Pages no longer have to be built with elements like <div> that have no semantic meaning at all. In HTML5 you may use <section>, <article>, <header> and so on. Those elements can give an application a pretty good sense of what the main content of a webpage is (e.g. print <article>s and skip <nav>s...)
Of course, not many pages use those tags yet. Furthermore, the tags might get abused and lose their meaning. In that case I'd stick to some statistics, e.g. selecting the largest elements in an HTML document. Moreover, if you have to scrape a webpage, you could use a modification of some pattern-matching algorithm, DIPRE for instance.
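A sketch of the statistics-based fallback (prefer a semantic tag when present, otherwise score blocks by how much text their direct <p> children carry; the scoring is a crude illustration, not a published algorithm):

```python
# Heuristic sketch: semantic tag first, then a crude readability-style
# score (text carried by an element's direct <p> children).
from bs4 import BeautifulSoup

def main_content(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    article = soup.find("article") or soup.find("main")
    if article is not None:
        return article.get_text(" ", strip=True)

    def score(el):
        return sum(len(p.get_text(strip=True))
                   for p in el.find_all("p", recursive=False))

    blocks = soup.find_all(["div", "section", "td"])
    best = max(blocks, key=score, default=None)
    target = best if best is not None and score(best) > 0 else soup
    return target.get_text(" ", strip=True)
```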

How to retrieve plain text from a formatted website to use in UIWebView

Not sure if what I want to do is possible, but what I am hoping to do is somehow gather certain pieces of text from a website, remove the header, footer, background, all formatting, and place it into my application in a scrollview or something similar...
I'll give you an example... Imagine I was making Wikipedia's iPhone app: I want to download the information from the wiki article on dogs, without the header, sidebars, etc., just the text. How would I go about doing this?
I understand that for this I have not provided any example code or what I've tried or started, but that's just because in this case I'm lost! That doesn't mean I want full chunks of code either. Any help will do. If this doesn't work, I will just have to make a 'mobile optimised' version of the webpages I want to include in my app.
Thanks
(Edit: the term I was trying to use was 'strip the web page of its HTML coding')
You may be going about this the wrong way, or perhaps even asking the wrong question.
Does the target website have an API or datafeed of some kind?
Can you get the information you need in JSON or XML format directly from the site?
I think you've misunderstood the technology. HTML is merely the framework on which the formatting and data are hung.
Parsing the HTML page seems like an awfully big headache, and I doubt you'll ever be able to get it to work reliably, because almost all sites these days are partially or wholly generated on the server side; the page is only the result.
Some sites keep the information in memory and others fetch it dynamically through Ajax, for example, which means that simply trying to get the data by parsing the HTML will get you zero data.
Another issue you should be aware of, though, is that simply copying the data from generated websites may open you up to copyright issues.
You have to parse the HTML code, search for the part that you want, and "throw away" the parts that you do not need. This is more or less brute force, and the code of the website must not change, otherwise you are stuck; with this method you have to write the parser by hand. But maybe there is an Atom or RSS feed you can parse instead. That will be much easier, and you are not depending on the website layout, because the RSS/Atom feed is just the data. For parsing RSS you could try out NSXMLParser.
You then have to make a valid HTML page out of the data and present it in the UIWebView.

Web page summary with Ruby

Can anyone recommend a Ruby library for creating a summary of a given URL? What I have in mind is the sort of one- or two-sentence summary as seen in search engine results.
You could just scrape the web page for either the description meta tag or, if that's not available, the first few sentences from the first <p> element on the page. The description meta tag looks like this:
<meta name="description" content="Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support." />
There are several Ruby libraries for parsing HTML. I hear that Nokogiri is good for this sort of stuff, but I have no experience with it personally.
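The two-step fallback (meta description, else the first paragraph) is only a few lines in any of those libraries; sketched here in Python for brevity, with the caveat that the equivalent Nokogiri calls in Ruby read much the same:

```python
# Python for brevity; the equivalent Nokogiri calls in Ruby are one-liners too.
from bs4 import BeautifulSoup

def page_summary(html, max_chars=200):
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content"):
        return meta["content"][:max_chars].strip()
    first_p = soup.find("p")
    return first_p.get_text(" ", strip=True)[:max_chars] if first_p else ""
```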
Spidering a site and scraping pages is easy. Summarizing a page is difficult.
The metatags can help a little, as there is supposed to be a direct correlation between the summary and the content.
Unfortunately, not all pages have them, and many that do are inaccurate. That leaves us with having to scrape text, hoping that it's pertinent to the content and context. Page layouts vary and there is no standard saying where on a page the main content actually lies and, because of CSS and Ajax, it might not be where we'd expect it, in the first couple lines of text. There might not be <p> tags, as a <div> or <span> with the appropriate CSS can replace the look.
I've written many spiders that did contextual analysis of the pages, trying to summarize, and it's ugly and not bullet-proof, especially when dealing with the English language because of homonyms, synonyms, and other "nyms" that get in the way.
If you can locate text to summarize, there are decent tools to reduce several paragraphs, or a paper, into a short sentence. Mac OS comes with a summarizer, and has for years. "Summarize Text Using Mac OSX Summarize Or Microsoft Word AutoSummarize" talks about enabling it if you want to experiment. "Mac 101: Shorten text using the Summarize Service" is about using it on the Mac. There's a driver or app for it that can be called from the CLI. See "How to use Mac OS X's Summary Service on the command line?" for more info.
And, as a demo, here's Lincoln's Gettysburg address summarized to one line:
It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
First, if you need to parse a web page, I would use HTMLAgilityPack to transform it to XML. It will speed everything up and will enable you, using a simple XPath expression, to go directly to the BODY.
After that, you run over all the divs (you can get all the DIV elements in a list from the Agility Pack) and get whatever you want.
There's a simple technique to do this, based on analysing how "noisy" the HTML is, i.e., what the ratio of markup to displayed text is through an HTML page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique, giving some Python code to illustrate.
Cf. also the HTML::ContentExtractor Perl module, which implements this idea. It would make sense to clean the HTML first (e.g. with BeautifulSoup) if you wanted to use this.
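A toy version of the markup-to-text density idea (not the article's code; the thresholds are arbitrary illustrations):

```python
# Toy version of the density heuristic: keep lines of the raw HTML where
# most of the characters survive tag stripping. Thresholds are arbitrary.
import re

def dense_text(html, min_density=0.5, min_chars=40):
    kept = []
    for line in html.splitlines():
        visible = re.sub(r"<[^>]+>", "", line).strip()
        if not visible:
            continue
        density = len(visible) / max(len(line.strip()), 1)
        if density >= min_density and len(visible) >= min_chars:
            kept.append(visible)
    return "\n".join(kept)
```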
I would recommend Vit Baisa's thesis on Web Content Cleaning; I think he has some code too, but I can't find a link for it. There is also a discussion of the very same problem on the natural language processing LingPipe blog.
