Some browser plugins, like Readability, can extract the "article" from a webpage. Does anyone have an idea of how this is done? What distinguishes the real article from the ads or comments?
Well, it depends how you want to define "real articles"...
Taking HTML5 into consideration, a webpage is constructed of semantic tags. Pages no longer have to be built solely from elements like <div> that carry no semantic meaning; in HTML5 you can use <section>, <article>, <header> and so on. Those elements give an application a pretty good sense of what the main content of a webpage is (e.g. print the <article>s and skip the <nav>s...)
Of course, not many pages use those tags yet. Furthermore, the tags can be abused and lose their meaning. In that case I'd fall back on statistics, e.g. selecting the largest elements in an HTML document. Moreover, if you have to scrape a webpage, you could use a modification of a pattern-matching algorithm, DIPRE for instance.
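For a concrete illustration, here is a minimal sketch of that combined approach in Python with BeautifulSoup (the function name and the exact fallback heuristic are my own; real extractors like Readability use much more elaborate scoring):

from bs4 import BeautifulSoup

def extract_main_content(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # Strip elements that are boilerplate by definition in HTML5.
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    # Prefer the semantic containers when the page provides them...
    main = soup.find("article") or soup.find("main")
    if main is not None:
        return main.get_text(" ", strip=True)
    # ...otherwise fall back to the statistical heuristic: pick the block
    # holding the most text (naive, since an outer wrapper can win; real
    # extractors score candidates more carefully).
    blocks = soup.find_all(["div", "section"])
    if not blocks:
        return soup.get_text(" ", strip=True)
    return max(blocks, key=lambda b: len(b.get_text())).get_text(" ", strip=True)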
CSS selectors are parsed right to left, and the matching elements are then rendered.
With this in mind, based on this code:
<a href="#" class="myImage">
<img>
</a>
Which is more performant:
.myImage img
or
.myImage img:only-child
Does :only-child help specificity in selector selection? That is, instead of initially matching every <img> in the document and then looking for the parent class, does it only match <img>s that are an only child (thereby reducing the pool of candidates)?
reference read: http://csswizardry.com/2011/09/writing-efficient-css-selectors/
EDIT
Found further reading:
"The sad truth about CSS3 selectors is that they really shouldn’t be used at all if you care about page performance. Decorating your markup with classes and ids and matching purely on those while avoiding all uses of sibling, descendant and child selectors will actually make a page perform significantly better in all browsers."
-David Hyatt (architect for Safari and WebKit; also worked on Mozilla, Camino, and Firefox)
Source: http://www.stevesouders.com/blog/2009/03/10/performance-impact-of-css-selectors/
Alright, so this depends completely on the use case, so I'll work through the cases one at a time.
Point 1 - Concerning the given example
Case A
Say your code is simply:
<a href="#" class="myImage">
<img>
</a>
In this case .myImage img and .myImage img:only-child will have essentially identical performance hits. The reason is as follows:
CSS performance hits come from having to style or re-style boxes. The cost that matters lands in rendering rather than on the CPU, so we only care about visual updates. And styling things again with duplicate properties DOES cause a redraw (by this I mean duplicate properties that are applied at a later time, which selectors generally cause).
Usually, .myImage img would style ALL the <img>s inside .myImage, while the :only-child version may (obviously) not match all of them.
But in this example, both .myImage img and .myImage img:only-child would do identical things, since they would both cause 1 draw/redraw on the same box.
Case B
But suppose we have:
<a href="#" class="myImage">
<img>
<img>
</a>
Here though, we have a different story.
In this example, :only-child wouldn't match anything at all, since neither <img> is an only child.
Case C
Finally, suppose you have this, but you only want to style the <img> under the <a>:
<a href="#" class="myImage">
<img>
</a>
<div class="myImage>
<img>
<img>
</div>
In this case, using the :only-child selector would be significantly better performance-wise, since you will style only one element instead of three.
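As a side note, you can sanity-check which elements each selector matches outside the browser with Python's lxml and cssselect (this mirrors matching only, not paint behaviour; the markup below combines Cases A and C):

# pip install lxml cssselect
from lxml import html

page = html.fromstring("""
<a href="#" class="myImage"><img src="a.png"></a>
<div class="myImage"><img src="b.png"><img src="c.png"></div>
""")

print(len(page.cssselect(".myImage img")))             # 3: every image
print(len(page.cssselect(".myImage img:only-child")))  # 1: only the <a>'s image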
Conclusion and Takeaway
Basically, remember a few things.
CSS selectors help you write code because you get to add fewer IDs and classes; however, they are almost always less efficient than using IDs and classes everywhere, because selectors cause extra redraws. (You'll inevitably style multiple elements with some selector and then re-style other things with IDs or classes, causing unnecessary redraws.)
:only-child is NOT faster than a plain .class descendant selector if that .class element has only one child.
Technically what you are comparing are two completely different things, used for different purposes.
Conclusion
To answer your question directly: in the example you've shown, neither is more efficient than the other, since they cause identical redraws.
However, as far as "Does :only-child help specificity in selector selection?" goes: yes, that's the point of the selector. See: http://www.w3schools.com/cssref/sel_only-child.asp
Sources:
I am a significant volunteer developer for Mozilla Thunderbird and the Mozilla project and work in the CSS area.
FYI
There are of course weird exceptions to this, but I won't go over them here since your exact question doesn't touch on them.
Point 2 - Concerning the speed of finding selectors
I am purposely trying to drive home the point that it is the drawing that causes the CSS performance hit, not finding the selectors. I say this not because finding selectors takes no time, but because its time is minuscule compared to the time spent drawing. That said, if you had 5,000 <div>s or so and attempted to style a few using pseudo-selectors, it would definitely take a little longer than using CSS classes and IDs.
Again though, in your example it would make no difference, since the engine would look through each element anyway. The reason :only-child can help performance is that the engine can stop considering an element as soon as it finds a second child, whereas a plain class-descendant selector cannot.
The problem with pseudo-selectors is that they usually require a lot of extra searching, AND that searching happens after IDs, classes, and tags are matched.
The links you provided are actually very helpful and I'm surprised they didn't answer your question.
One thing I should point out is that many selectors believed to be slow may be vastly improved now. See: http://calendar.perfplanet.com/2011/css-selector-performance-has-changed-for-the-better/
Remember that selector performance only really matters on very large pages.
In the past, I created some divs to act like articles. Now I am thinking about changing them to the HTML5 <article> tag...
Is there an important difference (in terms of efficiency) between using semantic HTML elements and using equivalent divs created by the user?
For example: will the browser load pages faster if they are built only with semantic HTML elements?
Short answer: No.
Long answer: maybe, if it decreases the amount of markup you use. But not likely.
The benefit of using semantic tags is to add more meaning to the markup, not improve performance.
Maybe. When you create a div and add styling to it, the browser needs to first interpret the element and then process the style over it and render it. If you use the appropriate HTML element, it may put slightly less of a burden on the rendering engine.
Can anyone recommend a Ruby library for creating a summary of a given URL? What I have in mind is the sort of one- or two-sentence summary as seen in search engine results.
You could just scrape the web page for the description meta tag or, if that's not available, the first few sentences of the first <p> element on the page. The description meta tag looks like this:
<meta name="description" content="Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser with XPath and CSS selector support." />
There are several Ruby libraries for parsing HTML. I hear that Nokogiri is good for this sort of thing, but I have no personal experience with it.
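The question asks for Ruby, but here is the fallback logic sketched in Python for illustration (Nokogiri's at_css and attribute lookups map onto this almost one-to-one):

# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def summary_for(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Prefer the page's own description meta tag...
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content"):
        return meta["content"].strip()
    # ...otherwise fall back to the opening of the first paragraph.
    p = soup.find("p")
    return p.get_text(" ", strip=True)[:200] if p else ""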
Spidering a site and scraping pages is easy. Summarizing a page is difficult.
The meta tags can help a little, as there is supposed to be a direct correlation between the summary and the content.
Unfortunately, not all pages have them, and many that do are inaccurate. That leaves us having to scrape text, hoping it's pertinent to the content and context. Page layouts vary, and there is no standard saying where on a page the main content actually lies; because of CSS and Ajax, it might not be where we'd expect, in the first couple of lines of text. There might not even be <p> tags, since a <div> or <span> with the appropriate CSS can produce the same look.
I've written many spiders that did contextual analysis of the pages, trying to summarize, and it's ugly and not bullet-proof, especially when dealing with the English language because of homonyms, synonyms, and other "nyms" that get in the way.
If you can locate text to summarize, there are decent tools to reduce several paragraphs, or a paper, into a short sentence. Mac OS comes with a summarizer, and has for years. "Summarize Text Using Mac OSX Summarize Or Microsoft Word AutoSummarize" talks about enabling it if you want to experiment. "Mac 101: Shorten text using the Summarize Service" is about using it on the Mac. There's a driver or app for it that can be called from the CLI. See "How to use Mac OS X's Summary Service on the command line?" for more info.
And, as a demo, here's Lincoln's Gettysburg address summarized to one line:
It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.
I want to retrieve the XPath of an attribute (for example, the "brand" of a product on a retailer's website).
One way of doing it is to use Firefox add-ons like XPather or XPath Checker: open the website in Firefox and right-click the desired attribute. This is OK, but I want to capture this information for many attributes, and right-clicking each and every one may be time-consuming. The other problem I have is that some attributes I may be interested in exist for one product, while other attributes exist only for some other product, so I would have to go to that product and do it all manually again.
Is there an automated or programmatic way of retrieving the XPath of the desired attributes from a website, rather than having to do this manually?
Note that not all websites produce valid XML that you can run XPath against...
That said, you should check out HTML parsers that will let you use XPath on HTML even when it is not valid XML.
Since you did not specify the technology you are working with, I'll suggest the .NET HTML Agility Pack; if you need others, search for questions dealing with this here on SO.
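To illustrate the idea (sketched in Python with lxml rather than the Agility Pack, since the principle is the same), an HTML parser repairs tag soup into a tree you can query with XPath:

from lxml import html

tree = html.fromstring("<div><p>Brand<p>Acme")  # unclosed tags: not valid XML
print(tree.xpath("//p/text()"))                 # ['Brand', 'Acme']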
The solution I use for this kind of thing is to write an xpath something like this:
//*[text()="Brand"]/following-sibling::*
//*[text()="Color"]/following-sibling::*
//*[text()="Size"]/following-sibling::*
//*[text()="Material"]/following-sibling::*
It works by finding all elements (labels) with the text you want and then looking at the next sibling in the HTML. Without a specific URL to look at, I can't help any further.
This is a generalised version; you can make more specific versions by replacing the asterisks with tag types, and you can navigate differently by replacing the following-sibling axis with another one.
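For example, here is how those expressions might be driven from Python with lxml (the markup is hypothetical; the real page dictates the exact expressions):

from lxml import html

page = html.fromstring("""
<table>
  <tr><td>Brand</td><td>Acme</td></tr>
  <tr><td>Color</td><td>Red</td></tr>
</table>
""")

for label in ("Brand", "Color", "Size"):
    hits = page.xpath('//*[text()="%s"]/following-sibling::*' % label)
    if hits:
        print(label, "->", hits[0].text_content().strip())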
I use XPaths in import.io to make APIs for this kind of thing all the time. It's just a matter of finding an XPath that's generic enough to find the HTML no matter where it is on the page, yet specific enough to get the right data.
I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
First, if you need to parse a web page, I would use the HTML Agility Pack to transform it into XML. It will speed everything up and, using a simple XPath, let you go directly to the BODY.
After that, you can run over all the divs (the Agility Pack can give you all the DIV elements in a list) and take whatever you want.
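That loop might look like the following in Python with lxml (a rough sketch; the Agility Pack version is structurally the same):

from lxml import html

def biggest_div(doc):
    tree = html.fromstring(doc)

    def own_text_len(div):
        # Count text that is not inside a nested div, so an outer
        # wrapper can't win simply by containing everything.
        total = len((div.text or "").strip())
        for child in div:
            if child.tag != "div":
                total += len(child.text_content())
            total += len((child.tail or "").strip())
        return total

    return max(tree.xpath("//div"), key=own_text_len, default=None)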
There's a simple technique for this, based on analysing how "noisy" the HTML is, i.e., the ratio of markup to displayed text across the page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique, giving some Python code to illustrate.
Cf. also the HTML::ContentExtractor Perl module, which implements this idea. If you want to use this approach, it would make sense to clean the HTML first, e.g. with BeautifulSoup.
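A toy version of that noise-ratio score (my own simplification, not the linked article's code) rates each block by visible text length relative to serialized markup length:

from lxml import html, etree

def text_density(el):
    markup_len = len(etree.tostring(el))
    return len(el.text_content()) / markup_len if markup_len else 0.0

def least_noisy_block(doc):
    tree = html.fromstring(doc)
    return max(tree.xpath("//div | //td | //section"),
               key=text_density, default=None)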
I would recommend Vit Baisa's thesis on Web Content Cleaning; I think he has some code too, but I can't find a link to it. There is also a discussion of this very problem on the natural language processing LingPipe blog.