Content detection algorithms

I am trying to reproduce the "content-detection" of webpages done by Clearly.
Given a webpage, I want to automatically distinguish text content, as opposed to text menus, text ads, text buttons, etc.
What algorithms are suited to detect text content from HTML pages?
[In the case of StackOverflow, the content would be the actual questions. All the rest is just "fluff around the content".]

You will probably want to take a look at Readability's algorithm.
See the question "What algorithm does Readability use for extracting text from URLs?"
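To make the idea concrete, here is a minimal sketch of one common heuristic for this kind of problem: score each block of text by its length and by how much of it sits inside links. This is not Readability's or Clearly's actual algorithm; the function name extract_content, the tag list, and the thresholds min_chars and max_link_density are all assumptions chosen for illustration, and it uses only the Python standard library.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption, tune for your pages).
BLOCK_TAGS = {"p", "div", "li", "td", "article", "section"}


class BlockCollector(HTMLParser):
    """Collects text per block and counts how many characters sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._link_depth = 0
        self._skip_depth = 0
        self._start_block()

    def _start_block(self):
        self.blocks.append({"text": [], "link_chars": 0})

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self._start_block()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self._start_block()

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style contents
        text = " ".join(data.split())
        if not text:
            return
        block = self.blocks[-1]
        block["text"].append(text)
        if self._link_depth:
            block["link_chars"] += len(text)


def extract_content(html, min_chars=80, max_link_density=0.3):
    """Keep blocks that are long enough and are not mostly link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        text = " ".join(block["text"])
        if not text or len(text) < min_chars:
            continue  # short blocks are usually buttons, labels, menu items
        if block["link_chars"] / len(text) > max_link_density:
            continue  # link-heavy blocks are usually navigation or ads
        kept.append(text)
    return kept
```

Menus and ad strips tend to be short or almost entirely link text, while article bodies are long with few links, which is why these two signals go a long way. Real extractors add more signals (class names, punctuation counts, scores propagated to parent nodes), so treat this as a starting point only.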

Related

Splitting Ruby strings into pages

I've been thinking about this problem for a while, and I'm not quite sure of the best way to go about it.
In a Rails app I have books, which have many chapters, which have many sections. Chapters are basically just containers for sections, though they may contain strings of text themselves. The sections hold most of the book text.
I'm planning to build an HTML5 ebook reader that works in a mobile browser, and I don't want the user to have to scroll down; I want the text to break at the end of the page.
I'd assumed using split might be the way to go, but I'm not sure there's a way to break at regular intervals. Would a JavaScript option work better here?
I'd looked at the question "Dividing text article to smaller parts with paging in Ruby on Rails", but I can't feasibly insert manual break marks in the texts, some of which are 90,000+ words.
Any ideas would be appreciated.
I think the main problem here is that the page length will depend on the device (and possibly the text size, if that is a feature of your app). You should probably send large chunks at a time, each sure to be at least, say, 5 pages long, and then let the JavaScript do the paging. Rails has no access, nor should it, to the size of the display.
Text requires very little data, so you shouldn't worry about transmitting more than you need or keeping too much in memory.
You may use a blank line ("\n" or "") as the separator.
I'd send enough of the page content down to easily fill a page and more, then use JavaScript on the client side to remove sentences from the page until the scroll bar disappears.
Resize.js is something similar I wrote a while ago. I wanted to enlarge/reduce the font size used on a screen until the screen was just full (for a dashboard monitor). Yours would be similar, but instead of changing the font size, you are trimming off sentences.
Let me know if you can't see how to adapt this code.
Note: I would also make the javascript note the amount of text it ends up displaying, and pass that to the server in the 'next page' request, so the server knows where to start the next page from.
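The server side of that exchange could look roughly like the sketch below. It is written in Python for illustration rather than Ruby, and next_chunk, min_chars, and the 4000-character default are hypothetical names and values, not part of any framework. The idea is only this: hand out at least min_chars of text starting at the offset the client reports, extended to the next blank line so no paragraph is cut.

```python
def next_chunk(text, start, min_chars=4000):
    """Return (chunk, next_start).

    The chunk holds at least min_chars characters from `start` (or whatever
    is left), extended to the next blank line so paragraphs stay whole.
    """
    if start >= len(text):
        return "", len(text)
    end = min(start + min_chars, len(text))
    break_at = text.find("\n\n", end)  # next paragraph break at or after `end`
    end = len(text) if break_at == -1 else break_at + 2
    return text[start:end], end


book = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunk, _ = next_chunk(book, 0, min_chars=10)
# The client trims the chunk to fit one screen and reports how many
# characters it actually displayed; the next request starts from there.
displayed = len("First paragraph.\n\n")  # value reported by the client
chunk2, _ = next_chunk(book, 0 + displayed, min_chars=10)
```

Because the client reports its own offset, the server never needs to know the screen size; it only has to make sure each chunk is generously larger than one page.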

How to identify tags (key words) automatically from a given text?

It should behave like the Delicious toolbar for Firefox, which lists possible tags to click.
The code should be able to find key words for the text. Any good algorithm or open source project to recommend?
I found this post, but it is a bit too general for my specific need.
I think you're looking for one of these answers:
tag generation from a text content
How to extract common / significant phrases from a series of text entries
tag generation from a small text content (such as tweets)
In a nutshell - you're looking to extract unigrams from the text that somehow represent the concepts within it. One technique for this is called Pointwise Mutual Information, which is illustrated with an example in the first two links. The Python NLTK framework (which already has a bunch of these algorithms built in) might be your best starting point.
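For a feel of what Pointwise Mutual Information does, here is a small standard-library sketch. PMI is defined over pairs, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), so this version scores adjacent word pairs as candidate tags. The function name pmi_bigrams, the tokenizing regex, and the min_count cutoff are assumptions made for the example; NLTK's collocation tools cover the same ground more thoroughly.

```python
import math
import re
from collections import Counter


def pmi_bigrams(text, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information."""
    words = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_words = len(words)
    n_bigrams = max(len(words) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for pairs seen only once
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_words
        p_y = unigrams[w2] / n_words
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sample = (
    "Machine learning is fun. Machine learning is useful. "
    "The cat is here. The dog is there."
)
ranked = pmi_bigrams(sample)
```

Pairs that occur together more often than their individual frequencies would predict get a high score, which is why ("machine", "learning") ranks above ("learning", "is") in the sample. On real text you would also want to drop stop words and require a higher min_count.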
Good luck!

Algorithm for placing textual/non-textual content in a book layout

Hey guys, I am looking for different approaches/algorithms for placing textual/non-textual content in a book layout having 2 sides. Essentially it should look like the user is reading a book, with the content placed in a 2-page layout.
Do you guys have any directives or suggestions on how to go about doing this? I need a way to decide how many content items can fit into 2 pages with no overflow. Suppose a page is 425 px by 600 px and we have 2 such pages side by side (dimensions are flexible).
Any pointers appreciated.
P.S. I know this is not a pure programming question per se but more of an algorithmic question. If so, please direct me where this question can be best asked.
EDIT 1
I want to use this algorithm in a website application & not in a standalone app, so please consider that.
EDIT 2
I would like to mention that the order of the content items is pre-decided.
If your goal is to display data in a book-like format, then the easiest method would be to reuse an already existing toolkit for doing text layout. I think the best tool for this purpose would be LaTeX, which is built on TeX, Donald Knuth's digital typesetting system.
In order to use it you will have to convert your data into the LaTeX format, which is relatively painless (I have done it several times with several types of data). In the LaTeX document you can specify that you want a book format, how large the pages are, and much more. You can then render the text to pdf/ps and then display the two pages of a "book" side by side.
If what you are looking for is the actual algorithms to do it yourself, you might search around the TeX/LaTeX community for information.
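If you do end up rolling your own, the simplest approach for items in a fixed order is greedy: keep adding items to the current page until the next one would overflow, then start a new page, and group pages in pairs. The sketch below assumes every item's rendered height is already known (in a web app you would have to measure it in the browser first) and that no single item is taller than a page; paginate and its parameters are names made up for this example. It is far cruder than what TeX does.

```python
def paginate(item_heights, page_height=600, pages_per_spread=2):
    """Place items in order, never splitting one across pages.

    Returns a list of spreads; each spread is a list of pages; each page
    is a list of item indexes.
    """
    pages = [[]]
    remaining = page_height
    for index, height in enumerate(item_heights):
        if height > page_height:
            raise ValueError(f"item {index} is taller than a page")
        if height > remaining:
            # Item would overflow: start it at the top of a new page.
            pages.append([])
            remaining = page_height
        pages[-1].append(index)
        remaining -= height
    # Group consecutive pages into two-page spreads.
    return [
        pages[i:i + pages_per_spread]
        for i in range(0, len(pages), pages_per_spread)
    ]


spreads = paginate([300, 200, 250, 600, 100], page_height=600)
# spreads == [[[0, 1], [2]], [[3], [4]]]
```

Greedy placement can leave large gaps at the bottom of a page when a tall item follows a short one. Since the order is fixed, the usual remedies are to allow text items to split or to scale images rather than to reorder.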

How to make a non-copyable HTML page like Google Books

I am curious whether it is possible to copy books from Google Books. I would also like to know how to make that kind of non-copyable material myself.
I suppose the best way is to convert the text pages to images. You'd still be able to capture the images, but they wouldn't be in text form anymore; to get them back in their original form, you'd have to OCR them, which is an arduous process.

PDF generation under Ruby - block should not be cut by a page break

My PDF consists of a number of blocks (actually, a list of quotations) that go one after another till the end of the document. If the text of a quotation does not fit on the page, the whole quotation should start from the top of the next page instead of being torn apart. How can I implement that with any Ruby library?
Try PrinceXML - this is a standalone executable that generates PDF out of HTML or XML. It supports a lot of special CSS properties that will even help you to control page breaks. Refer to http://www.princexml.com/doc/6.0/page-breaks/
This application is available for Windows and Linux. I used it to generate pretty complicated PDF documents with headers and footers on every page except the first one. And since you don't need to output a PDF with precise positioning of elements, it might be a perfect solution for you.
I haven't tried it, but in Prawn I would try using either the Document#text_box method or looking up the table methods and putting your text in cells with invisible borders. The documentation's unclear on how page break functionality fits in with the bounding box models, but it's worth a shot.
HTMLDoc, which converts HTML to PDF, has a page break facility.
