Getting most relevant content from page - xpath

I need to create a universal web scraper to parse articles on the different websites. Of course, I know about XPath, but I want to try to make it universal for any website despite the HTML markup of a page.
I need to determine whether there is an article on the page and if it is - parse a text of title, body and tags (if exists).
Frankly speaking, my knowledge in DS is not very huge, but I assume this task (determine whether it is article, and parsing only needed parts) is possible to solve.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar that google chrome mobile does. When page is not optimised for mobile, then propose to show the page in adaptive mode (just title, and main content).

If you are using Python, some libraries to look at are:
scrapy, which scrapes data and can extract some of the results) and,
BeautifulSoup, which is more geared towards the extraction part itself.
It is possible to request a version of a website (e.g. for Chrome, Safari, Mobile, old-school systems) by creating a custom header for your scraper.
HAve a look at the relevant documentation, and you can get an idea of how to use headers in scrapy here.
I do not know of any more specialised tools. Your tasks are more analytical and are typically not performed with the use of models for estimating e.g. what content is where on a webpage. This might be an intersting research direction though; to see if you can create a model that generalises across many websites to extract the desired content.
That leads me on to my last point, which is to say that creating a single scraper that works for any website *containing your artile type) is not usually possible. People create websites differently, however they see fit, which means they also change them. This usually leads to a good scraper requiring constant updates as time (and developers) moves on.
EDIT:
Then if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model. For example, a typical LSTM model is given a parameter that tells it how far to look back into the past. It is stored within its memory internally. In your case, you might be looking for a start and end HTML tag of an article, to then extract just that part. These tahs could be thousands of words apart. Something a standard LSTM might not be fit to retain and use.
If you could pose your problem a little differently, then there are other approaches that might be plausible. E.g., you could make it a "question-answer" problem, by saying: I have this HTML, where is the article content? If that sounds ok for your use-case, have a look here for some model based approaches.

Related

Is there an OSM XAPI tag/value list?

I'm new to OSM querying, but would like to query vector data for a large area. Thus I need to limit the results I would like to get by tagging the request.
http://www.informationfreeway.org/api/0.6/way[tag=value][bbox=x,y,z,j]
I'd like to filter for specific tag/values when querying for a way. Though I don't know which tags/values exist. Is there a list listing the most common of them?
You are approaching your problem from the wrong direction. The number of different tags is almost unlimited. According to taginfo there are currently 75 380 856 different tags. I'm pretty sure you are not interested in most of them. Likewise you are probably not even interested in many of the most common tags.
What data do you want to query?
The OSM wiki should be your starting point for generating a list of tags you are interested in. For a generic overview take a look at the map features. Are you interested in streets? Then visit at the highway key. Routing? Then take a look at the routing wiki page.
Always remember that these lists aren't complete. People can use any tag they like (but should use well-established tags whenever possible of course).
Also consider using Overpass API instead of XAPI. Overpass API is much more powerful.

Generate PDF from content in Magnolia CMS

We would like to generate a PDF document for a single page. While only this link talks about this subject (and the other discussion linked from there), the information given is quite slim.
Could anybody share any success stories made so far including source-code?
Has someone succeeded in using wkhtmltopdf?
(we plan to use Magnolia 4.5.6)
After evaluating both Aspose.pdf (commercial product) and iText, we went to use LaTex. We had quite some specific requirements (e.g. two column layout with footnotes, very large table), which were not possible with the two above mentioned products.
We are very happy with this solution, but there are some things to be noted: first and foremost you leave the JVM, and second LaTex is itself another macro language to be learned. The quality of the outcome is very good, although, and we are very happy with that solution.
wkhtmltopdf is used in another project, and the outcome is also good, for more straight forward formatting.

Parsing HTML in AppleScript

What's a good way to parse HTML in AppleScript?
I haven't dabbled in AppleScript in quite some time, and even when I did it was very minimal and uninvolved, so I don't really think naturally in the language quite yet. But I need to do some string manipulation and parse some HTML (basically some simple screen scraping).
Naturally, I'd like to avoid common pitfalls of HTML parsing. However, this is a temporary script and doesn't need to be particularly robust or supportable. I really just need to scrape specific substrings (from a known starting substring to the next known character) into a file.
I've done plenty of string manipulation in C# and similar languages, but AppleScript is an interesting change of pace to say the least. Can somebody point me to some good resources (Google searches on this subject seem to have a high noise-to-signal ratio), or help me out with some sample code snippets?
The ultimate goal of what I'm doing is to take a pre-determined list of pages, open each one in Safari (I'm doing everything through tell application "Safari"), parse out links which fit a certain pattern, and store all of those links in a file. Then go through that file, open each of those links, parse out more links which fit another pattern, and store all of those links in a file.
(The site is actually owned by someone we're working with, so don't worry about me violating any terms of service or anything like that. But for reasons outside the scope of this question, I'm doing some page scraping in AppleScript.)
I can't say enough good things about Matt Neuburg's AppleScript: the Definitive Guide. Without a doubt the most complete documentation of AppleScript ever done. Matt's also one of my favorite tech writers.
I would also check out this article. It contains a tutorial on how to do this; the example provided there parses HTML data from only one source, but I think it's worth looking at.

What is a good approach for extracting keywords from user-submitted text?

I'm building a site that allows users to make sense of a debate by graphically representing arguments for and against a particular issue. (Wrangl)
I'd like to categorise these debates so they are more easily found and connected. I don't want to irritate the person creating the debate by asking them to add tags and categories before they see any benefit, so I'm looking at a way of automatically extracting keywords.
What's a good approach for taking the debate's title and description (and possibly the content of the arguments themselves once there are some) to pull out, say, ten strong keywords that could be used as metadata to connect similar debates together, or even as the content of the "meta" keywords tag in the head of the HTML page where the debate is viewable. Eg. Datamapper vs ActiveRecord
The site is coded in Ruby with Sinatra, using DataMapper for data storage. I'm ideally looking for something which will work on Heroku (I don't have a way of writing files to disk dynamically), and I'd consider a web service, an API or ideally a Ruby gem.
Maybe you can use TextAnalyzer.
I understand that you're wanting to find an easy way of achieving this, I've recently dived into the world of NLP (Natural Language Processing) and Text-mining and its a daunting process of which most went far above my head.
Although i managed to code some functionality that resembles what you're looking for, though I did it in PHP. What i would suggest, that if you want it tailored to your project (Wrangl) then do it yourself.
Using the Porter stemming algorithm which I'm sure there will be Ruby code for.
Ruby Porter stemmer
You can try the salsaAPI to automatically extract keywords and categorize the debates!

Algorithm to find out if a website is a blog?

This is a creative one :-)
I'll be receiving a list of hundreds of new URLs regularly and want to find out if they are linking to a blog or not - between 80% and 95% accuracy would be sufficient.
Obviously I need to analyze the HTML of the page - but how exactly would you approach this (e.g. meta tags, structural analysis, pattern matching, machine learning ...)?
I would look at the generator <meta> tag for known blog editors. For example here's how it looks for Wordpress:
<meta name="generator" content="WordPress.com" />
Building on Darin's solution, I would look for the generator <meta> tag for known blog editors and combine it with a lookup table of common sites, ie. WordPress.com, Blogspot.com, Livejournal.com, and so forth. That should give you 80-95% in the near term, though it won't be robust enough for an ongoing process over an extended period of time.
An extended solution is much harder, given the amorphous definition of the term "blog". In which case, you'll want to consider breaking the list down into its hosting site and defining characteristics and create hard and fast rules on what constitutes a blog:
Is it hosted by a blogging service provider?
Is it listed in a blog aggregator, such as Technorati?
Does it include blog-like services, such as user-generated articles, tags, and the ability to comment?
Does it provide meta information that I can use to easily identify it as a blog?
Does it otherwise identify itself as a blog, via the inclusion of the term "blog" or some other criteria?
I can easily see a neural network constructed to determine if a page is a blog or not, but this serverely oversteps the bounds of your requirements. I'd say start simple, then extend your solution relative to the proposed lifetime of your system.
The above suggestions are good, and probably will work if you're aiming for 80-90% accuracy.
I would go one step further and look for any .xml RSS feed in either a meta tag, or as a link. Then check the feed to see if there are any comment tags (since there are feeds for other purposes too). I would OMIT this for certain blog platforms that don't give you a feed such as Tumblr.

Resources