How can I handle inconsistent markup with XPath?

I have a project where I have to scrape many URLs from many pages. I thought the structure of every page would remain the same, but sometimes it changes and breaks my code.
I need to extract, for example, the abstract of an article and its keywords, both of which are in a separate <p> with the same class "marginB3". So I scraped a page and only got two results, one for the abstract and the other one for the keywords:
hxs = HtmlXPathSelector(response)
lista = hxs.select('//p[@class="marginB3"]/text()')
self.abstracto = lista[0].extract()
self.keywords = lista[1].extract()
I then tried a third page, and a new <p> appeared with some additional information about the article, which altered the structure. That makes it more complicated since there are no ids, only classes. How can I tell which <p> is the one for the keywords, without ids, given that each of them has its own <h2> above it:
<h2>Info</h2>
<p class="marginB3">a_url_I_want</p>
Can I do this differentiation by reading that <h2> and then the <p> below it?

You certainly can.
Try this:
# First <p>
hxs.select('//h2/following-sibling::p[@class="marginB3"][1]/text()').extract()
# Second <p>
hxs.select('//h2/following-sibling::p[@class="marginB3"][2]/text()').extract()
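If you want to key off the heading text itself rather than the position, you can anchor the expression on the <h2> and take its first following sibling with that class. Here is a minimal sketch using lxml (the same XPath string can be passed to hxs.select as well); the file name and the heading texts "Info" and "Keywords" are assumptions taken from the question, not known values:
# Sketch: pick each <p class="marginB3"> by the text of the <h2> above it,
# so an extra paragraph elsewhere on the page doesn't shift the indexes.
# "article.html", "Info" and "Keywords" are placeholders.
from lxml import html

tree = html.parse("article.html")

def para_under_heading(tree, heading):
    xpath = ('//h2[normalize-space()="%s"]'
             '/following-sibling::p[@class="marginB3"][1]/text()' % heading)
    found = tree.xpath(xpath)
    return found[0].strip() if found else None

info = para_under_heading(tree, "Info")
keywords = para_under_heading(tree, "Keywords")
One caveat: following-sibling reaches past the current section, so if a heading has no marginB3 paragraph of its own this will pick up the next one further down the page.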

I am not an XPath expert, but I think you need to look at the "following" axis to catch the items after the <h2> tag.
In general, XPath does poorly when the document you are trying to parse isn't well marked up. At the risk of adding even more complexity, you could look at something like the BeautifulSoup module, which allows a more procedural way of coping with inconsistent markup. XPath is a (mostly) declarative language, and declarative languages have a hard time coping with non-regular input.
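For what that more procedural style looks like in practice, here is a minimal sketch with BeautifulSoup; the file name and the heading names are assumptions based on the snippet in the question:
# Minimal BeautifulSoup sketch of the procedural approach: walk the <h2>
# headings and take whichever <p class="marginB3"> follows each one, so an
# extra paragraph elsewhere on the page doesn't shift any indexes.
# "article.html" and the heading names are placeholders.
from bs4 import BeautifulSoup

with open("article.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

sections = {}
for h2 in soup.find_all("h2"):
    p = h2.find_next_sibling("p", class_="marginB3")
    if p is not None:
        sections[h2.get_text(strip=True)] = p.get_text(strip=True)

abstract = sections.get("Abstract")
keywords = sections.get("Keywords")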


Webscraping Selectors

At what level of hierarchy do you begin your selectors?
There seems to be a convention of beginning with the container of the target element, but why not start from the target element itself, especially when it has an id, or from a wildcard plus a unique identifier?
Recursive descent seems like everyone's best friend.
XPaths and CSS selectors are very versatile and can describe the same element in many different ways - i.e. a single element has infinitely many possible locators that describe it. The goal is to get something that fits the needs of the developer, which might include being readable, unique, and/or adaptive.
Consider the following html example:
<div id='mainContainer'>
<span>some span</span>
</div>
If I were trying to make a locator for the <span> element, I wouldn't choose //span, because that will probably yield way too many results. Instead you could start with its parent, which has an id, and then proceed to the span: //*[@id='mainContainer']/span, and alternatively: //span[parent::*[@id='mainContainer']]. Which XPath is better? Whichever one you personally find more readable. I agree with you that the first example does seem to be more common, although I myself am more partial to the latter.
Sometimes the point of making a locator a certain way is to be adaptable. For instance, I rarely write a locator like this: //*[@class='fooBar']. The reason is that in modern web development classes come and go frequently, and it's likely that that element's class could change at the slightest breeze. Instead you might write //*[contains(@class,'fooBar')]. Now when a developer goes in and adds a class for pure styling, you don't have to go back and update all of your Selenium tests. That is also the reason I use wildcard characters frequently. If a developer goes in and updates a div to a span, my test will still work.
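To make the "classes come and go" point concrete, here is a small self-contained illustration; lxml is used only so the XPath can be run outside a browser, and the markup is invented for the example:
# An exact class match breaks as soon as a second class is added for styling;
# contains(@class, ...) keeps matching. The markup below is made up.
from lxml import html

doc = html.fromstring(
    '<div><p class="fooBar">old markup</p>'
    '<p class="fooBar highlight">restyled markup</p></div>')

print(doc.xpath('//*[@class="fooBar"]/text()'))             # ['old markup']
print(doc.xpath('//*[contains(@class, "fooBar")]/text()'))  # ['old markup', 'restyled markup']
The trade-off is that contains() also matches class names that merely contain the substring (e.g. fooBarBaz), so pick a token that is unlikely to collide.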
As @Gilles Quenot commented, it isn't always safe to assume that ids are unique. Many websites were written by someone's unemployed uncle who took an HTML class back in '86. They are terrible, and don't care at all about standards or audits. This is another reason that you need to include enough information in your locator to specify the exact element or elements you are talking about, but not so much information that you end up describing too many elements.
One more comment is that XPaths are bidirectional, whereas CSS selectors are not. This means XPaths can go from child to parent and from parent to child, where CSS selectors can only go from parent to child. This affects which node you are starting at, and may be a reason that you see more CSS selectors start from a parent/ancestor node.
TL;DR There isn't a convention, just personal preferences. Do what meets your needs.

Microdata for a dictionary: can I use Yandex?

I want to use microdata/microformats/etc. for the part of my website that is an online dictionary. Basically I just want to tag words and definitions to help search engines grab the most important data on every page belonging to the dictionary, and maybe have Google use them as "rich snippets" in results pages.
The main problem is that it's hard to find a dedicated vocabulary for words and definitions (no problem for recipes, movies and hotels, though), and I'm not sure whether I have to use the "http://schema.org/Article" tree for my lexicographic work. (To my mind, it only makes sense to tag something when the tagging is specific enough.)
I have found something interesting at Yandex, for words and encyclopedia entries, and I want to ask what to do with it. See here:
https://yandex.ru/support/webmaster/microdata/what-is-microdata.xml?lang=en
https://yandex.com/support/webmaster/microdata/term-definition-markup.xml
It looks very close to what I need. But I'm sorry, I don't know what Yandex is... will it work with Google?
I'm asking whether that Yandex page describes a working model, whether it is still in use, and what the pros and cons are. Will Google be able to use Yandex's specific vocabulary and understand my Yandex-tagged data? Is it worth using that vocabulary for an online dictionary, or is there something better that I have missed?
(http://webmaster.yandex.ru/vocabularies/term-def.xml, which should be the vocabulary url, gives me a 404).
One more question, please: am I allowed to write (duplicate) the most important data in the header, with something like the code below? (I believe I am, because Google's microdata testing tool proves able to extract the data from that code.)
<html itemscope itemtype="http://webmaster.yandex.ru/vocabularies/term-def.xml">
<meta itemprop="term" content="My term" />
<meta itemprop="definition" content="My definition" />
Just to mention that I was interested in, though not satisfied by, these related discussions:
https://webmasters.stackexchange.com/questions/55073/what-meta-tag-or-structured-data-should-i-use-for-a-dictionary-web-application
schema.org and an online dictionary
Yandex is Russia's version of Google, and typically they both recognize and honor each other's search engine result implementations.
The articles you are referencing are incredibly outdated; I recommend seeking out fresher sources, preferably ones where the term being defined uses the proper HTML element.
Here's the Yandex URL that is 404ing; the Wayback Machine is your friend!
Back to fresher documentation/resources: in this case the correct element, as of 2016-10-05, is the <dfn> element. I know you want added semantics, but semantics is the proper place to start. I'd follow that up by marking the entire dictionary up as a definition list (<dl>), placing the term, wrapped in the <dfn> element, into the <dt>, and the definitions of the term into the corresponding <dd>s.
I wouldn't waste time trying to find the perfect ontology here; implement the rel="tag" microformat on all of the definitions, and you can always come back and add a more desirable one later.
I've written a blog post about this, but a much more valuable resource is HTML5 Doctor's glossary implementation. More importantly, view its source: view-source:http://html5doctor.com/element-index/ (why Stack Overflow doesn't recognize the 'view-source' scheme is beyond me).
More References/Resources:
Microformats Definition Examples has some very interesting ideas/code snippets
Utilizing the Underused but Semantically Awesome Definition List - written prior to HTML5's redefinition of <dl>, but still relevant

Getting HTML table values using XPath in Nokogiri?

I'm trying to get some values from a table using the XPath of this table but it only returns [] (empty):
require 'nokogiri'
require 'open-uri'
url = "http://riopretrans.com.br/linhas.php?ln=106"
doc = Nokogiri::HTML(open(url))
doc.xpath("html/body/table[1]/tbody/tr[2]/td/table/tbody/tr/td/table/tbody/tr[2]/td/div/table[1]/tbody/tr[3]/td/div/div/center/font/table").each do |lines|
puts lines.content
end
I found the table's XPath using Firebug so I think it's correct.
Can anyone help me?
Remove tbody/ from your XPath.
The tbody tag is part of the HTML spec for table tags, but it's rarely actually written in the HTML. Some browsers insert it, even though it's not in the HTML for the page. Firebug then sees it, you see it in Firebug, and you think it must be in the source.
Even using "view source" can confuse you, because you expect that to be accurate, but the browser has already munged the content to include "tbody", so, well, basically they're lying to you.
You can confirm this by looking at the HTML that Nokogiri is getting. Use puts doc.to_html['tbody'] and see if you get "tbody" or nil.
...Because in the HTML file all of them were specified (written by the programmer)
If you are positive they actually belong there, because they exist in the HTML source, then you'll need to take apart your XPath. Start with a broad path, and slowly add to it to narrow down your search.
The server is unreachable for me right now, so I can't confirm that, or dig into what the hierarchy should be, and show an example. (That's why actually giving us REAL HTML in your question is SO much better than a link which might not work.)
An alternate is to use XPath's // (search anywhere) with a less restrictive path, or CSS selectors. Either way, actually examine the HTML, instead of relying on Firebug's XPath, and determine what "landmarks" you can use in the source to navigate to your desired table. Today's HTML is chock-full of id and class parameters, or a particular series of tags that act as a finger-print for the table you want. Search for the minimum needed to pin-point that table.
If the table is something like <table id="foo">, then use doc.at('table#foo'). If it's in a <div class="bar"><table> use doc.at('div.bar table'). In any case, use the smallest sized accessor necessary to get the job done. That will increase your chances of success if anything in the HTML changes in the future.
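The same idea, sketched in Python with lxml purely for illustration (doc.at('table#foo') and doc.at('div.bar table') above are the Nokogiri equivalents); the id "foo" and class "bar" are invented landmarks, not values from the page in the question:
# Sketch of the "smallest accessor that pins down the table" idea.
# The id and class names are placeholders.
from lxml import html

doc = html.parse("page.html")

table = doc.xpath('//table[@id="foo"]')              # best case: the table has an id
if not table:
    table = doc.xpath('//div[@class="bar"]//table')  # otherwise: a nearby landmark

if table:
    for row in table[0].xpath('.//tr'):
        print([cell.text_content().strip() for cell in row.xpath('./td')])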

Internationalizing content-heavy pages into message files seems cumbersome in Play 2?

Internationalization in Play 2 can be done with Messages.get("home.title") and language files. What about when you internationalize a page full of textual content, and not just one specific header or link?
For example, writing a message file for a long page representing, e.g., product info:
_First header_
Some paragraphs of text
...
_Tenth header_
Tenth paragraph and more text
Message file:
a)
product.info = "<many paragraphs of text including headers>"
or splitting one page into html elements
b)
product.info.h1 = "<first header>"
product.info.p1 = "<first para>"
product.info.p2 = "<2nd para>"
Neither solution sounds right to me. In the first, having a vast value for a single key seems like bad convention, and in the latter, splitting a single page into dozens of keys doesn't sound good either.
Big websites often follow the convention www.site.com/en-us/product/1 of having the language in the URL. So the question is, how do I do it that way, and is it a better way at all? I could easily end up not just translating into a dozen languages but also making layout changes a dozen times.
I could use global code snippets backed by the message file for elements that have little text and don't change often, e.g. navigation in /view/global/header/somenavbar.scala.html, but then I just end up with a complex folder structure.
Is there another way, a best practice, for internationalization in Play 2 other than message files?
Take a look at Joscha Feth's solution in the play_authenticate Java sample.
There are templates for emails in 3 languages, for email confirmation, password resetting, etc.
The template for each 'type' of email and each language is kept in a single file, i.e.:
_password_reset_en.scala.html
_password_reset_de.scala.html
_password_reset_pl.scala.html
_verify_email_en... etc
And for each 'type' there is a 'parent' template, which contains a condition (an ordinary Scala match; check the Tags section of the templates documentation) that returns the rendered view depending on the detected language:
password_reset.scala.html
Finally, yes, at the beginning I also thought that this was some kind of madness, but believe me, the technique can be useful. There's room for further improvement, I think. Maybe it would be better to move the language conditioning into the controller; that depends on many factors, and it would be great if you found the time to investigate this topic.
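The dispatch that the 'parent' template performs boils down to "pick the per-language file, fall back to a default". Outside of Play/Twirl, just to show the shape of the logic (the file names mirror the sample above but are otherwise invented), it amounts to something like:
# Shape of the per-language template dispatch: one body per language,
# selected by the detected language code, with English as the fallback.
# The file layout here is illustrative only.
SUPPORTED = {"en", "de", "pl"}

def password_reset_template(lang_code):
    code = lang_code if lang_code in SUPPORTED else "en"
    return "views/_password_reset_%s.scala.html" % code

print(password_reset_template("de"))  # views/_password_reset_de.scala.html
print(password_reset_template("fr"))  # falls back to the English template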

HPricot css search: How do I select the parent/ancestor of a particular element using a string selector?

I'm using HPricot's css search to identify a table within a web page. Here's a sample html snippet I'm parsing:
<table height=61 width=700>
<tbody>
<tr>
<td><font size=3pt color = 'Blue'><b><A NAME=a1>Some header text</A></b></font></td></tr>
...
</tbody></table>
There are lots of tables in the page. I want to find the table which contains the A Name=a1 reference.
Right now, the way I'm doing it is
(page/"a[#name=a1]")[0].parent.parent.parent.parent.parent
I don't like this because
It is ugly
It is error prone (what if the folks who maintain the web page remove the tbody?)
Is there a way to tell hpricot to get me the table ancestor of the specified element?
Edit: Here's the full blown page I'm parsing: http://www.blonnet.com/businessline/scoboard/a.htm
The bits I'm interested in are the two tables, one with the quarterly results and another with the annual results. Right now, the way I'm extracting those tables is by finding the named anchor and moving up from there.
Rohith is right. It is ugly and it is error prone (more than it needs to be). Again as he says it is much more clear with the intent to say "find the closest parent that is a table", and this could go for any child/parent relationship.
If it's "not possible" to do that with hpricot then just say so. But don't just say "it's hopeless to try to do that anyway what's the point". That's a bogus answer. It also doesn't help the next person who comes along (myself) looking for the answer to the same question but for different reasons, which is parsing many pages where differences are ASSUMED and not just feared.
To actually answer the question... I don't know, yet. And I don't have much hope of finding out with hpricot. The documentation is absolutely horridly nonexistent.
But here's a workaround that does about the same thing.
table = (page%"a[#name=a1]").parent
table = table.parent while table.name != "table"
Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.
You might be able to use XPath to find the table then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPath to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the XPath display, and paste that into your lookup.
"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.
"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.
The only thing I'd suggest doing differently is to use % (AKA "at") instead of / (AKA "search"). % returns only the first occurrence, so you can drop the [0] index.
(page%"a[#name=a1]").parent.parent.parent.parent.parent
or
page%'//a[#name="a1"]/../../../../../..'
which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.
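If the goal is literally "the nearest enclosing table", XPath's ancestor axis says that directly and avoids counting parents at all. A sketch with lxml, only so the expression is runnable here; the XPath itself is plain XPath and not tied to any particular library, and the file name is a placeholder:
# ancestor::table[1] is the closest <table> ancestor of the matched element,
# no matter how many wrappers (font, b, tbody, ...) sit in between.
from lxml import html

doc = html.parse("page.html")
hits = doc.xpath('//a[@name="a1"]/ancestor::table[1]')
if hits:
    print(html.tostring(hits[0], pretty_print=True).decode()[:200])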
If you know that the target table is the only one with that width and height, you can use a more specific xpath:
page%'//table[@height=61 and @width=700]'
I recommend Nokogiri over Hpricot.
You can also use XPath from the top of the document down:
irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil
Basically the XPath pattern means:
Find the body tag, then its second table, then the second cell of that table's row; in that cell, locate the second table.
Note: Firefox automatically adds the <tbody> tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.
The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3] according to Firefox so you have to strip the tbody. Also you don't need to anchor at /html.
