Scrape website with Ruby based on embedded CSS styles

In the past, I have successfully used Nokogiri to scrape websites using a simple Ruby script. For a current project, I need to scrape a website that only uses inline CSS. As you can imagine, it is an old website.
What possibilities do I have to target specific elements on the page based on their inline CSS? It seems this is not possible with Nokogiri, or have I overlooked something?
UPDATE: An example can be found here. I basically need the main content without the footnotes. The latter have a smaller font size and are grouped below each section.

I'm going to teach you how to fish. Instead of trying to find what I want, it's sometimes a lot easier to find what I don't want and remove it.
Start with this code:
require 'nokogiri'
require 'open-uri'

URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'

# CSS selectors matching the footnote markup identified so far.
FOOTNOTE_ACCESSORS = [
  'span[style*="font-size: 8.0pt"]',
  'span[style*="font-size:8.0pt"]',
  'span[style*="font-size: 7.5pt"]',
  'span[style*="font-size:7.5pt"]',
  'font[size="1"]'
].join(',')

doc = Nokogiri::HTML(URI.open(URL))  # URI.open comes from open-uri and works on current Rubies

# Strip every matching footnote element, then save the cleaned page locally.
doc.search(FOOTNOTE_ACCESSORS).each do |footnote|
  footnote.remove
end

File.write(File.basename(URI.parse(URL).path), doc.to_html)
Run it, then open the resulting HTML file in your browser. Scroll through the file looking for footnotes you want to remove. Select part of their text, then use "Inspect Element", or whatever tool you have that will find that selected text in the source of the page. Find something unique in that text that makes it possible to isolate it from the text you want to keep. For instance, I locate footnotes using the font-sizes in <span> and <font> tags.
Keep adding accessors to the FOOTNOTE_ACCESSORS array until you have all undesirable elements removed.
This code isn't complete, nor is it written as tightly as I'd normally do it for this sort of task, but it will give you an idea of how to go about it.
This is a version that is a bit more flexible:
require 'nokogiri'
require 'open-uri'

URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'

# Each entry is searched separately, so CSS selectors and XPath
# expressions can be mixed freely.
FOOTNOTE_ACCESSORS = [
  'span[style*="font-size: 8.0pt"]',
  'span[style*="font-size:8.0pt"]',
  'span[style*="font-size: 7.5pt"]',
  'span[style*="font-size:7.5pt"]',
  'font[size="1"]',
]

doc = Nokogiri::HTML(URI.open(URL))

FOOTNOTE_ACCESSORS.each do |accessor|
  doc.search(accessor).each do |footnote|
    footnote.remove
  end
end

File.write(File.basename(URI.parse(URL).path), doc.to_html)
The major difference is that the previous version assumed all entries in FOOTNOTE_ACCESSORS were CSS selectors. With this change, XPath can also be used. The code will take a little longer to run as the entries are iterated over, but the ability to dig in with XPath might make it worthwhile for you.
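For example, the array could mix CSS and XPath entries like this (a sketch; the XPath entries here are just alternate spellings of selectors already shown above):
# Nokogiri's #search accepts both CSS selectors and XPath expressions.
FOOTNOTE_ACCESSORS = [
  'span[style*="font-size: 8.0pt"]',               # CSS
  '//span[contains(@style, "font-size:7.5pt")]',   # XPath
  '//font[@size="1"]',                             # XPath
]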

You can do something like:
doc.css('*[style*="foo"]')
That will select any element with foo appearing anywhere in its style attribute.
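Applied to the footnote case above, that might look something like this (the font sizes are the ones already identified; the rest is a sketch):
# Remove any element whose inline style mentions one of the small footnote font sizes.
doc.css('*[style*="8.0pt"], *[style*="7.5pt"]').each(&:remove)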

Related

Cannot narrow down an xpath using mechanize in ruby to extract a table

I'm trying to get the data contained after this highlighted td tag:
(screenshot taken from the Firefox developer tools)
but I don't understand how I'm going to get there. I tried this XPath:
page.parser.xpath("//table//tbody//tr//td//ul//form//table//tbody//tr/td")
but this doesn't work, and I assume that's because I'm not identifying anything specific. I'm not sure how to identify the elements, though, since some of them have no ids or names. So the question is: how do I reach this tag?
Probably because the tbody isn't really there. There's a lot of discussion on SO about that particular issue.
Here's what I would probably do:
td = page.at '#tF0 .txt2'
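A slightly fuller sketch, assuming the page is fetched with Mechanize and that #tF0 and .txt2 are the id and class visible in the developer tools (the URL is a placeholder):
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/the-page')  # placeholder URL

# Mechanize pages respond to #at/#search, delegating to the underlying Nokogiri parser.
td = page.at('#tF0 .txt2')
puts td.text.strip if td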

Passing Markdown Content to Ruby Function With Jekyll/Liquid

I am trying to write a jekyll plugin that will take a normal markdown file and provide some extra functionality on top of it. In particular, I need to do some (not actually) fancy things with tables. I know you can write straight HTML into a markdown file, but there is a requirement that the content folks don't want to / can't edit HTML.
As an extra wrench in the works, the mobile layout has a UX requirement that I essentially have to render the table as a group of divs as opposed to a table.
My initial thought was to pass the {{page.content}} variable to a ruby function extending Liquid::Tag. From there I was planning on parsing the markdown file and either:
1. If normal non-table markdown, use as normal
2. If table markdown, look for custom identifier in markdown, do what needs to be done (e.g. add class, etc)
If I do something like this:
def render(context)
  content = Liquid::Template.parse(@markup).render(context)
end
It renders the context as a normal markdown file. However, I want to break up the context variable and work with the pieces before rendering. I've tried a few different approaches that I've gotten from the jekyll docs and Stack Overflow and gotten nowhere.
Does anyone have any experience with this? Am I heading down the right path? For what it's worth, Ruby/Jekyll/Liquid is fairly new to me, so if you think I may have missed something fairly basic and obvious then please let me know.
A markdown table tool for editors!
Markdownify your table at http://www.tablesgenerator.com/markdown_tables
Paste the markdown result into http://prose.io/
Done.
I don't know of another way to simplify editors' work in Jekyll, but I'll be very interested in hearing about your project. Good luck.
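If you do want the plugin route described in the question, a minimal sketch of a Liquid block tag that intercepts its content before handing it to the Markdown converter might look like this (the tag name, file name, and class substitution are assumptions, not a working plugin):
# _plugins/custom_table.rb  (hypothetical plugin)
module Jekyll
  class CustomTableBlock < Liquid::Block
    def render(context)
      site      = context.registers[:site]
      converter = site.find_converter_instance(Jekyll::Converters::Markdown)

      # `super` returns the raw markdown between {% customtable %} ... {% endcustomtable %}.
      markdown = super

      # Inspect or rewrite the markdown pieces here before converting.
      html = converter.convert(markdown)
      html.gsub('<table>', '<table class="responsive-table">')
    end
  end
end

Liquid::Template.register_tag('customtable', Jekyll::CustomTableBlock)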

extract xpath

I want to retrieve the XPath of an attribute (for example, the "brand" of a product on a retailer website).
One way of doing it is to use Firefox add-ons like XPather or XPath Checker, open the website in Firefox, and right-click the attribute I am interested in. This is OK, but I want to capture this information for many attributes, and right-clicking each and every one may be time-consuming. The other problem is that some attributes I am interested in appear only on one product and other attributes only on some other product, so I would have to go to that product and do it manually again.
Is there an automated or programmatic way of retrieving the XPath of the desired attributes from a website, rather than having to do this manually?
Note that not all websites serve valid XML that you can run XPath on directly...
That said, you should check out HTML parsers that will allow you to use XPath on HTML even when it is not valid XML.
Since you did not specify the technology you are working with, I'll suggest the .NET HTML Agility Pack; if you need others, search for questions dealing with this here on SO.
The solution I use for this kind of thing is to write XPath expressions something like these:
//*[text()="Brand"]/following-sibling::*
//*[text()="Color"]/following-sibling::*
//*[text()="Size"]/following-sibling::*
//*[text()="Material"]/following-sibling::*
It works by finding all elements (labels) with the text you want and then looking at the next sibling in the HTML. Without a specific URL to look at, I can't help any further.
This is a generalised version; you can make more specific versions by replacing the asterisks with tag names, and you can navigate differently by replacing the following-sibling axis with something else.
I use XPaths in import.io to make APIs for this kind of thing all the time. It's just a matter of finding an XPath that's generic enough to find the HTML no matter where it sits on the page, but specific enough to pull the right data.
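Since the original question on this page is about Ruby, here is a sketch of how those expressions could be evaluated with Nokogiri (the URL and label list are placeholders):
require 'nokogiri'
require 'open-uri'

# Placeholder URL; the labels assume the page shows "Brand", "Color", etc. as element text.
doc = Nokogiri::HTML(URI.open('http://example.com/product'))

%w[Brand Color Size Material].each do |label|
  value = doc.at_xpath(%Q{//*[text()="#{label}"]/following-sibling::*})
  puts "#{label}: #{value.text.strip}" if value
end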

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
First, if you need to parse a web page, I would use HtmlAgilityPack to transform it to XML. It will speed everything up and enable you, with a simple XPath, to go directly to the BODY.
After that, you have to run over all the divs (you can get all the DIV elements in a list from the Agility Pack) and get whatever you want.
There's a simple technique to do this, based on analysing how "noisy" the HTML is, i.e., the ratio of markup to displayed text throughout the page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique, giving some Python code to illustrate it.
Cf. also the HTML::ContentExtractor Perl module, which implements this idea. If you wanted to use this approach, it would make sense to clean the HTML first, using Beautiful Soup.
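As a rough illustration of the markup-to-text idea in Ruby (a simplified heuristic sketch, not the algorithm from the article):
require 'nokogiri'

# Score each <div> by how much plain text it holds relative to its markup,
# then sort so the densest content blocks come first.
def densest_divs(html)
  doc = Nokogiri::HTML(html)
  doc.css('div').map { |div|
    text_length   = div.text.strip.length
    markup_length = div.inner_html.length
    ratio = markup_length.zero? ? 0 : text_length.to_f / markup_length
    [div, text_length * ratio]
  }.sort_by { |_, score| -score }
end

# Usage: densest_divs(File.read('page.html')).first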
I would recommend Vit Baisa's thesis on Web Content Cleaning; I think he has some code too, but I can't find a link to it. There is also a discussion of this very problem on the LingPipe natural language processing blog.

How to implement Watir classes (e.g. PageContainer)?

I'm writing a sample test with Watir where I navigate around a site with the IE class, issue queries, etc.
That works perfectly.
I want to continue by using PageContainer's methods on the last page I landed on.
For instance, using its HTML method on that page.
Now I'm new to Ruby and just started learning it for Watir.
I tried asking this question on OpenQA, but for some reason the Watir section is restricted to normal members.
Thanks for looking at my question.
edit: here is a simple example
require "rubygems"
require "watir"
test_site = "http://wiki.openqa.org/"
browser = Watir::IE.new
browser.goto(test_site)
# now if I want to get the HTML source of this page, I can't use the IE class
# because it doesn't have a method which supports that
# the PageContainer class, does have a method that supports that
# I'll continue what I want to do in pseudo code
Store HTML source in text file
# I know how to write to a file, so that's not a problem;
# retrieving the HTML is the problem.
# more specifically, using another Watir class is the problem.
Close browser
# end
Currently, the best place to get answers to your Watir questions is the Watir-General email list.
For this question, it would be nice to see more code. Is the application under test (AUT) opening a new window/tab that you were having trouble getting to and therefore wanted to try the PageContainer, or is it just navigating to a second page?
If it is the first, you want to look at #attach; if it is the second, then I would recommend reading the quick start tutorial.
Edit after code added above:
What I think you missed is that Watir::IE includes the Watir::PageContainer module, so you can call browser.html to get the HTML displayed on the page to which you've navigated.
I agree. It seems to me that browser.html is what you want.
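Putting that together with the pseudocode from the question, something along these lines should work (a sketch against the old Watir::IE API; the output file name is arbitrary):
require 'rubygems'
require 'watir'

browser = Watir::IE.new
browser.goto('http://wiki.openqa.org/')

# Watir::IE mixes in Watir::PageContainer, so #html is available directly.
File.open('openqa_wiki.html', 'w') { |f| f.write(browser.html) }

browser.close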
