Get computed font-size in Rails - ruby

I'm using Rails 3 to scrape a website, and doing a query like so:
agent = Mechanize.new
doc = agent.get(url)
I'm then doing
doc.search("//div")
Which returns a list of all divs on the page. I'd like to select the div that has the largest font size. Is there any way to use Mechanize, Nokogiri, or any other gem to find the computed font-size of a div and, from there, choose the one where it is largest?
Thanks

You can't do this with Mechanize or Nokogiri, because they only read the static HTML. Font size is rarely defined in the HTML itself anymore; it is generally set in CSS or applied programmatically with JavaScript.
The only solution is to execute JavaScript and use getComputedStyle, which returns the font size actually applied to an element (via either CSS or JS). So you need a way to inject JS into your pages and get a result back. This should be possible with watir-webdriver, because Selenium has hooks to do this. See the very end of this page for instructions on how to inject JS and return a result to the caller in Selenium. Another option is PhantomJS, a headless browser with a JS API.
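For illustration, here is a minimal sketch of that approach using watir-webdriver (assuming url holds the page address; watir-webdriver passes element arguments through to Selenium's JS executor):

require 'watir-webdriver'

browser = Watir::Browser.new :firefox
browser.goto url

# Ask the browser for each div's computed font-size and keep the largest.
# getComputedStyle returns a string like "16px", so compare the numeric part.
largest_div = browser.divs.max_by do |div|
  browser.execute_script(
    'return window.getComputedStyle(arguments[0]).fontSize;', div
  ).to_f
end

puts largest_div.text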

Related

I need to scrape that number xpath + aspx

I'm trying to get the total number of pages (the value displayed as 1/58) from this URL:
http://scorelibrary.fabermusic.com/String-Quartet-No-2-300-Weihnachtslieder-23032.aspx
I think that I can't because the value is inside the ASPX frame.
I've tried a lot of things. This is the line:
<label id="page_count">1/58</label>
using the following XPath
//label[@id='page_count']/text()
How can I use XPath inside the ASPX frame to get the page value?
You are right, you cannot get that value directly, because the element is in an <iframe> and therefore lives in a different document. You need to switch to the iframe's context. There are JavaScript approaches like postMessage, but I think the easiest way is to load the URL of the iframe directly and access the DOM from there.
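For example, with Mechanize/Nokogiri that could look like this (a sketch; it assumes the label lives inside the first iframe on the page and that the frame's URL is in its src attribute):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://scorelibrary.fabermusic.com/String-Quartet-No-2-300-Weihnachtslieder-23032.aspx')

# Pull the iframe's own URL out of the outer page, then load it directly.
# Mechanize resolves a relative src against the current page.
frame = agent.get(page.at('iframe')['src'])

# The label is now part of the document we actually have.
puts frame.at("//label[@id='page_count']").text   # e.g. "1/58"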

Scrolling down custom scroller with Selenium for Ruby

I need to access some data that is only shown after scrolling a custom scroll bar inside a website (not the page's normal scrolling).
Selenium seems to be unable to locate the data without performing that scroll first.
I have checked similar questions, but all of them show how to scroll the page itself rather than a bar inside the UI, or they provide solutions for other languages like Python.
Is it possible to do this with the selenium-webdriver for Ruby?
This is the website: http://www.lamiecaline.com/fr/magasins?address=&city=70
The elements I want to access are on the left side; by default, Selenium is only able to reach the first four.
You might want to try execute_script with parameters, like so:
execute_script("arguments[0].scrollTop = arguments[1];", myElement, pixels)
You may have to require the page-object gem.
myElement would be your mCSB_container div.
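A sketch of that suggestion with plain selenium-webdriver (the .mCSB_container class comes from the custom scrollbar markup on the page; the 500-pixel offset is an arbitrary example value):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'http://www.lamiecaline.com/fr/magasins?address=&city=70'

# Scroll the custom scrollbar's inner container rather than the window.
container = driver.find_element(css: '.mCSB_container')
driver.execute_script('arguments[0].scrollTop = arguments[1];', container, 500)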

scrapy xpath select elements by classname

I have followed How can I find an element by CSS class with XPath?, which gives the selector to use for selecting elements by class name. The problem is that when I use it, it retrieves an empty result ("[]"), and I know for a fact there is a div with class "zoomWindow" at the URL fed to the scrapy shell.
My attempt:
scrapy shell "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/?ObjectPath=/Shops/NICShop/Products/5696"
response.xpath("//*[contains(#class, 'zoomWindow')]")
I have looked at many resources that provide varied selectors. In my case the element has only one class, so I also tried the versions that use "concat", but they didn't work and I discarded them.
I have installed Ubuntu and Scrapy in a virtual machine just to make sure it was not a bug in my installation on Windows, but my attempt on Ubuntu had the same results.
I don't know what else to try, can you see any typo in the selector?
If you check response.body in the shell, you will see that it doesn't contain an element with class="zoomWindow":
In [3]: "zoomWindow" in response.body
Out[3]: False
But if you open the page in the browser and inspect the HTML source, you will see that the element is there. This means that the page load involves JavaScript logic or additional AJAX requests. Scrapy is not a browser and doesn't have a JavaScript engine built in. In other words, it only downloads the initial HTML code of the page, without additionally downloading js and css files and "executing" them.
What you can try, for starters, is to use scrapyjs download handler and middleware.
The image you want to extract is also available in the img tag with id="PreviewImage":
In [4]: response.xpath("//img[@id='PreviewImage']/@src").extract()
Out[4]: [u'/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png']

click on xpath link with Mechanize

I want to click a link with Mechanize that I select with xpath (nokogiri).
How is that possible?
next_page = page.search "//div[@class='grid-dataset-pager']/span[@class='currentPage']/following-sibling::a[starts-with(@class, 'page')][1]"
next_page.click
The problem is that a Nokogiri element doesn't have a click method.
I can't read the href (URL) and send a GET request, because the link only has an onclick function defined (no href attribute).
If that's not possible, what are the alternatives?
Use page.at instead of page.search when you're trying to find only one element.
You can make your selector simpler (shorter) by using CSS selector syntax:
next_page = page.at('div.grid-dataset-pager > span.currentPage + a[class^="page"]')
You can construct your own Link instance if you have the Nokogiri element, page, and mechanize object to feed the constructor:
next_link = Mechanize::Page::Link.new( next_page, mech, page )
next_link.click
However, you might not need that, because Mechanize#click lets you supply a string with the text of the anchor/button to click on.
# Assuming this link text is unique on the page, which I suspect it is
mech.click next_page.text
Edit after re-reading the question completely: However, none of this is going to help you, because Mechanize is not a web browser! It does not have a JavaScript engine, and thus won't (can't) execute your onclick for you. For this you'll need to use Ruby to control a real web browser, e.g. using Watir or Selenium or Celerity or the like.
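For example, with Watir the same XPath can be reused, but the click happens in a real browser, so the link's onclick handler actually runs (a sketch; the URL is a placeholder):

require 'watir-webdriver'

browser = Watir::Browser.new :firefox
browser.goto 'http://www.example.com/dataset'   # placeholder URL

# Same sibling-of-current-page XPath as above, clicked in a real browser.
browser.element(xpath:
  "//div[@class='grid-dataset-pager']/span[@class='currentPage']" \
  "/following-sibling::a[starts-with(@class, 'page')][1]"
).click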
In general you would do:
page.link_with(:node => next_link).click
However, like Phrogz says, this won't really do what you want.
Why don't you use an Hpricot element instead? Mechanize can click on an Hpricot element as long as the link has a 'src' or 'href' attribute. Try something along these lines:
page = agent.get("http://www.example.com")
next_page = agent.click((page/"//your/xpath/a"))
Edit: After reading Phrogz's answer I also realized that this won't really do it. Mechanize doesn't support JavaScript yet. With this in mind you have three options.
Use a library that controls a real web browser. See Phrogz's answer.
Use Capybara, which is an integration-testing library but can also be used as a stand-alone crawler. I've done this successfully with HtmlUnit, which is also an integration-testing library, in Java. Capybara comes with Selenium support by default, though it also supports WebKit via an external gem. Capybara interprets JavaScript out of the box. This blog post might help.
Grok the page that you intend to crawl and use something like HTTPFox to monitor what the onclick JavaScript function does, then replicate this in your Mechanize script.
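For option 3, a hypothetical sketch (the endpoint and the 'page' parameter are made up for illustration; replace them with whatever request HTTPFox actually shows the onclick sending):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com/dataset')   # placeholder URL

# Suppose monitoring shows the onclick posts to a paging endpoint;
# replaying that request fetches the next page without any JavaScript.
next_page = agent.post('http://www.example.com/dataset/grid', 'page' => '2')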
Good luck.

How to extract website information using XPath inside firefox extension?

I have made a Firefox extension which loads a web page using XMLHttpRequest.
My extension has its own window, opened alongside the main Firefox window.
The idea of my extension is to load a webpage in memory, modify it, and publish it in a newly opened tab in Firefox.
The webpage has a div with id "Content", and that's the div I want to modify. I have been using XPath a lot in Greasemonkey scripts, so I wanted to use it in my extension too. However, I have a problem: it doesn't work as I would expect. I always get a result of 0.
var pageContents = result.responseText; //webpage which was loaded via xmlhttprequest
var localDiv = document.createElement("div"); //div to keep webpage data
localDiv.innerHTML = pageContents;
// trying to evaluate and get the div i need
var rList = document.evaluate('//div[@id="content"]', localDiv, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
The result is always 0, as I said. I created the local div to hold the webpage data because I cannot run XPath against plain text. And document in this case is my extension's XUL document/window.
I expected it to work, but I was wrong.
I know how to extract the div using string.indexOf(str) and then slice(..). However, that's very slow and not handy, because I need to modify the contents: change the background and borders of the many forms inside this div. For this job, I have not seen a better method than evaluating XPath to get all the nodes I need.
So the main question is: how do I use XPath to parse a loaded web page in a Firefox extension?
Thank you
Why not load the page in a tab, then modify it in place, like Greasemonkey does?
As for your code, you don't say where it executes (i.e. what is document.location?), but assuming it runs in a XUL window, it makes no sense: document.createElement will not create an HTML element (it creates a XUL div element, which has no special meaning), innerHTML shouldn't work for such an element, etc.
