I have followed How can I find an element by CSS class with XPath?, which gives the selector to use for selecting elements by class name. The problem is that when I use it, it returns an empty result "[]", and I know for a fact that there is a div with the class "zoomWindow" at the URL fed to the scrapy shell.
My attempt:
scrapy shell "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/?ObjectPath=/Shops/NICShop/Products/5696"
response.xpath("//*[contains(#class, 'zoomWindow')]")
I have looked at many resources that propose varied selectors. In my case the element only has one class, so I tried the versions that use "concat", but they didn't work either and I discarded them.
I have installed Ubuntu and Scrapy in a virtual machine just to make sure it was not a bug in my Windows installation, but my attempt on Ubuntu had the same results.
I don't know what else to try. Can you see any typo in the selector?
If you check response.body in the shell, you will see that it doesn't contain an element with class="zoomWindow":
In [3]: "zoomWindow" in response.body
Out[3]: False
But if you open the page in a browser and inspect the live DOM, you will see that the element is there. This means that the page load involves JavaScript logic or additional AJAX requests. Scrapy is not a browser and doesn't have a JavaScript engine built in. In other words, it only downloads the initial HTML of the page, without additionally downloading JS and CSS files and "executing" them.
What you can try, for starters, is the scrapyjs download handler and middleware.
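As a rough sketch, assuming you have a Splash instance running locally (scrapyjs later became scrapy-splash, so the exact setting and class names below are assumptions to check against the version you install), the wiring would look something like this:

# settings.py (sketch): point Scrapy at the Splash service and enable the middleware.
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

# spider sketch: ask Splash to render the page before the callback sees it.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "nicshop"

    def start_requests(self):
        url = ("http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/"
               "?ObjectPath=/Shops/NICShop/Products/5696")
        yield scrapy.Request(url, self.parse_product,
                             meta={"splash": {"endpoint": "render.html"}})

    def parse_product(self, response):
        # With the page rendered by a real browser engine, this selector has a chance to match.
        zoom = response.xpath("//*[contains(@class, 'zoomWindow')]")
        self.log("zoomWindow matches: %d" % len(zoom))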
The image you want to extract is also available in an img tag with id="PreviewImage":
In [4]: response.xpath("//img[@id='PreviewImage']/@src").extract()
Out[4]: [u'/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png']
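If you need the absolute URL of the image (the src above is relative), a small follow-up sketch; response.urljoin() exists in recent Scrapy versions, and on older ones urlparse.urljoin(response.url, src) does the same:

src = response.xpath("//img[@id='PreviewImage']/@src").extract()[0]
image_url = response.urljoin(src)  # absolute URL you can feed to an images pipeline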
So I would go to an Instagram account, say https://www.instagram.com/foodie/, and copy the XPath that gives me the number of posts, number of followers, and number of following.
I would then run this command in a scrapy shell:
response.xpath('//*[#id="react-root"]/section/main/article/header/section/ul')
to grab the elements in that list, but Scrapy keeps returning an empty list. Any thoughts on what I'm doing wrong here? Thanks in advance!
This site is a Single Page Application (SPA), so the DOM is rendered by JavaScript and has not been rendered yet at the time your downloader fetches the page.
When you use view(response), the JavaScript that your downloader collected is rendered by your browser, so you see the page with the DOM built (though it can't interact with the site's API). If you look at the downloaded content via response.text, you will see that the data isn't there yet.
In this case, you can use Selenium + PhantomJS to produce a rendered page for your spider.
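A minimal sketch of that approach, assuming Selenium and a PhantomJS binary are installed (the URL is just the profile from the question):

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("https://www.instagram.com/foodie/")
rendered_html = driver.page_source  # the DOM after the JavaScript has run
driver.quit()

You can then load rendered_html into a Scrapy Selector (or lxml), and your original XPath has a chance of matching.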
Another trick: you can use a regular expression to select the JSON part of the script, parse it into a JSON object, and read the values you need (number of posts, followers, following) from it.
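A rough sketch of that trick: at the time, the profile data was embedded in a script tag as window._sharedData = {...};, but both that variable name and the key path below are assumptions, so verify them against response.text:

import json
import re

match = re.search(r'window\._sharedData\s*=\s*(\{.+?\});</script>', response.text)
if match:
    shared_data = json.loads(match.group(1))
    # The exact key path changes over time; inspect the JSON to locate the counters.
    user = shared_data["entry_data"]["ProfilePage"][0]["user"]
    print(user["media"]["count"], user["followed_by"]["count"], user["follows"]["count"])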
I am trying to scrape some data from the following website: https://xrpcharts.ripple.com/
The data I am interested in is Total XRP, which you can see immediately below or to the side (depending on your browser) of the circle diagram. The first thing I did was inspect the element I am interested in: it sits inside <div class="stat">, in <span ng-bind="totalXRP | number:2" class="ng-binding">99,993,056,930.18</span>.
The number 99,993,056,930.18 is what I am interested in.
So I started in a scrapy shell and wrote:
fetch("https://xrpcharts.ripple.com")
I then used Chrome to copy the XPath by right-clicking on that part of the HTML; the result Chrome gave me was:
/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span
Then I used this XPath expression to extract the text:
response.xpath('/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span/text()').extract()
but this gave me an empty list []. I really do not understand what I am doing wrong here. I think I am making an obvious mistake, but I don't see it. Thanks in advance!
The bottom line is: you cannot expect the page you see in the browser to be the same page Scrapy would download and have available to work with. Scrapy is not a browser.
This page is quite dynamic and complex and is constructed with the help of multiple asynchronous requests bringing in both the logic and the data. There is also JavaScript executed in the browser that plays an important role in forming and supporting the HTML document object tree.
Scrapy does not have any of that; what you get when you do fetch() is just the initial "bare bones" HTML page, without all the "dynamic content".
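You can convince yourself of this from the shell. A quick check (the figure is live data, so substitute whatever number the browser shows you at that moment):

fetch("https://xrpcharts.ripple.com/")
"99,993,056,930.18" in response.text  # expected to be False: the value is filled in by JavaScript after the page loads

To actually get the number, open the browser's developer tools, find the XHR or websocket request in the Network tab that delivers the totals, and fetch that data source directly instead of scraping the HTML.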
I'm using Rails 3 to scrape a website, and doing a query like so:
agent = Mechanize.new
doc = agent.get(url)
I'm then doing
doc.search("//div")
Which returns a list of all divs on the page. I'd like to select the div that has the largest font size. Is there any way to use Mechanize, Nokogiri, or any other Rails gem to find the computed font size of a div, and from there, choose the one with the largest font size?
Thanks
You can't do this with Mechanize or Nokogiri, because they simply read the static HTML. Yet font size isn't usually defined in HTML anymore; it is generally defined in CSS or added programmatically using JavaScript.
The only solution is to be able to execute JavaScript and use JavaScript's getComputedStyle method, which can get the font size that has been applied to an element (via either CSS or JS). So you need a way to inject JS into your pages and get a result back. This may be possible using watir-webdriver, because Selenium has hooks to do this. See the very end of this page for instructions on how to inject JS and return a result back to the caller in Selenium. Another option is PhantomJS, which is a headless browser with a JS API.
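For illustration only, here is that idea with Selenium's Python bindings (watir-webdriver exposes the same execute_script hook from Ruby); the URL is a placeholder and the conversion assumes the browser reports sizes in px:

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.example.com")
divs = driver.find_elements_by_css_selector("div")
# Ask the browser for the computed font size of each div, e.g. "16px" -> 16.0
sizes = [(d, float(driver.execute_script(
    "return window.getComputedStyle(arguments[0]).fontSize.replace('px', '');", d)))
    for d in divs]
largest_div = max(sizes, key=lambda pair: pair[1])[0]
driver.quit()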
I want to click a link with Mechanize that I select via XPath (Nokogiri).
How is that possible?
next_page = page.search "//div[@class='grid-dataset-pager']/span[@class='currentPage']/following-sibling::a[starts-with(@class, 'page')][1]"
next_page.click
The problem is that the Nokogiri element doesn't have a click method.
I can't read the href (URL) and send a GET request, because the link has an onclick function defined (no href attribute).
If that's not possible, what are the alternatives?
Use page.at instead of page.search when you're trying to find only one element.
You can make your selector simpler (shorter) by using CSS selector syntax:
next_page = page.at('div.grid-dataset-pager > span.currentPage + a[class^="page"]')
You can construct your own Link instance if you have the Nokogiri element, page, and mechanize object to feed the constructor:
next_link = Mechanize::Page::Link.new( next_page, mech, page )
next_link.click
However, you might not need that, because Mechanize#click lets you supply a string with the text of the anchor/button to click on.
# Assuming this link text is unique on the page, which I suspect it is
mech.click next_page.text
Edit after re-reading the question completely: However, none of this is going to help you, because Mechanize is not a web browser! It does not have a JavaScript engine, and thus won't (can't) execute your onclick for you. For this you'll need to use Ruby to control a real web browser, e.g. using Watir or Selenium or Celerity or the like.
In general you would do:
page.link_with(:node => next_link).click
However like Phrogz says, this won't really do what you want.
Why don't you use an Hpricot element instead? Mechanize can click on an Hpricot element as long as the link has a 'src' or 'href' attribute. Try something along these lines:
page = agent.get("http://www.example.com")
next_page = agent.click((page/"//your/xpath/a"))
Edit: After reading Phrogz's answer I also realized that this won't really do it. Mechanize doesn't support JavaScript yet. With this in mind you have 3 options:
1. Use a library that controls a real web browser. See Phrogz's answer.
2. Use Capybara, which is an integration testing library but can also be used as a stand-alone crawler. I've done this successfully with HTMLUnit, which is also an integration testing library, in Java. Capybara comes with Selenium support by default, though it also supports WebKit via an external gem. Capybara interprets JavaScript out of the box. This blog post might help.
3. Grok the page that you intend to crawl and use something like HTTPFox to monitor what the onclick JavaScript function does, then replicate that request in your Mechanize script.
Good luck.
I have made a Firefox extension which loads a web page using XMLHttpRequest.
My extension has its own window, opened alongside the main Firefox window.
The idea of my extension is to load a webpage in memory, modify it, and publish it in a newly opened tab in Firefox.
The webpage has a div with id "Content", and that's the div I want to modify. I have been using XPath a lot in Greasemonkey scripts, so I wanted to use it in my extension too. However, I have a problem: it doesn't work as I would expect. I always get a result of 0 nodes.
var pageContents = result.responseText; //webpage which was loaded via xmlhttprequest
var localDiv = document.createElement("div"); //div to keep webpage data
localDiv.innerHTML = pageContents;
// trying to evaluate and get the div i need
var rList = document.evaluate('//div[@id="content"]', localDiv, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
The result is always 0, as I said. I created the local div to hold the website data because I cannot run XPath directly against the text. And document in this case is my extension's XUL document/window.
I did expect it to work, but I was wrong.
I know how to extract the div using string.indexOf(str) and then slice(..), but that is very slow and not handy, because I need to modify the contents: change the background and borders of the many forms inside this div. And for that job, I have not seen a better method than evaluating XPath to get all the nodes I need.
So the main question is: how do I use XPath to parse a loaded web page in a Firefox extension?
Thank you
Why not load the page in a tab, then modify it in place, like Greasemonkey does?
As for your code, you don't say where it executes (i.e. what is document.location?), but assuming it runs in a XUL window, it makes no sense: document.createElement will not create an HTML element (it creates a XUL div element, which has no special meaning), innerHTML shouldn't work for such an element, etc.