Confused about scrapy and Xpath - xpath

I am trying to scrape some data from the following website: https://xrpcharts.ripple.com/
The data I am interested in is Total XRP which you can see immediately below or to the side (depending on your browser) of the circle diagram. So what I first did was inspect the element I am interested in. So I see that it is inside <div class="stat" inside span ng-bind="totalXRP | number:2" class="ng-binding">99,993,056,930.18</span>.
The number 99,993,056,930.18 is what I am interested in.
So I started in a scrapy shell and wrote:
fetch("https://xrpcharts.ripple.com")
I then used chrome to copy the Xpath by right clicking on that place of HTML code, the result chrome gave me was:
/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span
Then I used the Xpath command to extract the text:
response.xpath('/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span/text()').extract()
but this gave me an empty list []. I really do not understand what I am doing wrong here. I think I am making an obvious mistake but I dont see it. Thanks in advance!

The bottom line is: you cannot expect the page you see in the browser to be the same page Scrapy would download and have available to work with. Scrapy is not a browser.
This page is quite dynamic and complex and is constructed with the help of multiple asynchronous requests bringing in both the logic and the data. There is also JavaScript executed in the browser that plays an important role in forming and supporting the HTML document object tree.
Scrapy does not have all these things, the thing you get when you do fetch() is just the very first initial "bare bones" HTML page without all the "dynamic content".

Related

Copying the xpath from Instagram inspect (using chrome) returns an empty list

So I would go to an instagram account, say, https://www.instagram.com/foodie/ to copy its xpath that gives me the number of posts, number of followers, and number of following.
I would then run the command this command on a scrapy shell:
response.xpath('//*[#id="react-root"]/section/main/article/header/section/ul')
to grab the elements on that list but scrapy keeps returning an empty list. Any thoughts on what I'm doing wrong here? Thanks in advance!
This site is a Single Page Application (SPA) so it's javascript that render DOM is not rendered yet at the time your downloader working.
When you use view(response) the javascript that your downloader collected can continue render by your browser, so you can see the page with DOM rendered (but can't interacting with Site API). You can look at your downloaded content via response.text and saw that!
In this case, you can apply selenium + phantomjs to making a rendered page for your spider!
Another trick: You can use regular expression to select the JSON part of Script, parse it to JSON obj and select your correspond attribute value (number of post, following, ...) from script!

How to expand elements with ajax using PHP Simple HTML DOM parser?

I use Google Chrome code inspect utility at the page http://7pay.in/to_phone?phone=9101235577 when I scroll down to Bitcoin icon, something happening in ajax: I don't know how ajax works at all, but I really want to parse the Bitcoin address which is not represented as HTML code when you fetch the link given above, until you press Bitcoin button (Its not represented when I curl or wget the page)
Could you please give me a hint on how to work with that kind of ajax elements. I understand logically that by default some parts of page are hidden by Javascript, but I can't quite understand what I should do in order to work with it using PHP Simple DOM parser.

scrapy xpath select elements by classname

I have followed How can I find an element by CSS class with XPath? which gives the selector to use for selecting elements by classname. The problem is when I use it it retrieves an empty result "[]" and I know by fact there is a div classed "zoomWindow" in the url fed to the scrapy shell.
My attempt:
scrapy shell "http://www.niceicdirect.com/epages/NICShop.sf/secAlIVFGjzzf2/?ObjectPath=/Shops/NICShop/Products/5696"
response.xpath("//*[contains(#class, 'zoomWindow')]")
I have looked at many resources that provide varied selectors. In my case the element only has one class, so versions that use "concat" I used but didn't work and discarded.
I have installed ubuntu and scrapy in a virtual machine just to make sure it was not a bug in my installation on windows but my attempt on ubuntu had the same results.
I don't know what else to try, can you see any typo in the selector?
If you would check the response.body in the shell - you would see that it doesn't contain an element with class="zoomWindow":
In [3]: "zoomWindow" in response.body
Out[3]: False
But, if you open the page in the browser and inspect the HTML source, you would see that the element is there. This means that the page load involves javascript logic or additional AJAX requests. Scrapy is not a browser and doesn't have a javascript engine built-in. In other words, it only downloads the initial HTML code of the page without additionally downloading js and css files and "executing" them.
What you can try, for starters, is to use scrapyjs download handler and middleware.
To image you want to extract is also available in the img tag with id="PreviewImage":
In [4]: response.xpath("//img[#id='PreviewImage']/#src").extract()
Out[4]: [u'/WebRoot/NICEIC/Shops/NICShop/547F/0D9A/F434/5E4C/0759/0A0A/124C/58F7/5708.png']

Programmatically Open Link in WebView

I have an NS Window with a WebView.
My program takes in a search query and executes a Google search with it, the results being displayed in the WebView, like a browser.
Instead of displaying the search results in the WebView, I'd like to automatically open the first link and display the contents of that result instead.
As a better example, how do I display the contents of the first result of Google in a WebView?
Is this even possible?
Any help greatly appreciated. Thanks!
You could use the direct Google Search API. That would be more convinient.
https://developers.google.com/custom-search/v1/cse/list?hl=de-DE
Also you could also try to make a google request like the "I'm feeling lucky" button, which will direct you automatically to the first search result.
If you have to parse the HTML, you need to have a look at the HTML structure of the google result page. Look for specific id and class css properties in the div and a tags. If you found the ones, where the actual results are you can start parsing that content. Also i guess it would be easier to put some javascript together, that will find the first result and open it. (More easier than parsing the HTML using obj-c). You can evaluate javascript in the webview using [myWebView stringByEvaluatingJavaScriptFromString: #"put your js code here"].
Sure it is possible.
The first way to accomplish that that goes through my head is to parse the HTML response from Google, then launch a WebView with the first link you extracted.
Take a look at regular expressions to make it easy.

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
First, if you need to parse a web page, I would use HTMLAgilityPack to transform it to an XML. It will speed everything and will enable you, using a simple XPath to go directly to the BODY.
After that, you have to run on all the divs (You can get all the DIV elements in a list from the agility pack), and get whatever you want.
There's a simple technique to do this,based on analysing how "noisy" HTML is, i.e., what is the ratio of markup to displayed text through an html page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this tex, giving some python code to illustrate.
Cf. also the HTML::ContentExtractor Perl module, which implements this idea. It would make sense to clean the html first, if you wanted to use this, using beautifulsoup.
I would recommend Vit Baisa's thesis on Web Content Cleaning, I think he has some code too, but I can't find a link for it. There is also a discussion of the very same problem on the natural language processing LingPipe blog.

Resources