There is some data on this page:
$ scrapy shell "https://partsouq.com/en/catalog/genuine/unit?c=Toyota&ssd=%24HQwdcgcAAwFNa3YjVR92aVB7C10ZDko%24&vid=4463&cid=&uid=2535&q="
and there are numbers on the left-hand side of the page. After clicking on any one of them, a table with content appears, as in the attachment, but after running "inspect element" on any item in this table, I get an empty set:
response.xpath('//*[@id="gf-result-table"]/tr[2]/td[2]/div').extract()
[]
This shows the table and the HTML code for it.
You are giving the wrong XPath. The correct XPath is:
response.xpath('//*[@id="gf-result-table"]/tbody/tr[2]/td[2]/div')
https://partsouq.com/en/search/search?q=0910112012&qty=1
This is the URL of the attachment. The pop-up window is rendered by JavaScript, and you cannot do JS things in Scrapy.
And the xpath for the a tag is simple:
//a[@id]
So I would go to an Instagram account, say https://www.instagram.com/foodie/, to copy the XPath that gives me the number of posts, number of followers, and number of following.
I would then run this command in a scrapy shell:
response.xpath('//*[@id="react-root"]/section/main/article/header/section/ul')
to grab the elements in that list, but Scrapy keeps returning an empty list. Any thoughts on what I'm doing wrong here? Thanks in advance!
This site is a Single Page Application (SPA), so the DOM is rendered by JavaScript, and that JavaScript has not run yet at the time your downloader is working.
When you use view(response), the JavaScript your downloader collected can continue rendering in your browser, so you can see the page with the DOM rendered (but without interacting with the site's API). If you look at your downloaded content via response.text, you can see that for yourself!
In this case, you can apply Selenium + PhantomJS to produce a rendered page for your spider!
Another trick: you can use a regular expression to select the JSON part of the script, parse it into a JSON object, and pick the corresponding attribute values (number of posts, following, ...) from it!
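A minimal sketch of that regex trick, using a made-up window._sharedData snippet (Instagram's real payload differs and changes often, so the key names below are assumptions for illustration only):

```python
import json
import re

# Stand-in for response.text; the real page embeds a much larger object.
html = """
<script type="text/javascript">
window._sharedData = {"entry_data": {"ProfilePage": [{"user":
  {"edge_owner_to_timeline_media": {"count": 1204},
   "edge_followed_by": {"count": 1500000}}}]}};
</script>
"""

# Grab the JSON object assigned to window._sharedData and parse it.
match = re.search(r'window\._sharedData\s*=\s*(\{.*?\});', html, re.DOTALL)
data = json.loads(match.group(1))

# Key names are hypothetical; inspect the real payload to find the right path.
user = data["entry_data"]["ProfilePage"][0]["user"]
posts = user["edge_owner_to_timeline_media"]["count"]
followers = user["edge_followed_by"]["count"]
print(posts, followers)
```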
I am trying to scrape some data from the following website: https://xrpcharts.ripple.com/
The data I am interested in is Total XRP, which you can see immediately below or to the side (depending on your browser) of the circle diagram. So what I first did was inspect the element I am interested in. I see that it is inside <div class="stat">, in <span ng-bind="totalXRP | number:2" class="ng-binding">99,993,056,930.18</span>.
The number 99,993,056,930.18 is what I am interested in.
So I started in a scrapy shell and wrote:
fetch("https://xrpcharts.ripple.com")
I then used chrome to copy the Xpath by right clicking on that place of HTML code, the result chrome gave me was:
/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span
Then I used the Xpath command to extract the text:
response.xpath('/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span/text()').extract()
but this gave me an empty list []. I really do not understand what I am doing wrong here. I think I am making an obvious mistake, but I don't see it. Thanks in advance!
The bottom line is: you cannot expect the page you see in the browser to be the same page Scrapy would download and have available to work with. Scrapy is not a browser.
This page is quite dynamic and complex and is constructed with the help of multiple asynchronous requests bringing in both the logic and the data. There is also JavaScript executed in the browser that plays an important role in forming and supporting the HTML document object tree.
Scrapy does not have all of these things; what you get from fetch() is just the very first, initial "bare bones" HTML page, without all the "dynamic content".
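The usual workaround is to open the browser's Network tab, find the XHR request that actually carries the number, and fetch that endpoint directly; the response is typically JSON, which is far easier to parse than rendered HTML. A sketch with a hypothetical payload shape (not Ripple's actual API; find the real URL and keys in the Network tab):

```python
import json

# Hypothetical JSON body, standing in for the real XHR response.
body = '{"totalXRP": 99993056930.18}'
data = json.loads(body)

# Reproduce the "number:2" Angular formatting: thousands separators, 2 decimals.
total = f'{data["totalXRP"]:,.2f}'
print(total)
```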
I'm using Chrome Data Miner, and so far, failing to extract the data from my query: http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free
How do I write the XPath for the "Next" element on this website? I tried all the possible web sources; nothing worked.
Thanks in advance!
You could look for a tags (//a) whose descendant::text() starts with "Next" and then get the href attribute of that a element.
% xpquery -p HTML '//a[starts-with(descendant::text(), "Next")]/@href' 'http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free'
href="http://www.allinlondon.co.uk/restaurants.php?type=name&tube=0&rest=glutenfree&region=0&cuisine=0&start=30&ordering=&expand="
I have the following HTML:
<input type="button" value="Close List" class="tiny round success button" id="btnSaveCloseListPanel">
The following code does not work:
# browser.button(:value => "Close List").click # does not work - timeout
browser.button(:xpath => "/html/body/center/div/div[9]/div[2]/input[2]").when_present.click
The error is:
Watir::Wait::TimeoutError:
timed out after 60 seconds
when_present(300) does not work.
I found the XPath using Firefox Developer Tools. I used the complete path to avoid any silly errors. I can find the same path manually in IE.
The component is a .NET MVC popup. I think it's called a "panel". The panel is a grandchild of the Internet Explorer tab.
The panel contains a datepicker, a dropdown, a text box, and 3 buttons. I can't find any of these using Watir. I can find anything in the panel's parent (obviously).
The underlying code does not seem to be aware that something actually doesn't exist. To prove that, I tested the following XPath, which is simply the above XPath with the middle bit removed:
browser.button(:xpath => "/html/body/center/div/input[2]").when_present.click
The error is "timeout", rather than "doesn't exist".
So, the code seems to be unaware that:
input[1] does not exist, therefore input[2] cannot exist.
div[2] does not exist.
Therefore there's nothing left to search.
Added:
I'm changing the specific element that I want to find.
Reason: The button in my OP was at the foot of the panel. I was going cross-eyed trying to step upwards through hundreds of lines of HTML. Instead, I'm now using the first field in the panel. All the previous info is still the same.
The first field is a text field with datepicker.
The HTML is:
<input type="text" value="" style="width:82px!important;" readonly="readonly" name="ListDateClosed" id="ListDateClosed" class="hasDatepicker">
Using F12 in Firefox, the XPath is:
/html/body/center/div/div[1]/div[2]/input
But now, with far fewer lines of HTML, I can clearly see that the html tag is not the topmost html tag in the file. The parent of html is iframe.
I've never used iframe before. Maybe this is what t0mppa was referring to in his comment on the first question.
As an experiment, I modified my XPath to:
browser.text_field(:xpath, '//iframe/html/body/center/div/div[1]/div[2]/input').when_present.set("01-Aug-2014")
But this times out, even with a 3-minute timeout.
Given that the elements are in an iframe, there are two things to note:
Unlike other element types, you must always tell Watir when an element is in an iframe.
XPaths (in the context of Watir) cannot be used to cross into frames.
Assuming that there is only one iframe on the page, you can explicitly tell Watir to search the first iframe by using the iframe method:
browser.iframe.text_field(:xpath, '//body/center/div/div[1]/div[2]/input').when_present.set("01-Aug-2014")
If there are multiple iframes, you can use the usual locators to be more specific about which iframe. For example, if the iframe had an id:
browser.iframe(id: 'iframe_id')
.text_field(xpath: '//body/center/div/div[1]/div[2]/input')
.when_present
.set("01-Aug-2014")
I am new to XPath. I have the HTML source of the webpage
http://london.craigslist.co.uk/com/1233708939.html
Now I want to extract the following data from the above page
Full Date
Email - just below the date
I also want to find the existence of the button "Reply to this post" on the page
http://sfbay.craigslist.org/sfc/w4w/1391399758.html
Can anyone help me write the three XPath expressions for the above three pieces of data?
You don't need to write these yourself, or even figure them out yourself. If you use the Firebug plugin, go to the page, right-click on the element you want, and click 'Inspect Element'; Firebug will pop up the HTML in a viewer at the bottom of your browser. Right-click on the desired element in the HTML viewer and click 'Copy XPath'.
That said, the XPath expression you're looking for (for #3) is:
/html/body/div[4]/form/button
...obtained via the method described above.
I noticed that the DTD is HTML 4.01 Transitional, not XHTML, for the first link, so there's no guarantee that this is a valid XML document, and it may not be loaded correctly by an XML parser. In fact, I see several tags that aren't properly closed (e.g. <hr>).
I don't know the first one off hand, and the third one was just answered by Alex, but the second one is /html/body/a[1] (note that XPath indices start at 1, not 0).
As for your first page, it's just impossible to do, because this is not the way XPath works. For an XPath expression to select something, that "something" must be a node (i.e. an element).
The second page is fairly easy, but you need an "id" attribute in order to do that (or anything that can ensure your button is unique). For example, if you are sure the text "Reply to this post" correctly identifies the button, just do it with:
//button[.="Reply to this post"]