Having a hard time with constructing the proper xpath for this page.
I'm trying to build the spider (have done before few times, so I know the basics) for parsing the names, urls and prices for products but any xpath I tried in scrapy shell doesn't go past 6th item.
last xpath I tried was sel.xpath('//table[#width="1000"]') and yet again, it finished parsing items on Huawei Ascend G700
Can anybody tell me what's wrong?
You should try this xpath:
//table[#width='1000']//td/table[#class='probg']
to locate all product elements on this page.
Related
finally decided to sign up to stackoverflow because of this. So I´d be super grateful about a solution!
I´m trying to get a number of a <span> element. Here is an image of the data box I´m trying to scrape from. It´s on this page: https://de.marketscreener.com/kurs/aktie/SNOWFLAKE-INC-112440376/analystenerwartungen/
The relevant Xpath is //*[#id="highcharts-0oywbsk-200"]/div[2]/div/span/span
I´m trying: =IMPORTXML("https://de.marketscreener.com/kurs/aktie/SNOWFLAKE-INC-112440376/analystenerwartungen/"),"//div[2]/div/span/span")
I´m ignoring the #id-element, this works pretty well with many elements on the same page, but in this case not at all. I ignore the id, because I can´t use it as it changes on every page. Is this ok?
Google Sheets always gives me a #N/A error?! Any idea how to scrape that number?
disabling JavaScript reveals what you can scrape:
There is a google sheet containing a list of MPN's (manufacturer part numbers). Trying to scrape a site called wikiarms for the UPC Codes when I have the MPN for an item.
I have the correct formula for doing this on another site.
=IMPORTXML("http://gun.deals/search/apachesolr_search/"&B1,"//dd/a[../../dt[contains(text(),'UPC')]]|//dd/span[../../dt[contains(text(),'UPC')]]")
Trying to figure out what the correct xpath to complete this formula. Some videos I have watch said to open the page in Chrome and use inspector to select and copy the xpath to complete the importxml function. I tried this with no luck.
Sample
Visit https://www.wikiarms.com/guns?q=20071
In the table there is a button "available in 6 stores" click that to reveal the list. The UPC should be listed after the MPN.
If I copy the xpath in Chrome this is the result
/html/body/div[1]/div/div/div[2]/div/div/div[2]/div[2]/table/tbody/tr[2]/td[5]
=IMPORTXML("https://www.wikiarms.com/guns?q="&B2,"xpath here")
What do I have to add at the end of this formula to pull in the UPC code? I will be using this formula to pull in UPC code for about 1000 items.
Thank you for your help.
Using your sample link, try
=IMPORTXML("https://www.wikiarms.com/guns?q=20071","//td[#class='upc']/a/#title")
and see if it works for you.
I am trying to scrape some data from the following website: https://xrpcharts.ripple.com/
The data I am interested in is Total XRP which you can see immediately below or to the side (depending on your browser) of the circle diagram. So what I first did was inspect the element I am interested in. So I see that it is inside <div class="stat" inside span ng-bind="totalXRP | number:2" class="ng-binding">99,993,056,930.18</span>.
The number 99,993,056,930.18 is what I am interested in.
So I started in a scrapy shell and wrote:
fetch("https://xrpcharts.ripple.com")
I then used chrome to copy the Xpath by right clicking on that place of HTML code, the result chrome gave me was:
/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span
Then I used the Xpath command to extract the text:
response.xpath('/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span/text()').extract()
but this gave me an empty list []. I really do not understand what I am doing wrong here. I think I am making an obvious mistake but I dont see it. Thanks in advance!
The bottom line is: you cannot expect the page you see in the browser to be the same page Scrapy would download and have available to work with. Scrapy is not a browser.
This page is quite dynamic and complex and is constructed with the help of multiple asynchronous requests bringing in both the logic and the data. There is also JavaScript executed in the browser that plays an important role in forming and supporting the HTML document object tree.
Scrapy does not have all these things, the thing you get when you do fetch() is just the very first initial "bare bones" HTML page without all the "dynamic content".
I'm using Chrome Data Miner, and so far, failing to extract the data from my query: http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free
How to code the Next Element Xpath for this website? I tried all the possible web sources, nothing worked.
Thanks in advance!
You could look for a tags (//a) whose descendant::text() starts with "Next" and then get the href attribute of that a element.
% xpquery -p HTML '//a[starts-with(descendant::text(), "Next")]/#href' 'http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free'
href="http://www.allinlondon.co.uk/restaurants.php?type=name&tube=0&rest=glutenfree®ion=0&cuisine=0&start=30&ordering=&expand="
I want to retrieve the xpath of an attribute (example "brand" of a product from a retailer website).
One way of doing it is using addons like xpather or xpath checker to firefox, opening up the website using firefox and right clicking the desired attrbute I am interested in. This is ok. But I want to capture this information for many attributes and right clicking each and every attribute maybe time consuming. Also, the other problem I have is that attributes I maybe interested in will be there for one product. The other attributes maybe for some other product. So, I will have to go that product & then do it manually again.
Is there an automated or programatic way of retrieving the xpath of the desired attributes from a website rather than having to do this manually?
You must notice that not all websites use valid XML that you can use xpath on...
That said, you should check out some HTML parsers that will allow you to use xpath on HTML even if it is not a valid XML.
Since you did not specify the technology you are working with - I'll suggest the .NET HTML Agility Pack, if you need others, search for questions dealing with this here on SO.
The solution I use for this kind of thing is to write an xpath something like this:
//*[text()="Brand"]/following-sibling::*
//*[text()="Color"]/following-sibling::*
//*[text()="Size"]/following-sibling::*
//*[text()="Material"]/following-sibling::*
It works by finding all elements (labels) with the text you want and then looking to the next sibling in the HTML. Without a specific URL to see I can't help any further.
This is a generalised version you can make more specific versions by replacing the asterisks is tag types, and you can navigate differently by replacing the axis following sibling with something else.
I use xPaths in import.io to make APIs for this kind of thing all the time, It's just a matter of finding a xPath that's generic enough to find the HTML no matter where it is on the page, but being specific enough to get the right data.