xpath extract URL - Scrapy

I am trying to scrape the following website: https://bionetz.ch/adressen/detailhandel/bio-fachgeschaefte.html
At the end of my scraper, I would like to add a loop that automatically goes to the next page.
I know there is a "show all" button, which I used for my current solution. However, when exploring this website, I wasn't able to extract the href attribute of the next-page link.
The href attribute I need should be in the following "li":
<a title="Weiter" href="/adressen/detailhandel/bio-fachgeschaefte/page2.html" class="pagenav"><span class="fa fa-angle-right"></span></a>
However, I wasn't able to get it. What would be the XPath to extract it?

You can use the Scrapy shell for debugging: https://docs.scrapy.org/en/latest/topics/debug.html
scrapy shell https://bionetz.ch/adressen/detailhandel/bio-fachgeschaefte.html
Then extract the URL of the next page:
>>> response.xpath("//a[@title='Weiter']/@href").get()
'/adressen/detailhandel/bio-fachgeschaefte/page2.html'
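The extracted href is relative, so a loop over pages has to join it with the page URL before requesting it. Here is a minimal sketch using only the Python standard library. The link from the question is wrapped in an assumed <li>, and the names PAGE_URL, SNIPPET and next_page_url are made up for illustration. ElementTree supports only a subset of XPath, so the @href step is done with .get() instead:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "https://bionetz.ch/adressen/detailhandel/bio-fachgeschaefte.html"

# The pagination link from the question, wrapped in an <li> (assumed markup).
SNIPPET = (
    '<li><a title="Weiter" '
    'href="/adressen/detailhandel/bio-fachgeschaefte/page2.html" '
    'class="pagenav"><span class="fa fa-angle-right"></span></a></li>'
)


def next_page_url(html_fragment, base_url):
    """Return the absolute URL of the 'Weiter' link, or None if it is missing."""
    root = ET.fromstring(html_fragment)
    link = root.find(".//a[@title='Weiter']")
    if link is None:
        return None  # last page: nothing left to follow
    return urljoin(base_url, link.get("href"))


next_url = next_page_url(SNIPPET, PAGE_URL)
print(next_url)
```

Inside a Scrapy spider, response.urljoin(href) or response.follow(href, callback=self.parse) does the joining for you, and a missing link (None) is the natural stop condition.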

Related

Import XML of span element fails with #N/A

I finally decided to sign up to Stack Overflow because of this, so I'd be very grateful for a solution!
I'm trying to get a number from a <span> element. The data box I'm trying to scrape is on this page: https://de.marketscreener.com/kurs/aktie/SNOWFLAKE-INC-112440376/analystenerwartungen/
The relevant XPath is //*[@id="highcharts-0oywbsk-200"]/div[2]/div/span/span
I'm trying: =IMPORTXML("https://de.marketscreener.com/kurs/aktie/SNOWFLAKE-INC-112440376/analystenerwartungen/","//div[2]/div/span/span")
I'm leaving out the @id predicate because the id changes on every page. This works pretty well for many elements on the same page, but not at all in this case. Is this OK?
Google Sheets always gives me a #N/A error. Any idea how to scrape that number?

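Disabling JavaScript in your browser reveals what IMPORTXML can actually scrape. The id in your XPath starts with "highcharts-", which suggests the box is drawn by JavaScript, and IMPORTXML does not execute JavaScript.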

Using Google Sheets for web scraping. Need the correct xpath for IMPORTXML function

I have a Google Sheet containing a list of MPNs (manufacturer part numbers). I'm trying to scrape a site called wikiarms for the UPC codes when I have the MPN for an item.
I have the correct formula for doing this on another site:
=IMPORTXML("http://gun.deals/search/apachesolr_search/"&B1,"//dd/a[../../dt[contains(text(),'UPC')]]|//dd/span[../../dt[contains(text(),'UPC')]]")
I'm trying to figure out the correct XPath to complete this formula. Some videos I have watched said to open the page in Chrome, use the inspector to select the element, and copy its XPath into the IMPORTXML function. I tried this with no luck.
Sample
Visit https://www.wikiarms.com/guns?q=20071
In the table there is a button "available in 6 stores"; click it to reveal the list. The UPC should be listed after the MPN.
If I copy the XPath in Chrome, this is the result:
/html/body/div[1]/div/div/div[2]/div/div/div[2]/div[2]/table/tbody/tr[2]/td[5]
=IMPORTXML("https://www.wikiarms.com/guns?q="&B2,"xpath here")
What do I have to add at the end of this formula to pull in the UPC code? I will be using this formula to pull in UPC code for about 1000 items.
Thank you for your help.

Using your sample link, try
=IMPORTXML("https://www.wikiarms.com/guns?q=20071","//td[@class='upc']/a/@title")
and see if it works for you.

Confused about scrapy and Xpath

I am trying to scrape some data from the following website: https://xrpcharts.ripple.com/
The data I am interested in is Total XRP, which you can see immediately below or beside the circle diagram (depending on your browser). I first inspected the element I am interested in and saw that it is inside <div class="stat">, in <span ng-bind="totalXRP | number:2" class="ng-binding">99,993,056,930.18</span>.
The number 99,993,056,930.18 is what I am interested in.
So I started in a scrapy shell and wrote:
fetch("https://xrpcharts.ripple.com")
I then used Chrome to copy the XPath by right-clicking on that element in the HTML; the result Chrome gave me was:
/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span
Then I used the Xpath command to extract the text:
response.xpath('/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span/text()').extract()
but this gave me an empty list []. I really do not understand what I am doing wrong here. I think I am making an obvious mistake but I don't see it. Thanks in advance!

The bottom line is: you cannot expect the page you see in the browser to be the same page Scrapy would download and have available to work with. Scrapy is not a browser.
This page is quite dynamic and complex and is constructed with the help of multiple asynchronous requests bringing in both the logic and the data. There is also JavaScript executed in the browser that plays an important role in forming and supporting the HTML document object tree.
Scrapy does not have any of these things; what you get when you call fetch() is just the initial "bare bones" HTML page without the "dynamic content".

Can't get the data using importXML from Dynamic Web Page?

The website is : https://www.futbin.com/18/player/2600/Ayhan/
I inspected the element and got the XPath, which is: //*[@id="ps-lowest-1"]
Then I use:
=IMPORTXML("https://www.futbin.com/18/player/2600/Ayhan/","//*[@id='ps-lowest-1']")
to get the data, which should be 2000.
But instead it only shows "-" on the sheet. There are no errors; it just doesn't show the data I want. Is there any way to get the data that I need?
Thanks

The Sheets command importXML reads only the HTML source of the page without executing any JavaScript on it. As you can see yourself by using "view source" in the browser, the source indeed has "-" in that span:
<span class="price_big_right">
<span id="ps-lowest-1">-</span>
</span>
The actual numbers are loaded by some JavaScript file which then inserts them in that span. Neither importXML nor other Sheets functions can retrieve dynamically inserted data.
Sometimes, after inspecting the JS files, one can uncover the URL of the data source and try to import that instead; but this is a tedious reverse-engineering exercise for each particular site.
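One way to check what importXML will see is to run the same XPath against the served source rather than the rendered page. Here is a minimal sketch using Python's standard library, with the snippet quoted above standing in for the served HTML (SERVED_HTML and lowest_price are illustrative names, not anything from the site):

```python
import xml.etree.ElementTree as ET

# The markup as it appears in "view source", before any JavaScript has run.
SERVED_HTML = """
<span class="price_big_right">
<span id="ps-lowest-1">-</span>
</span>
""".strip()

root = ET.fromstring(SERVED_HTML)

# Same predicate as in the IMPORTXML formula: any element with this id.
node = root.find(".//*[@id='ps-lowest-1']")
lowest_price = node.text

print(repr(lowest_price))  # the placeholder, not the price shown in the browser
```

If the value printed here is only a placeholder, the XPath is fine and the data simply is not in the source that importXML reads.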

My importXML + xPath + Soundcloud playlist Not Working

So, how can I use importXML in Google Docs Spreadsheet together with XPath to copy the track titles of a Soundcloud playlist?
Today I'm looking for a way to copy the titles of the tracks inside a Soundcloud playlist/compilation.
I searched around and discovered the importXML function offered by Google Docs Spreadsheet. While digging further into importXML, I discovered XPath.
Great combination! I thought.
So I quickly got my hands on the tools and tested them, and it worked great! I extracted some data, so I decided I was ready to use the approach with Soundcloud.
But when I tried the formula, I got an error saying Import Internal Error
The syntax is:
=IMPORTXML(A1,"//div[@class='sc-media-content']/a[@title]")
<div class="sc-media-content"> is the div that holds the Track Title of the song, that is enclosed within an anchor tag with title attribute.
Here is the HTML block for it:
<div class="sc-media-content">
<a class="soundTitle__title sc-link-dark sc-truncate " href="/seven-lions/velvetine-the-great-divide?in=thedubstepgod/sets/melodic-dubstep-chillstep" title="Velvetine - The Great Divide (Seven Lions Remix)">
Velvetine - The Great Divide (Seven Lions Remix)
</a>
What I'm trying to extract is the Velvetine - The Great Divide (Seven Lions Remix). A1 is the cell where the Soundcloud Playlist Link is pasted into.
Other syntax I've tried
I've tried other syntax too, like:
=IMPORTXML(A1,"//div[@class='sc-media-content']/title")
As suggested here
=IMPORTXML(A1,"//div[@class='sc-media-content']/@title") From an answer found here, though it wasn't an accepted answer.
So what am I doing wrong? How can I copy those Soundcloud playlist titles to my Google Docs Spreadsheet using XPath?
UPDATE
Based on the answer given by TGH, this should work: //div[@class='sc-media-content']/a/text().
But the problem is that the div block we're looking for is not in the source code. I did a view source on the playlist's page and the div block is not there. All I'm seeing is JavaScript, so JS is loading the div blocks/classes.
So another question might be needed to help solve this one:
How to use XPath with JavaScript-loaded HTML elements?

Try the following:
//div[@class='sc-media-content']/a/text()
Or, if you want to grab it from the title attribute, do this:
//div[@class='sc-media-content']/a/@title
I tested it here and it seems to work. I pasted your HTML, but had to close the div manually.
http://www.unit-testing.net/Xpath
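The same check can be reproduced offline. This is a minimal sketch with Python's standard library, using the HTML block from the question with the div closed by hand and an assumed <body> wrapper. ElementTree supports only a subset of XPath, so the final text() and @title steps are done in Python rather than in the path itself:

```python
import xml.etree.ElementTree as ET

# HTML from the question; the closing </div> and the <body> wrapper are added here.
HTML = """
<body>
<div class="sc-media-content">
<a class="soundTitle__title sc-link-dark sc-truncate " href="/seven-lions/velvetine-the-great-divide?in=thedubstepgod/sets/melodic-dubstep-chillstep" title="Velvetine - The Great Divide (Seven Lions Remix)">
Velvetine - The Great Divide (Seven Lions Remix)
</a>
</div>
</body>
""".strip()

root = ET.fromstring(HTML)
links = root.findall(".//div[@class='sc-media-content']/a")

# Equivalent of .../a/text(): the link text, with surrounding newlines stripped.
titles_from_text = [a.text.strip() for a in links]

# Equivalent of .../a/@title: the value of the title attribute.
titles_from_attr = [a.get("title") for a in links]

print(titles_from_text)
print(titles_from_attr)
```

This only confirms that the expressions match the markup as pasted; it does not help when the div is absent from the page source, as described in the question's update.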
