Using relative xpath to scrape custom div attribute - xpath

I have a few hundred URLs where I'm trying to scrape the image path for an image on a page. Each page is the same format, but the div class is unique to each page.
I want to use IMPORTXML in Google Sheets to scrape just the value of the data-path attribute.
I've tried and failed to use xpath to pull out the URLs.
<div class="uniqueid active" data-path="/~/media/Images/image.jpg" data-alt="Anything"></div>
E.g. //div[@class='*']/@data-path
Example of site: https://www.cannondale.com/en/Australia/Bike/ProductDetail?Id=77d3b8fe-41f7-42b6-bf69-b5cf0ae55548&parentid=undefined

If the div class follows the pattern "uniqueid active", you can try the following XPath:
//div[contains(@class, "active")]/@data-path
Otherwise, if the div class can be anything, use this query:
//div[@class]/@data-path
UPDATE:
I tried to get the values of the data-path attributes with IMPORTXML, but didn't succeed. Doing the same thing in Python (requests and lxml) works, so the problem is probably on the Google Sheets side: some limitation or bug, I don't know which.
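To check the two queries offline, here is a minimal sketch that uses only Python's standard library instead of requests and lxml. It runs against the sample div from the question rather than the live page. ElementTree supports only a subset of XPath: it cannot return attribute nodes and has no contains(), so the attribute is read with .get() and the class test is done in Python. The names SAMPLE, data_paths and active_data_paths are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a root element so it parses as XML.
# The second div is an added control case with no class attribute.
SAMPLE = """
<html><body>
  <div class="uniqueid active" data-path="/~/media/Images/image.jpg" data-alt="Anything"></div>
  <div>no class here</div>
</body></html>
"""


def data_paths(markup):
    """Rough equivalent of //div[@class]/@data-path."""
    root = ET.fromstring(markup.strip())
    # ".//div[@class]" selects every div that has a class attribute;
    # the data-path value is then read from each element.
    return [
        div.get("data-path")
        for div in root.findall(".//div[@class]")
        if div.get("data-path") is not None
    ]


def active_data_paths(markup):
    """Rough equivalent of //div[contains(@class, "active")]/@data-path."""
    root = ET.fromstring(markup.strip())
    # contains() is a plain substring test, so "inactive" would match too;
    # the same behaviour is kept here on purpose.
    return [
        div.get("data-path")
        for div in root.iter("div")
        if "active" in div.get("class", "") and div.get("data-path") is not None
    ]


print(data_paths(SAMPLE))
print(active_data_paths(SAMPLE))
```

If the real page is not well-formed XML, ET.fromstring will raise an error, which is one reason the answer above used lxml's more forgiving HTML parser.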

Related

Get a #text element found in a span with importxml and Gsheet

I am trying to get the value of a #text (number of likes) inside a span from this URL via importXML in Google Spreadsheet using XPath.
I have tried many ways, but none of them work.
Any ideas?
<div class="RANLXG3qKB61Bh33I0r2 NO_VO3MRVl9z3z56d8Lg"><a draggable="false" class="Czg_RoYmXG0FPTHG9Kdb" href="/user/spotify">Spotify</a></div>
<span class="RANLXG3qKB61Bh33I0r2 Hi9FqPX1LNRRPf31tfA8" as="span">150 815 likes</span>
A generic XPath that matches your span tag is:
//span[contains(text(), 'like')]
This matches on the sample URL provided. The same page with, for example, another artist selected serves different HTML, which can result in no match or several matches.
This example was checked against geckodriver.
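As a rough offline check of the same idea, the sketch below applies the "text contains 'like'" test to the sample markup from the question, using only Python's standard library. ElementTree has no contains() function, so the test is written in Python. The helpers like_texts and like_count are hypothetical names, and turning the text into a number is an extra step the question did not ask for.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in one root element.
SAMPLE = """
<div>
  <div class="RANLXG3qKB61Bh33I0r2 NO_VO3MRVl9z3z56d8Lg"><a draggable="false" class="Czg_RoYmXG0FPTHG9Kdb" href="/user/spotify">Spotify</a></div>
  <span class="RANLXG3qKB61Bh33I0r2 Hi9FqPX1LNRRPf31tfA8" as="span">150 815 likes</span>
</div>
"""


def like_texts(markup):
    """Text of every span whose leading text contains 'like'."""
    root = ET.fromstring(markup.strip())
    return [
        span.text
        for span in root.iter("span")
        if span.text and "like" in span.text
    ]


def like_count(text):
    """Keep only the digits, so any kind of thousands separator is dropped."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None


for text in like_texts(SAMPLE):
    print(text, like_count(text))
```

Because the class names look generated, matching on the visible text is likely to be more stable than matching on the class, but as noted above it can still return zero or several spans on other pages.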

"Imported content is empty." error when scraping with ImportXML in GSheets

I need to scrape image source URLs from the web pages linked in a directory and put them into columns of a Google Sheet.
I think using IMPORTXML function would be the easiest solution, but I get the #N/A "Imported content is empty." error every time.
I have tried to use this extension as well to define XPath, but still the same error.
The page's source code, where image source URL is:
<div class="centerer" id="rbt-gallery-img-1">
<i class="spinner">
<span></span>
</i>
<img data-lazy="//i.example.com/01.jpg" border="0"/>
</div>
So I want to get "i.example.com/01.jpg" value to B2, followed by further images' URLs to adjacent cells.
The function I used is:
=IMPORTXML(A2,"//img[@class='centerer']/@data-lazy")
I tried using spinner instead of centerer, with the same result.
You can get the string i.example.com/01.jpg with the following XPath 1.0 expression:
substring-after(//div[@class='centerer']/img/@data-lazy,'//')
If you don't need to remove the leading //, you can simply use
//div[@class='centerer']/img/@data-lazy
So, in the first case, the Google Sheets expression could be
=IMPORTXML(A2,"substring-after(//div[@class='centerer']/img/@data-lazy,'//')")
and in the second it could be
=IMPORTXML(A2,"//div[@class='centerer']/img/@data-lazy")
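Note that the formula in the question tests class='centerer' on the img, while in the posted markup that class sits on the surrounding div, so the XPath itself could not match. The sketch below is a standard-library Python check of the corrected path against the posted snippet, not against the real page. lazy_urls and its strip_leading_slashes parameter are made-up names, and str.partition stands in for XPath's substring-after (both return an empty string when '//' is absent).

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in <body> so the div is a descendant of the root.
SAMPLE = """
<body>
  <div class="centerer" id="rbt-gallery-img-1">
    <i class="spinner">
      <span></span>
    </i>
    <img data-lazy="//i.example.com/01.jpg" border="0"/>
  </div>
</body>
"""


def lazy_urls(markup, strip_leading_slashes=True):
    """Rough equivalent of //div[@class='centerer']/img/@data-lazy."""
    root = ET.fromstring(markup.strip())
    urls = [
        img.get("data-lazy")
        for img in root.findall(".//div[@class='centerer']/img")
        if img.get("data-lazy") is not None
    ]
    if strip_leading_slashes:
        # Same effect as substring-after(value, '//'): keep what follows the first '//'.
        urls = [url.partition("//")[2] for url in urls]
    return urls


print(lazy_urls(SAMPLE))
print(lazy_urls(SAMPLE, strip_leading_slashes=False))
```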

XPath formula for an uncommon second URL attribute in <img> element

I am having difficulty finding the correct XPath to scrape the real URL of any image in my Scoop.it topic. Here is a code excerpt centered on one image. Other images are treated the same way.
<div class="thisistherealimage" >
<img id="Here a specific image ID" width="467" height="412"
class="postDisplayedImage lazy"
src="/resources/img/white.gif"
data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
alt="Here an alternative text" style="width:467; height: 412;" />
So, in this code sample, I don't want to scrape "/resources/img/white.gif" but the URL in the "data-original" attribute!
I'd like to capture the data-original attribute value whatever it contains, not only when it holds a URL.
As an XPath beginner, I've tried //div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage')][contains(@class,'lazy')]
But that isn't specific to the data-original attribute, is it?
Any advice?
If you want the data-original, you can access like this:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]/@data-original
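The sketch below shows the difference between src and data-original on a closed-off copy of the excerpt from the question. It uses only Python's standard library. Since ElementTree has no contains(), the class checks are plain substring tests in Python, which is what contains() does. original_urls is a hypothetical helper name, and the closing </div> plus the wrapping <body> are added so the snippet parses.

```python
import xml.etree.ElementTree as ET

# Excerpt from the question, closed and wrapped so it is well-formed XML.
SAMPLE = """
<body>
  <div class="thisistherealimage">
    <img id="Here a specific image ID" width="467" height="412"
         class="postDisplayedImage lazy"
         src="/resources/img/white.gif"
         data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
         alt="Here an alternative text" style="width:467; height: 412;" />
  </div>
</body>
"""


def original_urls(markup):
    """Collect data-original from matching img children of matching divs."""
    root = ET.fromstring(markup.strip())
    found = []
    for div in root.iter("div"):
        if "thisistherealimage" not in div.get("class", ""):
            continue
        for img in div.findall("img"):  # direct children only, like div/img
            classes = img.get("class", "")
            if "postDisplayedImage" in classes and "lazy" in classes:
                value = img.get("data-original")
                if value is not None:
                    found.append(value)
    return found


print(original_urls(SAMPLE))
```

The placeholder in src is never read here; only the attribute named at the end of the path is returned, which is the point of appending /@data-original.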

Unable to extract the data through xpath or css selector

When I use Inspect Element or View Page Source, the required data is present on the page, but when I try to extract it with XPath or CSS selectors, I get an empty list. I even tried extracting all the nodes and their content, but the data shown in View Page Source is still not extracted. What could be the reason?
Below is the example code.
I need to extract the href value from the <a> tag.
<div class="url-link">
<a data-id="abc" class="abc xyz" data-is-avod="" href="/ab/extract/xyz/3&t=25">Title</a>
<span>title</span>
</div>
I used the XPath response.xpath('//div/a/@href').extract(), but I am unable to extract the desired content.
After some analysis, I found that Inspect Element and View Page Source show this <a> tag only when I am logged in to the website; otherwise it is not there. So I think that to get the @href value I need to submit the login form with my credentials, but I don't know how to submit a form or how to find out the details of the form.
Please help.

XPATH - Ignore 1 page element, grab the rest

I can't seem to figure this out due to the square brackets issue. I have an html page full of h3 tags with hrefs that I need to grab but one has a class on that I don't want. Example:
I want the hrefs of all the h3 tags except this one:
<h3 class="leave_this">Leave me alone!
To grab all the hrefs I am using this:
//h3/a/@href
Tried a few variations but no luck.
REMOVED CONFUSING EXAMPLE, APOLOGIES
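No answer is recorded here. One possible approach, offered as an assumption rather than something from the thread, is to negate the class test in a predicate: //h3[not(@class='leave_this')]/a/@href. The sketch below mirrors that logic with Python's standard library on made-up sample markup. The links and the helper name hrefs_except are invented for illustration, and the class is compared as a whitespace-separated token rather than as the whole attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: three h3 headings, one of which must be skipped.
SAMPLE = """
<body>
  <h3><a href="/first">First</a></h3>
  <h3 class="leave_this"><a href="/skip">Leave me alone!</a></h3>
  <h3 class="other"><a href="/second">Second</a></h3>
</body>
"""


def hrefs_except(markup, excluded_class="leave_this"):
    """hrefs of <a> children of every h3 that lacks the excluded class."""
    root = ET.fromstring(markup.strip())
    hrefs = []
    for h3 in root.iter("h3"):
        # Skip the heading when the excluded class is one of its class tokens.
        if excluded_class in h3.get("class", "").split():
            continue
        hrefs.extend(
            a.get("href") for a in h3.findall("a") if a.get("href") is not None
        )
    return hrefs


print(hrefs_except(SAMPLE))
```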
