"Imported content is empty." error when scraping with ImportXML in GSheets - xpath

I need to scrape image source URLs from the web pages linked in a directory into columns of a Google Sheet.
I think the IMPORTXML function would be the easiest solution, but I get the #N/A "Imported content is empty." error every time.
I have also tried using an extension to define the XPath, but I still get the same error.
The part of the page's source code that contains the image source URL is:
<div class="centerer" id="rbt-gallery-img-1">
<i class="spinner">
<span></span>
</i>
<img data-lazy="//i.example.com/01.jpg" border="0"/>
</div>
So I want the value "i.example.com/01.jpg" in B2, followed by further images' URLs in the adjacent cells.
The function I used is:
=IMPORTXML(A2,"//img[@class='centerer']/@data-lazy")
I tried using spinner instead of centerer, with the same result.

Your expression fails because the class centerer belongs to the enclosing div, not to the img element, so //img[@class='centerer'] selects nothing.
You can get the string i.example.com/01.jpg with the following XPath 1.0 expression:
substring-after(//div[@class='centerer']/img/@data-lazy,'//')
If you don't need to remove the leading //, you can simply use
//div[@class='centerer']/img/@data-lazy
So, in the first case, the Google Sheets expression could be
=IMPORTXML(A2,"substring-after(//div[@class='centerer']/img/@data-lazy,'//')")
and in the second it could be
=IMPORTXML(A2,"//div[@class='centerer']/img/@data-lazy")
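To check the path offline before pasting it into a sheet, a minimal Python sketch like the one below may help. It uses only the standard library and the sample markup from the question. The names SNIPPET and lazy_image_urls are made up for illustration. ElementTree supports only a subset of XPath (no attribute steps, no substring-after), so the attribute is read with .get() and the leading // is cut with str.partition, which behaves like substring-after here.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question; it is well-formed, so the XML parser accepts it.
SNIPPET = """
<div class="centerer" id="rbt-gallery-img-1">
<i class="spinner">
<span></span>
</i>
<img data-lazy="//i.example.com/01.jpg" border="0"/>
</div>
"""


def lazy_image_urls(markup: str) -> list[str]:
    """Return the data-lazy value of each <img> child of div.centerer, minus the leading '//'."""
    # Wrap the fragment so the parser always sees a single root element.
    root = ET.fromstring(f"<root>{markup}</root>")
    urls = []
    for img in root.findall(".//div[@class='centerer']/img"):
        value = img.get("data-lazy")
        if value is not None:
            # partition("//")[2] is the text after the first '//',
            # or '' when '//' is absent, like XPath substring-after.
            urls.append(value.partition("//")[2])
    return urls


print(lazy_image_urls(SNIPPET))
```

This only confirms that the path matches the markup shown above; it says nothing about what IMPORTXML actually receives when it fetches the live page.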

Related

Get a #text element found in a span with importxml and Gsheet

I am trying to get the value of a #text node (the number of likes) inside a span from this URL via IMPORTXML in Google Sheets using XPath.
I have tried many approaches, but none of them work.
Any ideas?
<div class="RANLXG3qKB61Bh33I0r2 NO_VO3MRVl9z3z56d8Lg"><a draggable="false" class="Czg_RoYmXG0FPTHG9Kdb" href="/user/spotify">Spotify</a></div>
<span class="RANLXG3qKB61Bh33I0r2 Hi9FqPX1LNRRPf31tfA8" as="span">150 815 likes</span>
A generic XPath that matches your span tag is:
//span[contains(text(), 'like')]
This matches the sample URL provided. The same page with, for example, another artist selected has different HTML, so the expression may then match nothing or match several elements.
This example was checked against geckodriver.
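The same "text contains 'like'" test can be tried offline against the two lines of markup from the question. The sketch below is only an approximation: ElementTree's XPath subset has no contains(), so the filter is written in plain Python instead, and the names SNIPPET and spans_containing are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two elements quoted in the question.
SNIPPET = """
<div class="RANLXG3qKB61Bh33I0r2 NO_VO3MRVl9z3z56d8Lg"><a draggable="false" class="Czg_RoYmXG0FPTHG9Kdb" href="/user/spotify">Spotify</a></div>
<span class="RANLXG3qKB61Bh33I0r2 Hi9FqPX1LNRRPf31tfA8" as="span">150 815 likes</span>
"""


def spans_containing(markup: str, needle: str) -> list[str]:
    """Return the leading text of every <span> whose first text node contains needle."""
    # Wrap the fragment so the parser sees a single root element.
    root = ET.fromstring(f"<root>{markup}</root>")
    return [
        span.text
        for span in root.iter("span")
        if span.text and needle in span.text
    ]


print(spans_containing(SNIPPET, "like"))
```

Because the filter returns a list, it also makes the "no match or several matches" case visible: an empty list or more than one entry means the expression is not specific enough for that page.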

Extracting links (get href values) with certain text with Xpath under a div tag with certain class

I am aware of the question "How to obtain href values from a div using xpath?", which deals with one part of my problem, yet for some reason the solution posted there does not work in my case, so I would like help in resolving two related issues. In the example below, I would like to get the href value of the "more" hyperlink (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with class content.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More here.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this and (2nd part) how can I extract the href value of only the "here" hyperlink?

Thymeleaf parse text and execute in-text expressions

I have a text string that contains links, for example <a th:href="'someLink'">Download</a>.
I need to process that text and replace th:href="'someLink'" with the correct links, so that the text is shown with a working Download link.
The text with links is stored in the variable textThatContainsLinks.
My code to show the text is <div th:utext="${textThatContainsLinks}">. I also tried preprocessing, as in <div th:utext="${__textThatContainsLinks__}">.
Currently this code does not show the links as I expected: they are not preprocessed, i.e. the output is still <a th:href="'someLink'">Download</a>.
How can I pre-process the expressions in the text before showing it?
Thank you very much!
Take the context path and attach it directly to the relative path in a plain HTML5 attribute, e.g. LINK, <img src="/contextPath/relative/path/image.jpg" width="50" height="50" alt="logo"/>.
Notice how simple access to the resource is: /contextPath/relativePath, so the most important part is the relative path. This is similar to the question "Thymeleaf was unable to render <img> tag when sent from database table". I observed that once Thymeleaf's th: namespace qualifies an href or src attribute that resides inside a text/String, the absolute path is not properly resolved.

Unable to extract the data through xpath or css selector

When I inspect the element or view the page source, the required data is available on the page, but when I extract it using XPath or CSS, I get an empty list. I even tried to extract all the nodes and their content, but the data shown in View page source is still not extracted. What could be the reason?
Below is the example code:
I need to extract the href value from the <a> tag.
<div class="url-link">
<a data-id="abc" class="abc xyz" data-is-avod="" href="/ab/extract/xyz/3&t=25">Title</a>
<span>title</span>
</div>
I used the XPath response.xpath('//div/a/@href').extract(), but I am unable to extract the desired content.
I analyzed this and found that inspect element and View page source show this <a> tag only when I am logged in to the website; otherwise it is not shown. So I think that to get the @href text I need to submit the form with the login information, but I don't know how to submit a form or how to get the details of the form.
Please help.

How do HtmlAgilityPack extract text from html node whose class attribute appended dynamically

Dear friends, I want to extract the text 平均3.6 星 from this code segment excerpted from amazon.cn.
<div class="content"><ul>
<li><b>用户评分:</b>
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" ref="dp_db_cm_cr_acr_pop_" name="B004GUSIKO">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
My problem is that the span's class value "s_star_3_5 " varies with the customer rating level and is appended dynamically. I attempted to use doc.DocumentNode.SelectSingleNode("//span[@class='swSprite']").InnerText or //span[@class='swSprite s_star_3_5 '], but the result is an error or not what I want!
Any suggestions?
First of all, I suggest saving the value of doc.DocumentNode.OuterHtml to a local .html file to see whether the code you're obtaining is really that code. Sometimes you start parsing a website using HtmlAgilityPack, but the very first problem is that you're not getting the expected HTML at all. Maybe you're getting a 404 error, or a redirection, etc.
I'm suggesting this because I tested //span[@class='swSprite s_star_3_5 '] and it worked correctly.
That was the issue in the following questions:
Selecting nodes that have an attribute with spaces using HTMLAgilityPack
XPath Query Problem using HTML Agility Pack
If that doesn't help, post the HTML code and I'll help you ;)
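This works for me:
using System;
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtml); // myHtml: e.g. the path of the saved HTML file
// starts-with() matches the fixed "swSprite" prefix and ignores the rating-specific class after it
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'swSprite')]");
Console.WriteLine("Text=" + node.InnerText.Trim());
and outputs
Text=平均3.6 星
Note that I use the XPath starts-with function.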
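For readers without a .NET setup, the prefix-matching idea can be sketched in Python with the standard library. This is an illustration only, not the HtmlAgilityPack API: SNIPPET is a shortened, closed-up version of the markup from the question so that it parses as XML, rating_text is a hypothetical helper, and the starts-with test is done with str.startswith because ElementTree's XPath subset does not provide that function.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Shortened, well-formed version of the markup from the question (closing tags added).
SNIPPET = """
<span class="crAvgStars">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
</span>
"""


def rating_text(markup: str) -> Optional[str]:
    """Return the text of the first <span> whose class attribute starts with 'swSprite'."""
    root = ET.fromstring(markup.strip())
    for span in root.iter("span"):
        # Same idea as XPath starts-with(@class, 'swSprite').
        if span.get("class", "").startswith("swSprite"):
            # Join all nested text, like InnerText, and trim the whitespace.
            return "".join(span.itertext()).strip()
    return None


print(rating_text(SNIPPET))
```

A prefix test depends on swSprite being the first class in the attribute. If the order could change, testing "swSprite" in span.get("class", "").split() would be a more tolerant variant.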
