How does HtmlAgilityPack extract text from an HTML node whose class attribute is appended dynamically?

Dear friends, I want to extract the text 平均3.6 星 ("average 3.6 stars") from this code segment, excerpted from amazon.cn:
<div class="content"><ul>
<li><b>用户评分:</b>
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" ref="dp_db_cm_cr_acr_pop_" name="B004GUSIKO">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
My question is that the span's class value "s_star_3_5 " varies with each customer's rating level and is appended dynamically. So I tried doc.DocumentNode.SelectSingleNode("//span[@class='swSprite']").InnerText and //span[@class='swSprite s_star_3_5 '], but the result is either an error or not what I want!
Any suggestions?

First of all, I suggest saving the value of doc.DocumentNode.OuterHtml to a local .html file and checking whether the HTML you're obtaining is the HTML you expect. The thing is that sometimes you start parsing a website with HtmlAgilityPack, and the very first problem is that you're not getting valid HTML at all: maybe you're getting a 404 error, a redirection, etc. (see the sketch after the links below).
I'm suggesting this because I tested //span[@class='swSprite s_star_3_5 '] and it worked correctly.
That was the issue in the following questions:
Selecting nodes that have an attribute with spaces using HTMLAgilityPack
XPath Query Problem using HTML Agility Pack
If that doesn't help, post the HTML code and I'll help you ;)
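Here is a minimal sketch of that first check, assuming the page is fetched with HtmlWeb (the URL below is just a placeholder built from the ASIN shown in the question; substitute the page you are actually scraping):
using System;
using System.IO;
using HtmlAgilityPack;

class DumpHtml
{
    static void Main()
    {
        // Placeholder URL - replace it with the real product page.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://www.amazon.cn/dp/B004GUSIKO");

        // Save exactly what HtmlAgilityPack received, so you can open it in a
        // browser or editor and confirm it is the page you expected rather
        // than a 404, a redirect, or a bot-detection page.
        File.WriteAllText("dump.html", doc.DocumentNode.OuterHtml);
        Console.WriteLine("Saved " + doc.DocumentNode.OuterHtml.Length + " characters to dump.html");
    }
}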

This works for me:
// using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtml); // myHtml is the path of (or a stream over) the saved HTML file
// Match the span even though the second class token (s_star_3_5, s_star_4_0, ...) varies
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'swSprite')]");
Console.WriteLine("Text=" + node.InnerText.Trim());
and outputs
平均3.6 星
Note that I use the XPath starts-with function.
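If the swSprite token could ever appear somewhere other than the start of the class attribute, a slightly more defensive alternative (a general XPath 1.0 idiom, not something specific to Amazon's markup) is the whitespace-padded contains() test:
//span[contains(concat(' ', normalize-space(@class), ' '), ' swSprite ')]
This matches the swSprite class token wherever it sits in the class list, without also matching classes that merely start with swSprite.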

Related

"Imported content is empty." error when scraping with ImportXML in GSheets

I need to scrape image source URLs from a directory's linked web pages into columns of a Google Sheet.
I think using the IMPORTXML function would be the easiest solution, but I get the #N/A "Imported content is empty." error every time.
I have tried using an extension to define the XPath as well, but I still get the same error.
The page's source code, where the image source URL is:
<div class="centerer" id="rbt-gallery-img-1">
<i class="spinner">
<span></span>
</i>
<img data-lazy="//i.example.com/01.jpg" border="0"/>
</div>
So I want to get the value "i.example.com/01.jpg" into B2, followed by the other images' URLs in adjacent cells.
The function I used is:
=IMPORTXML(A2,"//img[@class='centerer']/@data-lazy")
I tried using spinner instead of centerer, with the same result.
You can get the string i.example.com/01.jpg with the following XPath-1.0 expression:
substring-after(//div[@class='centerer']/img/@data-lazy,'//')
If you don't need to remove the leading //, you can simply use
//div[@class='centerer']/img/@data-lazy
So, in the first case, the Google-Sheets expression could be
=IMPORTXML(A2,"substring-after(//div[@class='centerer']/img/@data-lazy,'//')")
and in the second it could be
=IMPORTXML(A2,"//div[@class='centerer']/img/@data-lazy")
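If the formula still returns "Imported content is empty.", it helps to check whether the XPath is wrong or whether Google's fetcher simply never sees the <img> (IMPORTXML reads the raw HTML and does not run JavaScript, and data-lazy suggests lazy loading). A quick local sanity check, sketched with HtmlAgilityPack only because that is the library used elsewhere on this page (the file name is a placeholder):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("page.html"); // save the page's raw source to this file first

// Same XPath as the IMPORTXML formula, minus the attribute step.
var img = doc.DocumentNode.SelectSingleNode("//div[@class='centerer']/img");
if (img == null)
    Console.WriteLine("No match - the element is probably not in the raw HTML.");
else
    Console.WriteLine(img.GetAttributeValue("data-lazy", ""));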

Extracting links (getting href values) with certain text using XPath under a div tag with a certain class

Dear SO contributors, I am fully aware of the following question, How to obtain href values from a div using xpath?, which deals with one part of my problem, yet for some reason the solution posted there does not work in my case, so I would kindly ask for help in resolving two related issues. In the example below, I would like to get the href value of the hyperlink in "More here." (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with the content class.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More <a href="http://www.thestraddler.com/201715/piece2.php">here</a>.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this and (2nd part) how can I extract the href value of only the "here" hyperlink?
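As for why the first command returns nothing, the same advice as in the first answer above may apply: check that the HTML you actually download contains those <a> elements at all. For the second part, the usual XPath 1.0 approach is to add a predicate on the link text, e.g. //div[@class="content"]//a[.="here"]/@href. A small sketch of that selection, written with HtmlAgilityPack only for consistency with the rest of this page (with xidel the same expression should work as the -e argument):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("saved-page.html"); // placeholder file holding the page source shown above

// Only the <a> whose text is exactly "here", then read its href.
var link = doc.DocumentNode.SelectSingleNode("//div[@class='content']//a[.='here']");
Console.WriteLine(link?.GetAttributeValue("href", ""));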

Unable to extract the data through xpath or css selector

When I do inspect element or view source, the required data is available on the page, but when I extract it using XPath or CSS selectors I get an empty list. I even tried extracting all the nodes and their content, but the data that is shown in View page source is still not extracted. What could be the reason?
Below is the example code:
I need to extract the href value from the <a> tag.
<div class="url-link">
<a data-id="abc" class="abc xyz" data-is-avod="" href="/ab/extract/xyz/3&t=25">Title</a>
<span>title</span>
</div>
I used the XPath response.xpath('//div/a/@href').extract(), but I am unable to extract the desired content.
I have analyzed this and found that the <a> tag only shows up in inspect element or View page source when I am logged in to the website; otherwise it does not appear. So I think that to get the @href value I need to submit the form with my login information, but I don't know how to submit a form or how to get the details of the form.
Please help.
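The general flow being asked about is: load the login page, read the form's field names from its HTML, POST those fields back, and then request the protected page with the same session cookies. A rough sketch of that flow, written with HttpClient rather than Scrapy (which the question actually uses) only to stay consistent with the rest of this page; every URL and field name below is a made-up placeholder you would replace with the site's real ones:
using System;
using System.Net.Http;
using System.Collections.Generic;
using System.Threading.Tasks;

class LoginSketch
{
    static async Task Main()
    {
        // The handler's cookie container keeps the session cookie from the
        // login POST and reuses it for the next request.
        var handler = new HttpClientHandler { UseCookies = true };
        using var client = new HttpClient(handler);

        // Placeholder URL and field names - take the real ones from the login
        // <form> (its action attribute and its <input name="..."> fields).
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["username"] = "me@example.com",
            ["password"] = "secret",
        });
        await client.PostAsync("https://example.com/login", form);

        // Now fetch the page that only shows the <a> tag when logged in.
        string html = await client.GetStringAsync("https://example.com/ab/page");
        Console.WriteLine(html.Contains("url-link") ? "logged-in content found" : "still anonymous");
    }
}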

XPath for text containing bold text

I am trying to click on the link whose site is www.qualtrapharma.com by searching Google for
"qualtra", but there is a problem writing the XPath because the <cite> tag contains a <b> tag inside it. Can anyone suggest how to do this?
<div class="f kv" style="white-space:nowrap">
<cite class="vurls">
www.
<b>qualtra</b>
pharma.com/
</cite>
</div>
You may overcome this by using '.' in the XPath, which evaluates to the string value of the current node, i.e. the concatenated text of the node and all of its descendants (the nested <b> included).
The XPath would look like the following:
//cite[.='www.qualtrapharma.com/']
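To see that the comparison really ignores the nested <b>, you can run the expression over the markup above (written here without the pretty-printing whitespace, since '.' compares the exact string value); this is sketched with HtmlAgilityPack for consistency with the rest of this page, and in Selenium the same expression would go into By.XPath:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<cite class=""vurls"">www.<b>qualtra</b>pharma.com/</cite>");

// '.' compares the element's string value, so the <b> split is invisible to it.
var cite = doc.DocumentNode.SelectSingleNode("//cite[.='www.qualtrapharma.com/']");
Console.WriteLine(cite != null ? cite.InnerText : "no match"); // prints www.qualtrapharma.com/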

HtmlUnit - getTextContent()

I'm working with HtmlUnit. I need to get the text content of an HtmlAnchor, but only its own text, not the text from the other HTML tags it contains.
<a class="subjectPrice" href="http://www.terra.es/?ca=28_s&st=a&c=4" title="Opel Zafira Tourer 2.0 Cdti 165 Cv Excellence 5p. -12">
<span class="old_price">32.679€</span>
24.395€
If I execute htmlAnchor.getTextContent() it returns 32.679€ 24.395€, but I only need 24.395€.
Can anybody help me? Thanks.
Just use XPath to get the appropriate DomText node. It seems that ./text(), evaluated relative to the HtmlAnchor, should be enough.
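For illustration, here is the same idea sketched with HtmlAgilityPack, only because that is the library used elsewhere on this page; in HtmlUnit itself the analogous call would be the anchor's getByXPath("./text()"), which should return its DomText children. The text() step selects only the anchor's own text nodes and skips the nested <span>, leaving just the 24.395€ figure:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<a class=""subjectPrice"" href=""#""><span class=""old_price"">32.679€</span>24.395€</a>");

// text() relative to the anchor: its direct text nodes only,
// not the text inside the nested <span class="old_price">.
var textNodes = doc.DocumentNode.SelectNodes("//a[@class='subjectPrice']/text()");
foreach (var t in textNodes)
    Console.WriteLine(t.InnerText.Trim()); // prints 24.395€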
