XPath formula for an uncommon second URL attribute in <img> element - xpath

I'm having difficulty getting the correct XPath to scrape the real URL of the images in my Scoop.it topic. Here is a code excerpt centered on one image; the other images are marked up the same way.
<div class="thisistherealimage" >
<img id="Here a specific image ID" width="467" height="412"
class="postDisplayedImage lazy"
src="/resources/img/white.gif"
data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
alt="Here an alternative text" style="width:467; height: 412;" />
So, in this code sample, I don't want to scrape "/resources/img/white.gif" but the URL held in the "data-original" attribute!
I'd like to capture the value of the data-original attribute itself, not just test whether it contains a URL.
As an XPath beginner, I've tried //div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage')][contains(@class,'lazy')]
But that isn't specific to the data-original attribute, is it?
Any advice?

If you want the data-original value, select the attribute at the end of the path:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]/@data-original
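The expression above needs a full XPath 1.0 engine. As a rough, standard-library-only sketch of the same idea in Python: `xml.etree.ElementTree` supports only a subset of XPath (no `contains()` and no attribute selection), so the class test and the attribute read are done in Python instead. The `PAGE` string and the `data_original_urls` helper are hypothetical names for illustration, and the markup is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Markup adapted from the question and wrapped in <body>, with the closing
# </div> added, so that it parses as well-formed XML (an assumption; real
# pages usually need an HTML-tolerant parser).
PAGE = """
<body>
  <div class="thisistherealimage">
    <img id="image-1" width="467" height="412"
         class="postDisplayedImage lazy"
         src="/resources/img/white.gif"
         data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
         alt="Here an alternative text" />
  </div>
</body>
"""


def data_original_urls(markup):
    """Return the data-original value of every lazy-loaded image."""
    root = ET.fromstring(markup)
    urls = []
    # ElementTree supports exact-match attribute predicates but not contains(),
    # and it cannot select attributes directly, hence the Python-side checks.
    for img in root.findall(".//div[@class='thisistherealimage']/img"):
        classes = img.get("class", "").split()
        if "postDisplayedImage" in classes and "lazy" in classes:
            url = img.get("data-original")
            if url is not None:
                urls.append(url)
    return urls


print(data_original_urls(PAGE))
```

The placeholder `src` value is never read; only `data-original` is collected.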

Related

I want to extract a URL from an anchor tag in UiPath

I'm looking for a way to extract URLs from anchor tags. The anchor tags are rendered like this in the DOM:
<a target="_blank" id="Tile_WPQ8_1_3" href="#" onclick="PreventDefaultNavigation(); return false;" hrefaction="https://institutes.kpmg.us/global-energy/webcasts/2020/resilience-in-energy-3.html" clickaction="null"></a>
I want the value of hrefaction. I'm trying the data-scraping definition below:
<extract>
<column name2='Url' attr2='href' exact='0' name='Name' attr='text'>
<webctrl tag="a"/>
</column>
</extract>
but it gives me only the href value, whereas, as the markup above shows, the value I need is in hrefaction.
Any lead is highly appreciated!
You can use UiPath's Get Attribute activity to read the value of the attribute you want.
If Get Attribute does not work, because you cannot access the web element or because the data comes from somewhere else, you can still use a simple regex such as hrefaction="([^"]+)".
A greedy pattern like hrefaction="(.+)" would run on to the last double quote on the line and swallow the clickaction attribute as well.
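To illustrate the regex route outside UiPath, here is a minimal Python sketch applied to the anchor from the question. It contrasts a greedy `(.+)` capture with a negated character class `[^"]+`, which stops at the first closing quote. The names `ANCHOR`, `HREFACTION`, and `extract_hrefaction` are made up for this example.

```python
import re

# The anchor element from the question, as a single string.
ANCHOR = (
    '<a target="_blank" id="Tile_WPQ8_1_3" href="#" '
    'onclick="PreventDefaultNavigation(); return false;" '
    'hrefaction="https://institutes.kpmg.us/global-energy/webcasts/2020/resilience-in-energy-3.html" '
    'clickaction="null"></a>'
)

# [^"]+ matches everything up to, but not including, the next double quote.
HREFACTION = re.compile(r'hrefaction="([^"]+)"')


def extract_hrefaction(html):
    """Return every hrefaction value found in the given HTML text."""
    return HREFACTION.findall(html)


# For comparison: a greedy (.+) keeps going until the last quote on the line.
greedy = re.search(r'hrefaction="(.+)"', ANCHOR).group(1)

print(extract_hrefaction(ANCHOR))
print(greedy)
```

The first line printed holds only the URL; the greedy capture also contains the trailing `" clickaction="null` text.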

Extracting links (get href values) with certain text with Xpath under a div tag with certain class

SO contributors. I am fully aware of the following question How to obtain href values from a div using xpath?, which basically deals with one part of my problem yet for some reason the solution posted there does not work in my case, so I would kindly ask for help in resolving two related issues. In the example below, I would like to get the href value of the "more" hyperlink (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with content class.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More here.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this and (second part) how can I extract the href value of only the "here" hyperlink?

Thymeleaf parse text and execute in-text expressions

I have a text string that contains links, for example <a th:href="'someLink'">Download</a> .
I need to process that text and replace th:href="'someLink'" with the correct links, so that the text is shown with a working Download link.
The text with links is stored in variable textThatContainsLinks.
My code to show text is <div th:utext="${textThatContainsLinks}">. I also tried to use preprocessing like <div th:utext="${__textThatContainsLinks__}">.
Currently this code does not show the links as I expected; they are left unprocessed, i.e. the output is still <a th:href="'someLink'">Download</a>.
How can I pre-process the expressions in the text before showing it?
Thank you very much!
Take the context path and attach it directly to the relative path in a plain HTML5 attribute, e.g. LINK, <img src="/contextPath/relative/path/image.jpg" width="50" height="50" alt="logo"/>.
Notice how simple it is to reach the resource: /contextPath/relativePath, so the most important part is the relative path. This is similar to the question "Thymeleaf was unable to render <img> tag when sent from database table". I observed that once Thymeleaf's th: namespace qualifies an href or src attribute that sits inside a text/String, the absolute path is not properly resolved.

How to use XPATH to find an image called *logo*, or which has a class with the word *logo* in it?

I am creating a crawler which needs to download the logo from every website it crawls.
It is quite hard to detect which image is the logo, however I don't need 100% accuracy, so I am thinking of just looking for <img> tags which fulfil any of the following conditions:
A. The name of the image in the <img> tag has the word "logo" in it, for example:
<img src="logo.gif">
<img src="site-logo.jpg">
<img src="mainlogo.png">
B. The class or id in the <img> tag has the word logo in it, for example:
<img class="logo" src="something.gif">
<img id="main-logo" src="something.gif">
<img class="background logo" src="something.gif">
I've tried following the W3C XPATH documentation, but it is not very user friendly. I've also tried using what are supposed to be wildcards (according to w3schools) but they do not appear to work as expected.
Is it possible to achieve what I want using XPATH? Could you help provide some pointers or example code?
Thank you.
You could use:
/html/body//img[contains(@src, 'logo') or contains(@id, 'logo') or contains(@class, 'logo')]
which will find all img tags that are a descendant of the body tag, where the src, id or class attribute contains the text logo.
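As a hedged illustration of what that predicate selects, the sketch below applies the same three substring tests in plain Python. It uses the standard library's `xml.etree.ElementTree`, which has no `contains()`, so the checks are written with Python's `in` operator; like `contains()`, they are case-sensitive. The sample page and the `logo_candidates` helper are hypothetical, and the markup is assumed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical page mixing logo and non-logo images.
PAGE = """
<html><body>
  <img src="logo.gif" />
  <img src="photo.jpg" />
  <img id="main-logo" src="something.gif" />
  <img class="background logo" src="banner.gif" />
  <img class="avatar" src="user.png" />
</body></html>
"""


def logo_candidates(markup):
    """Return the src of each <img> under <body> whose src, id or class contains 'logo'."""
    body = ET.fromstring(markup).find("body")
    hits = []
    for img in body.iter("img"):
        # Mirrors: contains(@src,'logo') or contains(@id,'logo') or contains(@class,'logo')
        if any("logo" in img.get(attr, "") for attr in ("src", "id", "class")):
            hits.append(img.get("src"))
    return hits


print(logo_candidates(PAGE))
```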

Xpath of a text containing Bold text

I am trying to click on the link for the site www.qualtrapharma.com after searching Google for "qualtra", but I have a problem writing the XPath because the <cite> tag contains a <b> tag inside it. Can anyone suggest how to do this?
<div class="f kv" style="white-space:nowrap">
<cite class="vurls">
www.
<b>qualtra</b>
pharma.com/
</cite>
</div>
You may overcome this by using '.' in the XPath, which stands for the string value of the current node, that is, the concatenated text of the node and all of its descendants.
The XPath would look like the following:
//cite[.='www.qualtrapharma.com/']
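To see why the comparison works even though the text is split by a <b> element, this small Python sketch rebuilds the string value of the node with the standard library's `itertext()`. It assumes the line breaks inside <cite> in the question are only display formatting; if the real markup contained them, the exact comparison would fail and something like normalize-space() would be needed. `SNIPPET` and `string_value` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Compact version of the markup from the question (assumed to have no
# whitespace between the text fragments in the real page).
SNIPPET = (
    '<div class="f kv" style="white-space:nowrap">'
    '<cite class="vurls">www.<b>qualtra</b>pharma.com/</cite>'
    '</div>'
)


def string_value(element):
    """Concatenate the text of an element and all its descendants, in document order."""
    return "".join(element.itertext())


cite = ET.fromstring(SNIPPET).find("cite")

# The element's own leading text stops at the first child element...
print(repr(cite.text))
# ...while the string value spans the <b> child as well.
print(repr(string_value(cite)))
```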
