How to use XPATH to find an image called *logo*, or which has a class with the word *logo* in it? - xpath

I am creating a crawler which needs to download the logo from every website it crawls.
It is quite hard to detect which image is the logo. However, I don't need 100% accuracy, so I am thinking of just looking for <img> tags which fulfil any of the following conditions:
A. The name of the image in the <img> tag has the word "logo" in it, for example:
<img src="logo.gif">
<img src="site-logo.jpg">
<img src="mainlogo.png">
B. The class or id in the <img> tag has the word logo in it, for example:
<img class="logo" src="something.gif">
<img id="main-logo" src="something.gif">
<img class="background logo" src="something.gif">
I've tried following the W3C XPATH documentation, but it is not very user friendly. I've also tried using what are supposed to be wildcards (according to w3schools) but they do not appear to work as expected.
Is it possible to achieve what I want using XPATH? Could you help provide some pointers or example code?
Thank you.

You could use:
/html/body//img[contains(@src, 'logo') or contains(@id, 'logo') or contains(@class, 'logo')]
which will find all img tags that are descendants of the body tag and whose src, id or class attribute contains the text logo.
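For a crawler without an XPath engine at hand, the same three conditions can be sketched with Python's standard library alone. This is only an illustration, not an XPath evaluation: the names LogoImageFinder and find_logo_images are made up for this example, the match is case-sensitive (as contains() is), and unlike the expression above it does not check that the image sits under <body>.

```python
from html.parser import HTMLParser


class LogoImageFinder(HTMLParser):
    """Collect <img> tags whose src, id or class contains 'logo'."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # Same three checks as the XPath predicate; a missing or empty
        # attribute is treated as an empty string.
        if any("logo" in (attr_map.get(name) or "")
               for name in ("src", "id", "class")):
            self.matches.append(attr_map)


def find_logo_images(html):
    """Return the attribute dicts of all matching <img> tags, in order."""
    finder = LogoImageFinder()
    finder.feed(html)
    finder.close()
    return finder.matches


sample = """
<html><body>
<img src="site-logo.jpg">
<img class="background logo" src="something.gif">
<img id="main-logo" src="other.gif">
<img src="banner.png">
</body></html>
"""

for img in find_logo_images(sample):
    print(img.get("src"))
```

With an XPath-capable library (lxml's xpath() method, for instance), the expression above can be passed in directly instead.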

Related

Laravel change img src on data fetch

I am retrieving text which contains images, saved from a WYSIWYG editor (Summernote). Is there a way to replace the src attribute value in img tags using asset()?
Example:
<img src="images/image.jpg"/>...
To:
<img src="https://.../images.jpg"/>
I want a solution that covers all cases: spaces in image names, different extensions, and so on.
Sure, just use the curly brace syntax in your blade to render the asset()
<img src="{{ asset('whatever_you_want') }}"/>
I don't think you can do it in Blade. You could, in your model, add a function that rewrites all image paths to full URLs. This could be done with a regex pattern that looks for the URLs inside <img> tags.
I would, however, make sure the full path to the image is included in the text in the database. This way, you always have access to the right path to the image, and you're not relying on a piece of code to display the right image.
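The regex idea can be sketched as follows. This is a language-neutral illustration written in Python, not Laravel code: BASE_URL is a hypothetical stand-in for whatever asset() would generate, and absolutize_img_sources is a made-up name. In PHP the same shape would use preg_replace_callback with asset() inside the callback.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical base URL, standing in for what asset() would produce.
BASE_URL = "https://example.com/"

# Captures: (1) everything up to src=, (2) the quote character, (3) the value.
IMG_SRC = re.compile(r"""(<img\b[^>]*?\ssrc\s*=\s*)(["'])(.*?)\2""",
                     re.IGNORECASE | re.DOTALL)

# Values that already carry a scheme (https:, data:, ...) or start with //.
ABSOLUTE = re.compile(r"^(?:[a-z][a-z0-9+.-]*:|//)", re.IGNORECASE)


def absolutize_img_sources(html, base_url=BASE_URL):
    """Rewrite relative <img src> values in an HTML fragment to full URLs."""

    def rewrite(match):
        prefix, quote_char, src = match.groups()
        if ABSOLUTE.match(src):
            return match.group(0)  # already absolute, leave untouched
        # Percent-encode spaces and similar characters; keep "/" and
        # existing %-escapes as they are.
        full = urljoin(base_url, quote(src, safe="/%"))
        return f"{prefix}{quote_char}{full}{quote_char}"

    return IMG_SRC.sub(rewrite, html)


print(absolutize_img_sources('<p><img src="images/my image.jpg"/></p>'))
```

The pattern does not depend on the file extension, and spaces in file names are percent-encoded. Unquoted src values and query strings are not handled in this sketch.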

Telegram's instant view API: Element <img> is not supported in <p>

I have a problem creating my Telegram Instant View template. I get this error:
Element <img> is not supported in <p>: <img src="mysrc" />
So I decided to replace any <p> tag that contains an <img> tag with a <figure> tag:
@replace_tag(<figure>): $body//p//img
But the result does not show the image. FYI, the <img> has no attributes other than src.
Sample code:
<p><img src="mysrc"/></p>
I'm out of ideas, please help.
The problem with your code is that it replaces the <img> element itself.
As you said, you want to replace the <p> with a <figure>, so target the <p> tags that have <img> descendants:
@replace_tag(<figure>): $body//p[.//img]
A simpler way to write this is <figure>: $body//p[.//img]
I added
@split_parent: //p/img
and it works, although I don't know why.

XPath formula for an uncommon second URL attribute in <img> element

I'm having difficulty finding the correct XPath to scrape the real URL of any image in my Scoop.it topic. Here is a code excerpt centered on one image; other images are treated the same way.
<div class="thisistherealimage" >
<img id="Here a specific image ID" width="467" height="412"
class="postDisplayedImage lazy"
src="/resources/img/white.gif"
data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
alt="Here an alternative text" style="width:467; height: 412;" />
So, in this code sample, I don't want to scrape "/resources/img/white.gif" but the URL in the "data-original" attribute!
I'd like to capture the data-original attribute itself, not only when it contains a URL.
As an XPath beginner, I've tried:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage')][contains(@class,'lazy')]
But it isn't specific to the data-original attribute, is it?
Any advice?
If you want the data-original attribute, you can select it like this:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]/@data-original
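The same selection can be sketched in Python with the standard library. Two assumptions are made here: the excerpt is closed with </div> so that it parses as well-formed XML (real pages usually need an HTML-aware parser), and data_original_urls is a made-up helper name. ElementTree supports only a subset of XPath, with no contains() and no attribute step such as /@data-original, so the class checks and the attribute lookup are done in Python.

```python
import xml.etree.ElementTree as ET

# The excerpt from the question, with the <div> closed so it parses as XML.
snippet = """
<div class="thisistherealimage">
  <img id="Here a specific image ID" width="467" height="412"
       class="postDisplayedImage lazy"
       src="/resources/img/white.gif"
       data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
       alt="Here an alternative text" style="width:467; height: 412;" />
</div>
"""


def data_original_urls(root):
    """Return data-original values of matching <img> children of matching <div>s."""
    urls = []
    # iter() also yields the root element, which may itself be the <div>.
    for div in root.iter("div"):
        if "thisistherealimage" not in div.get("class", ""):
            continue
        # findall("img") looks at direct children only, like the /img step.
        for img in div.findall("img"):
            css = img.get("class", "")
            # Substring tests mirror the two contains(@class, ...) checks.
            if "postDisplayedImage" in css and "lazy" in css:
                value = img.get("data-original")
                if value is not None:
                    urls.append(value)
    return urls


print(data_original_urls(ET.fromstring(snippet)))
```

Reading the attribute with get("data-original") plays the role of the final /@data-original step, so the placeholder in src is never touched.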

Extracting links (get href values) with certain text with Xpath under a div tag with certain class

SO contributors, I am fully aware of the question "How to obtain href values from a div using xpath?", which deals with one part of my problem, yet for some reason the solution posted there does not work in my case, so I would like to ask for help with two related issues. In the example below, I would like to get the href value of the "more" hyperlink (http://www.thestraddler.com/201715/piece2.php), which is under the div tag with the content class.
<div class="content">
<h3>Against the Renting of Persons: A conversation with David Ellerman</h3>
[1]
</p>
<p>More here.</p>
</div>
In theory I should be able to extract the links under a div tag with
xidel website -e //div[@class="content"]//a/@href
but for some reason it does not work. How can I resolve this, and (second part) how can I extract the href value of only the "here" hyperlink?

Thymeleaf parse text and execute in-text expressions

I have a text string that contains links, for example <a th:href="'someLink'">Download</a>.
I need to process that text and replace th:href="'someLink'" with the correct links, so that the text is shown with a working Download link.
The text with links is stored in variable textThatContainsLinks.
My code to show text is <div th:utext="${textThatContainsLinks}">. I also tried to use preprocessing like <div th:utext="${__textThatContainsLinks__}">.
Currently this code does not show the links as I expected: they are not pre-processed, i.e. the output is still <a th:href="'someLink'">Download</a>.
How can I pre-process the expressions in the text before showing it?
Thank you very much!
Take the context path and attach it directly to the relative path in a plain HTML5 attribute, e.g. LINK, <img src="/contextPath/relative/path/image.jpg" width="50" height="50" alt="logo"/>.
Notice how simple access to the resource is: /contextPath/relativePath, so the most important part is the relative path. This is similar to "Thymeleaf was unable to render <img> tag when sent from database table". I observed that once Thymeleaf's th: namespace qualifies an href or src attribute that resides inside a text/String, the absolute path is not resolved properly.
