Matching hyperlinked images using preg_match - preg-match

I've been working on a preg_match expression to match a hyperlinked image. I need to use preg_match because the data comes from a database and needs some alterations before being rendered in HTML.
This is my match expression, but it won't match the example hyperlink below.
/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><img(.*)<\/a>/msU
<a href="http://lifehacker.com/5936040/improvise-a-laptop-cooling-pad-with-plastic-bottle-caps" caps"="" bottle="" plastic="" with="" pad"="" title="Click here to read Improvise a Laptop Cooling ">
Is there something wrong with the expression that I'm (obviously) not seeing?
Thanks.
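One way to see what the expression does and does not accept is to run it against a couple of inputs. The following is a minimal Python sketch, not the asker's PHP code: the pattern is an approximate hand translation (PHP's U modifier inverts greediness, so each quantifier has been flipped manually), and both test strings are hypothetical. The second one is a shortened stand-in for the sample above, which consists of an opening <a> tag only, with no <img> and no closing </a> for the pattern to find.

```python
import re

# Approximate Python translation of the PHP pattern above. PHP's U modifier
# inverts greediness, so each quantifier is flipped by hand here.
LINKED_IMAGE = re.compile(
    r'<a\s[^>]*?href=("?)([^" >]*)\1[^>]*?><img(.*?)</a>',
    re.MULTILINE | re.DOTALL,
)

# Hypothetical, well-formed hyperlinked image (not taken from the question).
well_formed = '<a href="http://example.com/page" title="t"><img src="x.png" alt="x"></a>'

# Shortened stand-in for the sample in the question: an opening <a> tag with
# stray attributes and nothing after it.
sample = '<a href="http://example.com/page" caps"="" title="Click here to read ">'

match = LINKED_IMAGE.search(well_formed)
href = match.group(2) if match else None
sample_match = LINKED_IMAGE.search(sample)

print(href)          # the captured href of the well-formed link
print(sample_match)  # None: there is no <img ...></a> part to match
```

If the stored data really looks like the sample, the pattern has nothing to match after the opening tag, regardless of how the href part is written.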

Related

"Imported content is empty." error when scraping with ImportXML in GSheets

I need to scrape image source URLs from a directory's linked web pages into columns in a Google Sheet.
I think using the IMPORTXML function would be the easiest solution, but I get the #N/A "Imported content is empty." error every time.
I have also tried using this extension to determine the XPath, but I still get the same error.
The part of the page's source code that contains the image source URL is:
<div class="centerer" id="rbt-gallery-img-1">
<i class="spinner">
<span></span>
</i>
<img data-lazy="//i.example.com/01.jpg" border="0"/>
</div>
So I want to get the value "i.example.com/01.jpg" into B2, followed by the other images' URLs in adjacent cells.
The function I used is:
=IMPORTXML(A2,"//img[@class='centerer']/@data-lazy")
I tried using spinner instead of centerer, with the same result.
You can get the string i.example.com/01.jpg with the following XPath-1.0 expression:
substring-after(//div[@class='centerer']/img/@data-lazy,'//')
If you don't need to remove the leading //, you can simply use
//div[@class='centerer']/img/@data-lazy
So, in the first case, the Google Sheets expression could be
=IMPORTXML(A2,"substring-after(//div[@class='centerer']/img/@data-lazy,'//')")
and in the second it could be
=IMPORTXML(A2,"//div[@class='centerer']/img/@data-lazy")
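To try the same selection outside Google Sheets, here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not how IMPORTXML works internally: the <body> wrapper is hypothetical, ElementTree supports only a subset of XPath (so the attribute is read with .get() instead of a trailing attribute step), and substring-after is emulated with str.partition.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, wrapped in a hypothetical <body> element so
# that the div is a descendant of the parsed root.
page = """
<body>
<div class="centerer" id="rbt-gallery-img-1">
<i class="spinner">
<span></span>
</i>
<img data-lazy="//i.example.com/01.jpg" border="0"/>
</div>
</body>
"""

root = ET.fromstring(page.strip())

# Select the <img> children of the div whose class is exactly 'centerer'.
images = root.findall(".//div[@class='centerer']/img")

# Read the attribute value; this keeps the leading '//'.
raw_urls = [img.get("data-lazy") for img in images]

# Emulate substring-after(value, '//'): keep what follows the first '//'.
urls = [raw.partition("//")[2] for raw in raw_urls]

print(raw_urls)
print(urls)
```

The point is that the class predicate belongs on the div, and data-lazy is then read from the img inside it.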

XPath formula for an uncommon second URL attribute in <img> element

I'm having difficulty getting the correct XPath to scrape the real URL of any image in my Scoop.it topic. Here is the code excerpt centered on one image. Other images are treated the same way.
<div class="thisistherealimage" >
<img id="Here a specific image ID" width="467" height="412"
class="postDisplayedImage lazy"
src="/resources/img/white.gif"
data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
alt="Here an alternative text" style="width:467; height: 412;" />
So, in this code sample, I don't want to scrape "/resources/img/white.gif" but the URL following the "data-original" attribute!
I'd like to capture the data-original attribute itself, not only capture it when it contains a URL.
As an XPath beginner, I've tried //div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage')][contains(@class,'lazy')]
But that's not specific to the data-original attribute, is it?
Any advice?
If you want the data-original value, you can access it like this:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]/@data-original

How to use XPATH to find an image called *logo*, or which has a class with the word *logo* in it?

I am creating a crawler which needs to download the logo from every website it crawls.
It is quite hard to detect which image is the logo, however I don't need 100% accuracy, so I am thinking of just looking for <img> tags which fulfil any of the following conditions:
A. The name of the image in the <img> tag has the word "logo" in it, for example:
<img src="logo.gif">
<img src="site-logo.jpg">
<img src="mainlogo.png">
B. The class or id in the <img> tag has the word logo in it, for example:
<img class="logo" src="something.gif">
<img id="main-logo" src="something.gif">
<img class="background logo" src="something.gif">
I've tried following the W3C XPATH documentation, but it is not very user friendly. I've also tried using what are supposed to be wildcards (according to w3schools) but they do not appear to work as expected.
Is it possible to achieve what I want using XPATH? Could you help provide some pointers or example code?
Thank you.
You could use:
/html/body//img[contains(@src, 'logo') or contains(@id, 'logo') or contains(@class, 'logo')]
which will find all img tags that are a descendant of the body tag, where the src, id or class attribute contains the text logo.
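For a crawler that has no XPath engine at hand, the same three substring checks can be sketched in plain Python with the standard library's html.parser. This is only an illustration: the LogoFinder class and the sample page are hypothetical, the check is case-sensitive like XPath's contains(), and unlike the expression above it does not require the image to sit inside <body>.

```python
from html.parser import HTMLParser


class LogoFinder(HTMLParser):
    """Collects the src of <img> tags whose src, id or class contains 'logo'."""

    def __init__(self):
        super().__init__()
        self.logos = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Same three substring checks as the XPath predicate (case-sensitive).
        # A missing or valueless attribute is treated as an empty string.
        if any("logo" in (attributes.get(name) or "") for name in ("src", "id", "class")):
            self.logos.append(attributes.get("src"))


# Hypothetical page built from the examples in the question, plus one
# image that should not be picked up.
page = """
<html><body>
<img src="site-logo.jpg">
<img id="main-logo" src="something.gif">
<img class="background logo" src="other.gif">
<img class="banner" src="banner.png">
</body></html>
"""

finder = LogoFinder()
finder.feed(page)
logos = finder.logos
print(logos)
```

Because this is a plain substring test, it will also pick up values such as "logout-icon.png", which is the same trade-off the XPath version makes.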

How to add and render multiple images/files in typo3 Fluid content element(flux)?

I am using Flux and Fluid content elements to make content editable by the user. I added an image field that allows multiple images to be uploaded.
But now I am not able to show these images.
The value of my image field looks like this:
image => 'kip.jpg,772_Visteon_010.jpg'
Normally, if I have only one value, I can show it with the <f:image> or <img src="{image}" /> tag.
So, does anybody have an idea how I can display multiple images or files?
Thanks in advance.
You seem to be using the old style of image inclusion: a comma-separated list of names with copies below uploads/.
In that case you need to split (https://docs.typo3.org/typo3cms/TyposcriptReference/Functions/Split/Index.html) the field and work with the resulting array as before.
In the long run you should use FAL, although the handling is a little more complex.
You should use Flux inline FAL for multiple images in a Fluid content element.
Below is the syntax for inline FAL:
<flux:field.inline.fal name="settings.image" label="Image" />
After that you can render it with the following code:
<f:for each="{v:content.resources.fal(field: 'settings.image')}" as="image">
<f:image treatIdAsReference="1" src="{image.id}" title="{image.title}" alt="{image.alternative}"/><br/>
</f:for>
You can find details at the URL below:
https://fluidtypo3.org/viewhelpers/flux/master/Field/Inline/FalViewHelper.html
Hope this helps.

How to extract text with HtmlAgilityPack from an HTML node whose class attribute is appended dynamically

Dear friends, I want to extract the text 平均3.6 星 from this code segment excerpted from amazon.cn.
<div class="content"><ul>
<li><b>用户评分:</b>
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" ref="dp_db_cm_cr_acr_pop_" name="B004GUSIKO">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
My problem is that the span's class value "s_star_3_5 " varies with the customer rating level and is appended dynamically. So I tried doc.DocumentNode.SelectSingleNode(" //span[@class='swSprite']").InnerText and //span[@class='swSprite s_star_3_5 '], but the result is either an error or not what I want!
Any suggestions?
First of all, I suggest saving the value of doc.DocumentNode.OuterHtml to a local .html file and checking whether the HTML you're getting is the HTML you expect. Sometimes when you start parsing a website with HtmlAgilityPack, the very first problem is that you're not actually getting valid HTML. Maybe you're getting a 404 error, or a redirection, etc.
I'm suggesting this because I tested //span[@class='swSprite s_star_3_5 '] and it worked correctly.
That was the issue in the following questions:
Selecting nodes that have an attribute with spaces using HTMLAgilityPack
XPath Query Problem using HTML Agility Pack
If that doesn't help, post the HTML code and I'll help you ;)
This works for me:
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtml);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'swSprite')]");
Console.WriteLine("Text=" + node.InnerText.Trim());
and outputs
平均3.6 星
Note I use the XPATH starts-with function.
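The difference between an exact class comparison and a prefix test can be reproduced outside HtmlAgilityPack. Below is a minimal Python sketch using the standard library; it is an illustration only, the snippet is a trimmed and properly closed version of the HTML in the question, and the prefix test is done in Python because ElementTree's XPath subset has no starts-with().

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the HTML from the question.
snippet = """
<li><b>用户评分:</b>
<span class="crAvgStars">
<a>
<span class="swSprite s_star_3_5 " title="平均3.6 星">
<span>平均3.6 星</span>
</span>
</a>
</span>
</li>
"""

root = ET.fromstring(snippet.strip())
spans = list(root.iter("span"))

# Exact comparison: the whole attribute value must equal 'swSprite',
# so the rating-dependent suffix makes this find nothing.
exact = [s for s in spans if s.get("class") == "swSprite"]

# Prefix comparison: the Python counterpart of starts-with(@class, 'swSprite').
prefixed = [s for s in spans if (s.get("class") or "").startswith("swSprite")]

# Join all text inside the matched span and trim the surrounding newlines.
text = "".join(prefixed[0].itertext()).strip()

print(len(exact))
print(text)
```

This is why the exact class comparison in the question came back empty while the starts-with version finds the node whatever rating class is appended.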
