Is there a way to programmatically get the emoji Unicode starting from CSS? - emojione

I have to get the emoji Unicode starting from some CSS-classed markup, like this:
<img src=""
class="smilie smilie--sprite smilie--sprite2" alt=";)" title="Wink ;)" loading="lazy"
data-shortname=";)">
<img src=""
class="smilie smilie--sprite smilie--sprite3" alt=":(" title="Frown :(" loading="lazy"
data-shortname=":(">
In the first case it is possible, because the first word of the title (Wink) is the short name of the emoji.
In the second case, the title does not correspond to the short name of the emoji.
Does anyone know if this is possible?
Thanks in advance.
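
One possible approach, sketched here in Python, is to ignore the title and key off the data-shortname attribute that both images carry. The lookup table below is a hypothetical, hand-written mapping, not data taken from emojione, and the function name smilie_codepoints is made up for illustration; a real table would need an entry for every smilie the site uses.

```python
from html.parser import HTMLParser

# Hypothetical hand-written table: ASCII shortname -> Unicode code point (hex).
# These two entries are assumptions for illustration, not emojione data.
ASCII_TO_CODEPOINT = {
    ";)": "1f609",  # WINKING FACE
    ":(": "1f641",  # SLIGHTLY FROWNING FACE
}


class SmilieParser(HTMLParser):
    """Collects a code point (or None) for every <img data-shortname=...>."""

    def __init__(self):
        super().__init__()
        self.codepoints = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        shortname = dict(attrs).get("data-shortname")
        if shortname is not None:
            # Unknown shortnames are recorded as None instead of guessed.
            self.codepoints.append(ASCII_TO_CODEPOINT.get(shortname))


def smilie_codepoints(markup):
    parser = SmilieParser()
    parser.feed(markup)
    return parser.codepoints


MARKUP = """
<img src="" class="smilie smilie--sprite smilie--sprite2" alt=";)"
     title="Wink ;)" loading="lazy" data-shortname=";)">
<img src="" class="smilie smilie--sprite smilie--sprite3" alt=":("
     title="Frown :(" loading="lazy" data-shortname=":(">
"""

for codepoint in smilie_codepoints(MARKUP):
    # chr(int(..., 16)) turns the hex code point into the actual character.
    print(codepoint, chr(int(codepoint, 16)))
```

Because the table is explicit, the second image no longer depends on its title matching an emoji name.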

Related

Pandoc 2.x renders images' alternative texts in an inaccessible fashion

Since I upgraded from Pandoc v1.19 to 2.9, decorative images are not exported as expected anymore.
First of all, when generating HTML from ![](test.jpg), in v1.19 a <p class="figure"> structure was wrapped around the image, but now it's only a <p>:
<p>
<img src="test.jpg">
</p>
This makes it harder to style in line with other images that have an alternative text.
But what's really a problem here: there's no alt="" attribute produced anymore! This means that e.g. screen readers will not recognise this as a decorative image anymore.
So let's see what happens to an image with an actual alternative text, e.g. when generating HTML from ![Hello](test.jpg):
<div class="figure">
<img src="test.jpg" alt="">
<p class="caption">Hello</p>
</div>
Here we get a class="figure" in the surrounding element, but now it's a <div> instead of a <p> (I don't bother too much about this, but again, it makes it harder to style everything the same).
What again is a big problem though is the fact that the alt attribute is now set empty: this prevents screen readers from perceiving them at all, which is horribly wrong! I guess that Pandoc concludes that having alternative text and caption would be redundant, which is correct, and that the caption below would be the right thing to show - which it is not.
The right structure would look something like this:
<div class="figure">
<img src="test.jpg" alt="Hello"><!-- Leave the alternative text on the image -->
<p class="caption" aria-hidden="true">Hello</p><!-- Hide the redundant visual alternative text from screen readers -->
</div>
Any reason why this behaviour would make sense? Can it be changed somehow? Otherwise I will have to fiddle around with some post-processing JavaScript...
The ![](test.jpg) example is no longer treated as a figure, because pandoc now requires that
the image is the only element in a paragraph, and
it has a caption.
Wrapping of figures with <div> happens when exporting to HTML4. Using the latest pandoc 2.9.2.1 and running pandoc -t html5 on the input ![Hello](test.jpg) produces:
<figure>
<img src="test.jpg" alt="" /><figcaption>Hello</figcaption>
</figure>
The rationale for emitting an empty alt attribute is that screen readers would read the caption twice: first the alt, then the figcaption. Your suggestion seems much better, please open an issue.
If you can't wait for a new release, then use a Lua filter to create figures the way you like:
function Para (p)
  -- Only handle paragraphs that consist of a single image.
  if #p.content == 1 and p.content[1].t == "Image" then
    local image = p.content[1]
    local figure_content = pandoc.List{}
    -- Keep the image itself, then append its caption wrapped in raw HTML.
    figure_content:insert(image)
    figure_content:insert(
      pandoc.RawInline('html', '\n<p class=caption aria-hidden="true">'))
    figure_content:extend(image.caption)
    figure_content:insert(pandoc.RawInline('html', '</p>'))
    -- Wrap everything in a Div with class "figure".
    local attr = pandoc.Attr("", {"figure"})
    return pandoc.Div({pandoc.Plain(figure_content)}, attr)
  end
end

XPath formula for an uncommon second URL attribute in <img> element

I'm having difficulty getting the correct XPath to scrape the real URL of any image of my Scoop.it topic. Here is the code excerpt centered on one image. Other images are treated the same way.
<div class="thisistherealimage" >
<img id="Here a specific image ID" width="467" height="412"
class="postDisplayedImage lazy"
src="/resources/img/white.gif"
data-original="https://img.scoop.it/jKj7v6ojzPtACT6EaeztHTl72eJkfbmt4t8yenImKBVvK0kTmF0xjctABnaLJIm9"
alt="Here an alternative text" style="width:467; height: 412;" />
So, in this code sample, I don't want to scrape "/resources/img/white.gif" but the URL in the "data-original" attribute!
I'd like to capture the data-original attribute itself, not only when it contains a URL.
As an XPath beginner, I've tried //div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage')][contains(@class,'lazy')]
But it's not specific to the data-original attribute, is it?
Any advice?
If you want the data-original value, select the attribute itself with a final /@data-original step:
//div[contains(@class,'thisistherealimage')]/img[contains(@class,'postDisplayedImage') and contains(@class,'lazy')]/@data-original
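
For readers who want to try the idea without extra dependencies, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. ElementTree supports only a subset of XPath (no contains() and no attribute step), so this sketch swaps in an exact class match on the div and reads the attribute with .get(); it is an approximation of the expression, not the expression itself. The sample markup, its URLs, and the function name data_original_urls are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page fragment; the URLs are placeholders.
SAMPLE = """
<root>
  <div class="thisistherealimage">
    <img id="img-1" class="postDisplayedImage lazy"
         src="/resources/img/white.gif"
         data-original="https://img.example/one.jpg" alt="first" />
  </div>
  <div class="other">
    <img class="postDisplayedImage lazy" src="/resources/img/white.gif"
         data-original="https://img.example/ignored.jpg" />
  </div>
</root>
"""


def data_original_urls(xml_text):
    """Return data-original values of lazy images inside the target div."""
    root = ET.fromstring(xml_text)
    urls = []
    # ElementTree only understands [@attr='value'], i.e. an exact match.
    for img in root.findall(".//div[@class='thisistherealimage']/img"):
        classes = img.get("class", "").split()
        if "postDisplayedImage" in classes and "lazy" in classes:
            value = img.get("data-original")
            if value is not None:
                urls.append(value)
    return urls


print(data_original_urls(SAMPLE))
```

Real pages are rarely well-formed XML, so for actual scraping an HTML-aware parser with full XPath support is likely the better fit.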

Delete regex phrase from text

I have a news text that contains HTML attributes I don't need. How can I delete phrases like these in Ruby:
img width="750" alt="4.jg" c="/unload/medialiy/df6/4.jg" height="499"
title=4.jg"
img width="770" alt="5.jg" c="/unload/medialiy/ty6/5.jg"
height="499" title=5.jg"
So I need a regex, something like news.sub('/img*jg"/, ''), but it doesn't work.
I would use:
img .*\.jg"
To say "any characters, in any quantity" in a regex, use .* where the dot means any character and the star means any quantity.
But are you sure you don't want to include the angle brackets?
<img .*\.jg">
As an aside, what if the order of the attributes changes? Then the pattern above will fail to match the img tag. What we really need is any img tag with the .jg" substring somewhere inside it:
<img [^>]*\.jg"[^>]*>
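
The thread is about Ruby, but the pattern itself is portable; as a quick illustration, here is a minimal Python sketch that applies the same expression with re.sub. The sample text and the name strip_jg_images are made up for this example.

```python
import re

# Same pattern as above: an <img ...> tag containing .jg" anywhere inside it.
# [^>]* cannot cross a ">", so the match stays within a single tag.
IMG_WITH_JG = re.compile(r'<img [^>]*\.jg"[^>]*>')


def strip_jg_images(text):
    """Remove every img tag that mentions a .jg" value; keep everything else."""
    return IMG_WITH_JG.sub("", text)


news = 'Before <img width="750" alt="4.jg" height="499"> after <img src="keep.png"> end'
print(strip_jg_images(news))
```

Only the first tag is removed; the second one has no .jg" substring before its closing bracket, so it is left alone.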
In your particular case you can do this:
element = '<img width="750" alt="4.jg" c="/unload/medialiy/df6/4.jg" height="499" title="4.jg">'
puts element.gsub(/(width|alt)=\"[^ ]+\" ?/, '')
You can also play around with this regex here.
But if you need a more robust solution, try to take a look at the Nokogiri gem. This SO question can help.

How to use XPATH to find an image called *logo*, or which has a class with the word *logo* in it?

I am creating a crawler which needs to download the logo from every website it crawls.
It is quite hard to detect which image is the logo, however I don't need 100% accuracy, so I am thinking of just looking for <img> tags which fulfil any of the following conditions:
A. The name of the image in the <img> tag has the word "logo" in it, for example:
<img src="logo.gif">
<img src="site-logo.jpg">
<img src="mainlogo.png">
B. The class or id in the <img> tag has the word logo in it, for example:
<img class="logo" src="something.gif">
<img id="main-logo" src="something.gif">
<img class="background logo" src="something.gif">
I've tried following the W3C XPATH documentation, but it is not very user friendly. I've also tried using what are supposed to be wildcards (according to w3schools) but they do not appear to work as expected.
Is it possible to achieve what I want using XPATH? Could you help provide some pointers or example code?
Thank you.
You could use:
/html/body//img[contains(@src, 'logo') or contains(@id, 'logo') or contains(@class, 'logo')]
which will find all img tags that are a descendant of the body tag, where the src, id or class attribute contains the text logo.
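
Since crawled pages are often not well-formed, the same test can also be sketched with Python's standard html.parser instead of an XPath engine. This is only an approximation of the expression: it checks every img tag in the input (it does not restrict itself to descendants of body), and, like contains(), the substring check is case-sensitive. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class LogoImageFinder(HTMLParser):
    """Collects the src of every <img> whose src, id or class contains 'logo'."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Valueless attributes are reported as None, hence the "or ''".
        if any("logo" in (attributes.get(name) or "")
               for name in ("src", "id", "class")):
            self.matches.append(attributes.get("src"))


def find_logo_sources(html_text):
    finder = LogoImageFinder()
    finder.feed(html_text)
    return finder.matches


PAGE = """
<body>
  <img src="site-logo.jpg">
  <img id="main-logo" src="something.gif">
  <img class="background logo" src="other.gif">
  <img src="photo.png">
</body>
"""

print(find_logo_sources(PAGE))
```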

Matching hyperlinked images using preg_match

I've been working on a preg_match expression to match a hyperlinked image. I need to use preg_match because the data comes from a database and needs some alterations before being rendered as HTML.
This is my match expression, but it doesn't match the example hyperlink below.
/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*><img(.*)<\/a>/msU
<a href="http://lifehacker.com/5936040/improvise-a-laptop-cooling-pad-with-plastic-bottle-caps" caps"="" bottle="" plastic="" with="" pad"="" title="Click here to read Improvise a Laptop Cooling ">
Is there something wrong with the expression that I'm (obviously) not seeing?
Thanks.
