Unexpected subpatterns captured by preg_match

I am working with regular expressions in PHP and matching with preg_match().
I have text that might look something like this:
$imy = '...without sophisticated apparatus<div class="caption"><div class="caption-inner">
<img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />
Caption text</div></div>Some more text...
<img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />blablah...';
My goal is to pick out either the "img" tag enclosed in the "div" tags (including the "div" tags) or just the "img" if it is not enclosed in divs. In each case I also want to capture the address contained in the src attribute of the "img" tag.
This is the pattern I use:
$imagepattern = '/<div class="caption-inner[^>]+>.*<img\b[^>]*\bsrc="([^">]*)"[^>]*>.*<\/div>(<\/div>)?|<img\b[^>]*\bsrc="([^">]*)"[^>]*>/Us';
and it works great for "div" enclosed images, but for the divless images I get weird results for the captured subpattern.
I iteratively call preg_match and remove the match from the subject string before resending it to preg_match. My call to preg_match looks like this:
preg_match($imagepattern,$imy,$image,PREG_OFFSET_CAPTURE)
What I get in my image array when matching against a divless image tag looks like this:
$image = [0] => Array
(
[0] => <img src="http://dev.mysite.org/Heatmap.png" alt="" title="" class="image-thumbnail" />
[1] => 1
)
[1] => Array
(
[0] =>
[1] => -1
)
[2] => Array
(
[0] =>
[1] => -1
)
[3] => Array
(
[0] => http://dev.mysite.org/Heatmap.png
[1] => 11
)
How can the $image array have the '2' and '3' keys? Don't I only have one subpattern? Is this somehow because of the 'or' condition in the pattern?

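Your preg_match pattern contains three capture groups, not one. Groups are numbered from left to right by their opening parenthesis across the whole pattern, regardless of which side of the | they sit on: group 1 is the src in the div branch, group 2 is the optional (<\/div>), and group 3 is the src in the divless branch.
The pattern as a whole matches because of the alternation (you search for div-enclosed images OR divless images).
For divless images only capture group 3 is filled. Groups 1 and 2 did not take part in the match, so with PREG_OFFSET_CAPTURE they come back as an empty string with offset -1.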
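The numbering rule is not specific to PHP. As a minimal sketch in Ruby, using a simplified, hypothetical stand-in for the pattern in the question (the tag layout and file names are made up for illustration), the same effect shows up: groups that belong to the branch that did not match are still counted, they are just empty (nil in Ruby).

```ruby
# Simplified stand-in for the pattern in the question: two branches,
# three capture groups numbered left to right across the alternation.
pattern = /<div><img src="([^"]*)"><\/div>(<\/div>)?|<img src="([^"]*)">/

# Hypothetical inputs: one bare image, one image wrapped in divs.
divless = pattern.match('text <img src="a.png"> more')
wrapped = pattern.match('<div><img src="b.png"></div></div>')

# Bare image: groups 1 and 2 did not participate, group 3 holds the src.
p divless.captures

# Wrapped image: groups 1 and 2 are filled, group 3 did not participate.
p wrapped.captures
```

So the code reading the result has to check which group is set: group 1 for the wrapped case, group 3 for the bare one.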


How to correctly use HTML tags in Laravel translations?

I have the code <span>{{ trans('lang.color.' . $bet->color) }}</span></div>, which displays the bet amount for a specific color.
My lang file:
'color' => [
'red' => 'red',
'zero' => 'green',
'black' => 'black',
],
This means that if the bet was placed on red, the site will say: set to red.
How can I correctly display HTML in the color values? For example, if I write 'red' => '<div style="color:#FF0000">red</div>', the site does not render the text as HTML and prints the div markup as plain text. How can I make the translation text render as HTML?
My Laravel version: 5.1.10.
Displaying Unescaped Data
By default, Blade {{ }} statements are automatically sent through PHP's htmlentities function to prevent XSS attacks. If you do not want your data to be escaped, you may use the following syntax:
{!! trans('lang.color.' . $bet->color) !!}
Note: Be very careful when echoing content that is supplied by users of your application. Always use the double curly brace syntax to escape any HTML entities in the content.

How to split by HTML tags using a regex

I have a string like this:
"Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39"
and I want to split it by its HTML tags, which are always <span>. I tried something like:
my_string.split(/<span(.*)span>/)
but it didn't work, it only matched the first element correctly.
Does anyone know what is wrong with my regex? In this example, I expected the returned value to be:
["Energia Elétrica kWh", "10.942", "0,74999294" ,"8.206,39"]
I would like something like strip_tags, but instead of returning the string sanitized, get the array split by the tags removed.
Don't use a pattern to manipulate HTML. It's a path destined to make you insane.
Instead use an HTML parser. The standard for Ruby is Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39")
You could use text to extract all the text, but, if it's structured data you're after, that often makes it difficult to extract the fields because the text nodes can be concatenated resulting in run-on words, so be careful there:
doc.text # => "Energia Elétrica kWh 10.942 0,74999294 8.206,39"
Instead we typically extract the data from individual nodes:
doc.search('span')[1].next_sibling.text # => " 0,74999294 "
doc.search('span').last.next_sibling.text # => " 8.206,39"
Or, we iterate over the nodes, then use map to grab the node's text:
doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["10.942", "0,74999294", "8.206,39"]
I'd go about the problem like this:
data = [doc.at('span').previous_sibling.text.strip] # => ["Energia Elétrica kWh"]
data += doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
Or:
spans = doc.search('span')
data = [
spans.first.previous_sibling.text,
*spans.map{ |span| span.next_sibling.text }
].map(&:strip)
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
While a regular expression can often work on an initial attempt, a change in the format of the HTML can break the pattern, forcing an additional change, then another change, and then another, until the pattern is too convoluted, whereas a properly written parser approach will typically be very resilient and immune to the problem.
If you really need to use regex to do this, you pretty much had it already.
irb(main):010:0> string.split(/<span.+?span>/)
=> ["Energia Elétrica kWh", " 10.942 ", " 0,74999294 ", " 8.206,39"]
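You just needed the ? to tell it to match as little as possible. Without it, .* is greedy and a single match runs from the first <span to the last span>. Dropping the parentheses matters too, because split adds captured text to its result.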
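To make the difference concrete, here is a small sketch that runs both patterns on the string from the question. The variable names are placeholders chosen for this example, and the final strip is an addition to get the whitespace-free fields the question asked for.

```ruby
line = "Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 " \
       "<span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39"

# Greedy with a capture group (the original attempt): one match runs from the
# first "<span" to the last "span>", and split also returns the captured text.
greedy = line.split(/<span(.*)span>/)

# Lazy and without a capture group: each <span>...</span> is its own separator.
# strip removes the whitespace left around each field.
lazy = line.split(/<span.+?span>/).map(&:strip)

p greedy.size
p lazy
```

The greedy version yields only three pieces (the leading text, the captured middle, and the trailing number), while the lazy version yields the four fields.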

How to sanitize a string with nested HTML tags but keep the <em> tag?

I am trying to sanitize Solr search results, because they have HTML tags inside:
ActionController::Base.helpers.sanitize( result_string )
It is easy to sanitize a non-highlighted string like: I know <ul><li>ruby</li> <li>rails</li></ul>.
But when the results are highlighted, I have additional important tags inside, <em> and </em>:
I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.
So when I sanitize a string with nested HTML and highlighting tags, I get a string with pieces of HTML tags left in it. And that is bad :)
How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?
I found a way, but it's slow and not pretty:
string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'
['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag|
string.gsub!( "<<em>#{tag}</em>>", '' )
string.gsub!( "</<em>#{tag}</em>>", '' )
end
string = ActionController::Base.helpers.sanitize string, tags: %w(em)
How can I optimize it, or is there a better solution, e.g. a regex that removes the HTML tags but keeps <em> and </em>?
Please help, thanks.
You could call gsub! to discard every tag except the standalone <em> and </em> tags, i.e. those that are not nested inside another tag's angle brackets.
result_string.gsub!(/(<\/?(?!em\b)\w+[^>]*>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')
would do the trick
To explain:
# first group (<\/?(?!em\b)\w+[^>]*>)
# find all ordinary html tags that are not <em> or </em>
# (the negative lookahead (?!em\b) skips the em tag itself)
# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>> or <<em>ul</em>>
# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>> or </<em>ul</em>>
# and gsub replaces all of this with empty string
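A minimal sketch of this approach, run on the string from the question. It uses a negative lookahead for the "any tag except em" part, which is an assumption of this sketch, and the constant and variable names are placeholders. It needs plain Ruby only, no Rails.

```ruby
highlighted = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
              '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

# Extended-mode regex: whitespace and comments inside it are ignored.
TAG_DEBRIS = %r{
  </?(?!em\b)\w+[^>]*>   # an ordinary tag other than the em tag
  | <<em>\w*</em>>       # opening tag whose name was highlighted
  | </<em>\w*</em>>      # closing tag whose name was highlighted
}x

# Non-mutating gsub, so the original search result stays untouched.
cleaned = highlighted.gsub(TAG_DEBRIS, '')
puts cleaned
```

Only the highlighting tags around actual words survive. Whether a single pass like this is enough for real Solr output depends on which tags and attributes can appear there, so treat it as a starting point.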
I think you can use sanitize:
Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>
So, something like this should work:
sanitize result_string, tags: %w(em)
With an additional parameter to sanitize, you can specify which tags are allowed.
In your example, try:
ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )
It should do the trick.

xpath - grab content only if preceding div has certain word

I want to grab the text content of this span, but only if the word 'Country' appears in the markup before it:
<li itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb"><a href="/testurl.html"
itemprop="url" onclick="ta.setEvtCookie('Breadcrumbs', 'click', 'Country', 2, this.href); ">
<span itemprop="title">China</span></a><img src="http://imagepath.gif" class="fake class"
alt="">
Does anyone know how I can do this?
To be clear, if the xpath query sees the word 'Country' I want it to return the word 'China'.
Instead of checking the preceding element, check the parent element, because in the sample markup the span is located inside the element that contains the word 'Country':
//span[parent::a[contains(@onclick,'Country')]]
The XPath above selects each <span> element whose parent is an <a> element with an onclick attribute value containing 'Country'.

Smarty Variable - Hyphen in Array Key

Trying to display a Smarty variable with a hyphen in the key. Nothing I can do to change the fact that it has a hyphen in the key.
For example, a phone number may be stored within the $form array as:
phone-1-1 => Array (9)
name => "phone-1-1"
value => "(555) 555-5555"
type => "text"
frozen => false
required => false
error => null
id => "phone-1-1"
label => "<label for="phone-1-1">Phone Number (..."
html => "<input maxlength="32" size="20" name=..."
Trying to print the smarty variable using:
{$form.phone-1-1.label}
fails because of the hyphens.
Any ideas how I get around that?
One workaround is:
{assign var="mykey" value="phone-1-1"}
{$form.$mykey.label}
The built-in Smarty function {assign} lets you create variables directly in the template.
http://www.smarty.net/docs/en/language.function.assign.tpl (for Smarty 3)
http://www.smarty.net/docsv2/en/language.custom.functions.tpl (for Smarty 2)
