preg_match_all skips one nested tag - preg-match

if you look at this tag:
$text = '<div class="inner">
<div class="left">
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
</div>
<div class="right">
<span class="red">10:00 - 14:00</span>
</div>
</div>';
I use this to preg_match:
preg_match_all("'<div class=\"inner\">(.*?)</div>'si", $text, $match); // the ul tags
$match[1] = array_splice($match[0], 0);
foreach($match[1] as $val) // whole page
{
echo $val;
}
Well, I tried many things, but I only get the contents of the first nested div (left) and never the second one (right). What am I doing wrong?

Are you trying to get everything between the opening and closing tags of the outer div? If so, you're really close. All you need to do is remove the question mark ? from your expression. The question mark makes the quantifier lazy: .*? matches as few characters as possible, so it stops at the first closing div tag it finds, which here belongs to the nested left div. Without it, .* is greedy and keeps matching until the last closing div tag it can find.
$text = '<div class="inner">
<div class="left">
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
</div>
<div class="right">
<span class="red">10:00 - 14:00</span>
</div>
</div>';
preg_match_all("'<div class=\"inner\">(.*)</div>'si", $text, $match);
print "<pre><font color=red>"; print_r($match); print "</font></pre>";
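The difference between the lazy and the greedy quantifier can also be reproduced outside PHP. The following is a minimal Ruby sketch, not the asker's code; the constant names are made up for illustration, and Ruby's /m flag is used because it plays the role of PHP's s flag (the dot also matches newlines).

```ruby
# Shortened copy of the markup from the question.
TEXT = <<~HTML
  <div class="inner">
  <div class="left">
  <h4>text </h4>
  </div>
  <div class="right">
  <span class="red">10:00 - 14:00</span>
  </div>
  </div>
HTML

# Lazy: .*? stops at the first </div>, which closes the "left" div.
LAZY = TEXT[/<div class="inner">(.*?)<\/div>/mi, 1]

# Greedy: .* runs on to the last </div> in the string.
GREEDY = TEXT[/<div class="inner">(.*)<\/div>/mi, 1]

puts "lazy capture:"
puts LAZY
puts "greedy capture:"
puts GREEDY
```

The lazy capture ends before the right div starts, while the greedy capture contains both nested divs.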
If you're trying to pull out each item in a div, then you'd probably want to consider using DOM instead of REGEX to tackle this problem. But since you used the preg-match tag, then here it is in REGEX:
preg_match_all('~<div class="(?!inner).*?>\K(.*?)(?=</div>)~ims', $text, $matches);
print "<PRE><FONT COLOR=BLUE>"; print_r($matches[1]); print "</FONT></PRE>";
That gives you this:
Array
(
[0] =>
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
[1] =>
<span class="red">10:00 - 14:00</span>
)
Explanation of the REGEX:
<div class=" (?!inner) .*? > \K (.*?) (?=</div>)
The expression has seven parts, in order:
1. <div class=" Look for a literal opening div tag <div, followed by a space, followed by the word class, followed by an equal sign, followed by a quotation mark.
2. (?!inner) This is a negative lookahead (?!) that makes sure the word inner is not coming up next.
3. .*? Matches any character (.), zero or more times (*), but as few times as possible (?), so it stops as soon as the next item in the expression can match. In this case, it stops at the first closing angle bracket.
4. > Find a closing angle bracket.
5. \K This tells the expression to forget everything it has matched so far and start the reported match from here. It makes sure that the first part of the expression is present, but does not return it to us.
6. (.*?) Same as number 3, except we use parentheses () around it so we can capture it and do something with it later.
7. (?=</div>) This is a positive lookahead (?=) that makes sure the closing div tag </div> comes next, but does not consume or capture it.
Here is a working demo of the code above
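For readers who want to try the same pattern without PHP, here is a hedged Ruby equivalent. It is only a sketch: the constant names are invented, and Ruby's /m flag stands in for PHP's s flag. Because of \K and the trailing lookahead, each reported match is just the content of one nested div.

```ruby
# Markup from the question.
MARKUP = <<~HTML
  <div class="inner">
  <div class="left">
  <h4>text </h4>
  <p>Abdijstreet 42b<br>2000 city </p>
  </div>
  <div class="right">
  <span class="red">10:00 - 14:00</span>
  </div>
  </div>
HTML

# Same seven parts as in the explanation: literal prefix, negative lookahead,
# lazy run up to ">", \K to drop the prefix, lazy content, lookahead for </div>.
NESTED_CONTENTS = MARKUP.scan(/<div class="(?!inner).*?>\K.*?(?=<\/div>)/mi)
                        .map(&:strip)

NESTED_CONTENTS.each_with_index do |content, i|
  puts "[#{i}] =>"
  puts content
end
```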

Related

XPath contains whole word only

I saw the existing question with the same title but that was a different question.
Let's say that I want to find elements that have "conGraph" in the class. I have tried
//div[contains(@class,'conGraph')]
It correctly got
<div class='conGraph mr'>
but it also falsely got
<div class='conGraph_wrap'>
which is not the same class at all. For this case only, I could use 'conGraph ' and get away with it, but I would like to know the general solution for future use.
In short, I want to get elements whose class contains "word" like "word", "word word2" or "word3 word", etc, but not like "words" or "fake_word" or "sword". Is that possible?
One option is to use four conditions: an exact match plus three contains() calls that account for surrounding whitespace.
The first condition looks for the exact term as the whole attribute value. The second, third and fourth cover the whitespace variants.
Data :
<div class='word'></div>
<div class='word word2'></div>
<div class='word word3'></div>
<div class='swords word'></div>
<div class='swords word words'></div>
<div class='words'></div>
<div class='fake_word'></div>
<div class='sword'></div>
XPath :
//div[@class="word" or contains(@class,"word ") or contains(@class," word") or contains(@class," word ")]
Output :
<div class='word'></div>
<div class='word word2'></div>
<div class='word word3'></div>
<div class='swords word'></div>
<div class='swords word words'></div>
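One caveat is that contains(@class,"word ") also accepts a value such as 'sword x', because that string contains "word " too. A widely used XPath 1.0 idiom that avoids this is contains(concat(' ', normalize-space(@class), ' '), ' word '). The Ruby sketch below only mirrors the string logic of both checks on plain strings; it does not evaluate XPath, and the method and constant names are made up for illustration.

```ruby
CLASS_VALUES = [
  'word', 'word word2', 'word word3', 'swords word', 'swords word words',
  'words', 'fake_word', 'sword', 'sword x'
].freeze

# Mirrors the four-condition XPath from the answer.
def four_conditions?(class_attr, token)
  class_attr == token ||
    class_attr.include?("#{token} ") ||
    class_attr.include?(" #{token}") ||
    class_attr.include?(" #{token} ")
end

# Mirrors contains(concat(' ', normalize-space(@class), ' '), ' word '):
# pad the normalized value with spaces, then look for the padded token.
def class_token?(class_attr, token)
  " #{class_attr.split.join(' ')} ".include?(" #{token} ")
end

FOUR_CONDITION_HITS = CLASS_VALUES.select { |v| four_conditions?(v, 'word') }
WHOLE_WORD_HITS     = CLASS_VALUES.select { |v| class_token?(v, 'word') }

puts "four conditions: #{FOUR_CONDITION_HITS.inspect}"
puts "whole word only: #{WHOLE_WORD_HITS.inspect}"
```

On the sample data from the answer both checks agree; they differ only on the extra 'sword x' value added here.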

XPath: how to find a node that does not contain text?

I have HTML like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div that does not contain text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. And if I don't know the text that the other divs have, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
You need not() combined with normalize-space():
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']

Select all nodes between two elements excluding unnecessary element from the intersection using XPath

There’s a document structured as follows:
<div class="document">
<div class="title">
<AAA/>
</div class="title">
<div class="lead">
<BBB/>
</div class="lead">
<div class="photo">
<CCC/>
</div class="photo">
<div class="text">
<!-- tags in text sections can vary. they can be `div` or `p` or anything. -->
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="text">
<div class="more_text">
<DDD>
<EEE/>
<DDD/>
<CCC/>
<FFF/>
<FFF>
<GGG/>
</FFF>
</DDD>
</div class="more_text">
<div class="other_stuff">
<DDD/>
</div class="other_stuff">
</div class="document">
The task is to grab all the elements between <div class="lead"> and <div class="other_stuff"> except the <div class="photo"> element.
The Kayessian method for node-set intersection $ns1[count(.|$ns2) = count($ns2)] works perfectly. After substituting $ns1 with //*[@class="lead"]/following::* and $ns2 with //*[@class="other_stuff"]/preceding::*,
the working code looks like this:
//*[@class="lead"]/following::*[count(. | //*[@class="other_stuff"]/preceding::*)
= count(//*[@class="other_stuff"]/preceding::*)]/text()
It selects everything between <div class="lead"> and <div class="other_stuff"> including the <div class="photo"> element. I tried several ways to insert not() selector in the formula itself
//*[@class="lead" and not(@class="photo ")]/following::*
//*[@class="lead"]/following::*[not(@class="photo ")]
//*[@class="lead"]/following::*[not(self::class="photo ")]
(the same things with /preceding::* part) but they don't work. It looks like this not() method is ignored – the <div class="photo"> element remains in the selection.
Question 1: How to exclude the unnecessary element from this intersection?
It’s not an option to select from <div class="photo"> element excluding it automatically because in other documents it can appear in any position or doesn't appear at all.
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
It initially selects everything up to the end and to the beginning of the whole document. Could it be better to specify the exact end point for the following:: and preceding:: ways? I tried //*[@class="lead"]/following::[@class="other_stuff"] but it doesn’t seem to work.
Question 1: How to exclude the unnecessary element from this intersection?
Adding another predicate, [not(self::div[@class='photo'])] in this case, to your working XPath should do. For this particular case, the entire XPath would look like this (formatted for readability) :
//*[@class="lead"]
  /following::*[
    count(. | //*[@class="other_stuff"]/preceding::*)
    =
    count(//*[@class="other_stuff"]/preceding::*)
  ][not(self::div[@class='photo'])]
  /text()
Question 2 (additional): Is it OK to use * after following:: and preceding:: in this case?
I'm not sure if it would be 'better'. What I can tell is that following::[@class="other_stuff"] is an invalid expression. You need to name the element to which the predicate applies, for example 'any element', following::*[@class="other_stuff"], or just 'div', following::div[@class="other_stuff"].
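The set logic behind the intersection formula and the extra predicate can be sketched with plain Ruby arrays. This is only an illustration under simplifying assumptions: the document is reduced to a flat list of the class names of the top-level sections in document order (descendants are ignored), and all helper names are hypothetical.

```ruby
# Top-level sections of <div class="document">, in document order.
SECTIONS = %w[title lead photo text more_text other_stuff].freeze

# Stand-in for the following:: axis on this flat list.
def following(nodes, name)
  nodes.drop(nodes.index(name) + 1)
end

# Stand-in for the preceding:: axis on this flat list.
def preceding(nodes, name)
  nodes.take(nodes.index(name))
end

# Kayessian intersection: keep a node of ns1 when adding it to ns2
# does not make ns2 any larger, i.e. when it is already in ns2.
def kayessian_intersection(ns1, ns2)
  ns1.select { |node| ([node] | ns2).size == ns2.size }
end

NS1 = following(SECTIONS, 'lead')
NS2 = preceding(SECTIONS, 'other_stuff')
BETWEEN = kayessian_intersection(NS1, NS2)

# Equivalent of the extra predicate [not(self::div[@class='photo'])].
SELECTED = BETWEEN.reject { |node| node == 'photo' }

puts "between:  #{BETWEEN.inspect}"
puts "selected: #{SELECTED.inspect}"
```

The exclusion is a separate filter applied to the nodes that survive the intersection, which is why it goes in its own predicate after the count() predicate rather than on the lead element.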

Regex a regexed match in 1 search? Other minor regex questions

I have an email that has some html code that I'm looking to regex. I'm using a gmail gem to read my emails and using nokogiri fails when reading through gmail. Thus I'm looking for a regex solution
What I'd like to do is to scan for the section that is labeled important title and then look at the unordered list within that section, capturing the urls. The html code that is labeled important title is provided below.
I wasn't sure how to do this so I thought the proper way to do it, was to regex for the section called important title and capture everything up to the end of the unordered list. Then within this match, subsequently find the links.
To find the links, I used this regex which works fine: (?:")([^"]*)(?:" )
To capture the section called important title, however, I wanted to simply use the following regex: (?:important title).*(?:<\/ul>). From my understanding, that would look for important title, then as many characters as possible, followed by </ul>. However, on the HTML below it only captures up to </h3>. The newline character is causing it to stop. Which is one of my questions: why is ., which is supposed to capture all characters, not capturing a newline character? If that's by design, I don't need more than a simple "it's by design"...
So assuming it's by design, I then tried (?:important title)((.|\s)*)(?:<\/ul>) and that's giving me 2 matches for some reason. The first matches the entire code that I need, stopping at </ul> and the second match is literally just a blank string. I don't get why that's the case...
Finally my last and most important question is, do I need to do 2 regexes to get the links? Or is there a way to combine both regexes so that my "link regex" only searches within my "section regex"?
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=
.32434" target="_blank">first link»</a></li>
<li><a href="http://www.link.com/234234468=
.059400" target="_blank">second link »</a></li>
<li><a href="http://www.link.com/287=
.059400" target="_blank">third link»</a></li>
<li><a href="http://www.link.com/4234501=
.059400" target="_blank">fourth link»</a></li>
<li><a href="http://www.link.com/34517=
.059400" target="_blank">5th link»</a></li>
</ul>
An example with nokogiri:
# encoding: utf-8
require 'nokogiri'
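html_doc = '''
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=
.32434" target="_blank">first link»</a></li>
<li><a href="http://www.link.com/234234468=
.059400" target="_blank">second link »</a></li>
<li><a href="http://www.link.com/287=
.059400" target="_blank">third link»</a></li>
<li><a href="http://www.link.com/4234501=
.059400" target="_blank">fourth link»</a></li>
<li><a href="http://www.link.com/34517=
.059400" target="_blank">5th link»</a></li>
</ul>
'''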
doc = Nokogiri::HTML.parse(html_doc)
doc.search('//h3[text()="the important title "]/following-sibling::ul[1]/li/a/@href').each do |link|
puts link.content
end
The regex way uses the anchor \G, which matches the position at the end of the previous match. Since this anchor is initialized to the start of the string, you must add (?!\A) (not at the start of the string) to forbid that case, so that the very first match can only come from the second entry point.
To be more readable, the whole pattern uses the extended mode (also called verbose, comment, or free-spacing mode), which allows comments inside the pattern and ignores whitespace. This mode can be switched on or off inline with (?x) and (?-x).
pattern = Regexp.new('
# entry points
(?:
\G (?!\A) # contiguous to the precedent match
|
<h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s*
)
<li>
<a \s+ href=" (?<url> [^"]* ) " [^>]* >
(?<txt> (?> [^<]+ | <(?!/a>) )* )
\s* </a> \s* </li> \s*', Regexp::EXTENDED | Regexp::IGNORECASE)
html_doc.scan(pattern) do |url, txt|
puts "\nurl: #{url}\ntxt: #{txt}"
end
The first match uses the second entry point: <h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s* and all subsequent matches use the first: \G (?!\A)
After the last match, since there are no more contiguous li tags (there is only a closing ul tag), the pattern fails. To succeed again, the regex engine must find a new occurrence of the second entry point.
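To see the two entry points at work, here is a reduced, self-contained variant of the same idea. It is a sketch rather than the pattern above: the example.com URLs and the extra "not important" list are invented for the demonstration, the link-text group is simplified to [^<]*, and the literal title is written with escaped spaces instead of toggling (?-x).

```ruby
MAIL_HTML = <<~DOC
  <h3>not important</h3>
  <ul>
  <li><a href="http://example.com/skip">skipped link</a></li>
  </ul>
  <h3>the important title </h3>
  <ul>
  <li><a href="http://example.com/1" target="_blank">first link</a></li>
  <li><a href="http://example.com/2" target="_blank">second link</a></li>
  </ul>
DOC

LINK_PATTERN = %r{
  (?:
    \G (?!\A)        # continue exactly where the previous match ended
    |
    <h3> \s* the\ important\ title \s* </h3> \s* <ul> \s*   # or enter at the wanted section
  )
  <li> \s* <a \s+ href=" (?<url> [^"]* ) " [^>]* >
  (?<txt> [^<]* )
  </a> \s* </li> \s*
}xi

# With capture groups, scan yields one [url, txt] pair per match.
LINKS = MAIL_HTML.scan(LINK_PATTERN)
LINKS.each { |url, txt| puts "#{txt}: #{url}" }
```

The link in the first list is never returned: \G can only continue from the end of a previous match, and (?!\A) rules out the start of the string, so the only way in is through the heading of the wanted section.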
I have html that I'm looking to regex.
Use the nokogiri gem: http://nokogiri.org/
It's the de facto standard for searching html. Ignore the requirements that are listed--they are out of date.
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::HTML(open('http://www.some_site.com'))
html_doc = Nokogiri::HTML(<<'END_OF_HTML')
<h3>not important</h3>
<ul>
<li>first link»</li>
<li>second link »</li>
</ul>
<h3>the important title </h3>
<ul>
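<li><a href="http://www.link.com/23232=.32434" target="_blank">first link</a></li>
<li><a href="http://www.link.com/234234468=.059400" target="_blank">second link</a></li>
<li><a href="http://www.link.com/287=.059400" target="_blank">third link</a></li>
<li><a href="http://www.link.com/4234501=.059400" target="_blank">fourth link</a></li>
<li><a href="http://www.link.com/34517=.059400" target="_blank">5th link</a></li>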
</ul>
END_OF_HTML
a_tags = html_doc.xpath(
'//h3[text()="the important title "]/following-sibling::ul[1]//a'
)
a_tags.each do |tag|
puts tag.content
puts tag['href']
end
--output:--
first link
http://www.link.com/23232=.32434
second link
http://www.link.com/234234468=.059400
third link
http://www.link.com/287=.059400
fourth link
http://www.link.com/4234501=.059400
5th link
http://www.link.com/34517=.059400
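On the two side questions from the post: in Ruby, . does not match a newline by default, and the /m flag changes that. The extra, seemingly blank result from ((.|\s)*) is the second capture group, which only keeps what its last repetition matched (here a newline). The sketch below uses a made-up, shortened snippet and invented constant names.

```ruby
SNIPPET = "important title </h3>\n<ul>\n<li>item</li>\n</ul>\n"

# Without /m the dot cannot cross a line break, so </ul> is never reached.
WITHOUT_M = SNIPPET[/important title.*<\/ul>/]

# With /m the dot also matches newlines.
WITH_M = SNIPPET[/important title.*<\/ul>/m]

# Two pairs of parentheses mean two captures: the outer group holds the
# whole run, the inner group only the last character it matched.
CAPTURES = SNIPPET.match(/important title((.|\s)*)<\/ul>/).captures

p WITHOUT_M
p WITH_M
p CAPTURES
```

Writing the inner group as non-capturing, (?:.|\s)*, or simply using .* with /m, leaves a single capture.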

Remove indentation from code

I'm trying to create a function that removes extraneous starting tabs from code to make it display more neatly. As in, I would like my function to turn this:
		<div>
			<div>
				<p>Blah</p>
			</div>
		</div>
into this:
<div>
	<div>
		<p>Blah</p>
	</div>
</div>
(The goal of all this is to create a Rails partial into which I can paste formatted code to be displayed in a pre tag justified to the left).
So far, I've got this, but it's erroring, and I don't know why. Never used gsub before, so I'm guessing the problem is there (though the debugging notes also point at the first "end" line):
def tab_stripped(code)
# find number of tabs in first line
char_array = code.split(//)
counter = 0
char_array.each do |c|
counter ++ if c == "\t"
break if c != "\t"
end
# delete that number of tabs from the beginning of each line
start_tabs = ""
counter.times do
start_tabs += "\t"
end
code.gsub!(start_tabs, '')
code
end
Any ideas?
One from my personal library (with minor modifications):
class String
def unindent; gsub(/^#{scan(/^\s+/).min}/, "") end
end
It is more general than what you are asking for. It takes care of not just tabs, but spaces as well, and it does not adjust to the first line, but to the least indented line.
puts <<X.unindent
    <div>
      <div>
        <p>Blah</p>
      </div>
    </div>
X
gives:
<div>
  <div>
    <p>Blah</p>
  </div>
</div>
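As for why the original function errors: Ruby has no ++ operator, so counter ++ if ... does not parse; counter += 1 is the usual spelling. In addition, gsub! with a plain string removes that run of tabs wherever it occurs, not only at the start of a line. A minimal sketch that keeps the asker's approach (count the tabs on the first line, strip that many from each line) could look like this; it is one possible fix, not the answerer's method.

```ruby
def tab_stripped(code)
  # Number of tabs that start the first line.
  leading_tabs = code[/\A\t*/].length

  # Remove at most that many tabs, and only at the beginning of each line.
  code.gsub(/^\t{0,#{leading_tabs}}/, '')
end

sample = "\t\t<div>\n\t\t\t<p>Blah</p>\n\t\t</div>\n"
puts tab_stripped(sample)
```

Unlike the unindent method above, this version only handles tabs and takes the first line as its reference.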
