Regex a regexed match in 1 search? Other minor regex questions - ruby

I have an email that has some HTML code that I'm looking to regex. I'm using a Gmail gem to read my emails, and Nokogiri fails when reading through Gmail, so I'm looking for a regex solution.
What I'd like to do is scan for the section that is labeled important title and then look at the unordered list within that section, capturing the URLs. The HTML code that is labeled important title is provided below.
I wasn't sure how to do this, so I thought the proper way was to regex for the section called important title and capture everything up to the end of the unordered list, then find the links within this match.
To find the links, I used this regex which works fine: (?:")([^"]*)(?:" )
To capture the section called important title, however, I wanted to simply use the regex (?:important title).*(?:<\/ul>). From my understanding, that would look for important title, then as many characters as possible, followed by </ul>. However, on the HTML below it only captures up to </h3>; the newline character makes it stop. That leads to one of my questions: why does ., which is supposed to match any character, not match a newline? If that's by design, I don't need more than a simple 'it's by design'...
So assuming it's by design, I then tried (?:important title)((.|\s)*)(?:<\/ul>) and that's giving me 2 matches for some reason. The first matches the entire code that I need, stopping at </ul> and the second match is literally just a blank string. I don't get why that's the case...
Finally my last and most important question is, do I need to do 2 regexes to get the links? Or is there a way to combine both regexes so that my "link regex" only searches within my "section regex"?
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=
.32434" target="_blank">first link»</a></li>
<li><a href="http://www.link.com/234234468=
.059400" target="_blank">second link »</a></li>
<li><a href="http://www.link.com/287=
.059400" target="_blank">third link»</a></li>
<li><a href="http://www.link.com/4234501=
.059400" target="_blank">fourth link»</a></li>
<li><a href="http://www.link.com/34517=
.059400" target="_blank">5th link»</a></li>
</ul>

An example with nokogiri:
# encoding: utf-8
require 'nokogiri'
html_doc = '''
<h3>the important title </h3>
<ul>
<li>first link»</li>
<li>second link »</li>
<li>third link»</li>
<li>fourth link»</li>
<li>5th link»</li>
</ul>
'''
doc = Nokogiri::HTML.parse(html_doc)
doc.search('//h3[text()="the important title "]/following-sibling::ul[1]/li/a/@href').each do |link|
puts link.content
end
The regex way uses the anchor \G, which matches the position at the end of the previous match. Since this anchor also matches at the start of the string before any match has been made, you must add (?!\A) (not at the start of the string) to forbid this case, so that the first match can only happen through the second entry point.
To be more readable, the whole pattern uses extended mode (also called verbose, comment, or free-spacing mode), which allows comments inside the pattern and ignores whitespace. This mode can be switched on or off inline with (?x) and (?-x).
pattern = Regexp.new('
# entry points
(?:
\G (?!\A) # contiguous to the precedent match
|
<h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s*
)
<li>
<a \s+ href=" (?<url> [^"]* ) " [^>]* >
(?<txt> (?> [^<]+ | <(?!/a>) )* )
\s* </a> \s* </li> \s*', Regexp::EXTENDED | Regexp::IGNORECASE)
html_doc.scan(pattern) do |url, txt|
puts "\nurl: #{url}\ntxt: #{txt}"
end
The first match uses the second entry point: <h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s* and all subsequent matches use the first: \G (?!\A)
After the last match, since there are no more contiguous li tags (only a closing ul tag follows), the pattern fails. To succeed again, the regex engine has to find the second entry point again.
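For comparison, here is a minimal sketch of the two-pass approach described in the question, using only the standard library. The sample HTML is a shortened version of the question's email body plus a made-up second section, and the variable names are illustrative. The /m flag is what lets . match newlines in Ruby, and the lazy .*? stops at the first </ul> instead of the last one.

```ruby
# Shortened sample modelled on the question's email; the second section
# ("other") is a made-up addition to show that it is left out.
html = <<~'HTML'
  <h3>the important title </h3>
  <ul>
  <li><a href="http://www.link.com/23232=
  .32434" target="_blank">first link</a></li>
  <li><a href="http://www.link.com/287=
  .059400" target="_blank">third link</a></li>
  </ul>
  <h3>other</h3>
  <ul>
  <li><a href="http://www.other.com/1" target="_blank">x</a></li>
  </ul>
HTML

# Pass 1: /m makes . match newlines; .*? is lazy, so it stops at the first </ul>.
section = html[/important title.*?<\/ul>/m]

# Pass 2: capture each href value, then remove the line break inside it.
urls = section.to_s.scan(/href="([^"]*)"/).flatten.map { |u| u.delete("\n") }

puts urls
```

Because scan is only run on section, the link regex never sees anything outside the important title block, which is the effect the single \G pattern achieves in one pass.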

I have html that I'm looking to regex.
Use the nokogiri gem: http://nokogiri.org/
It's the de facto standard for searching HTML. Ignore the requirements that are listed; they are out of date.
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::HTML(open('http://www.some_site.com'))
html_doc = Nokogiri::HTML(<<'END_OF_HTML')
<h3>not important</h3>
<ul>
<li>first link»</li>
<li>second link »</li>
</ul>
<h3>the important title </h3>
<ul>
<li>first link</li>
<li>second link</li>
<li>third link</li>
<li>fourth link</li>
<li>5th link</li>
</ul>
END_OF_HTML
a_tags = html_doc.xpath(
'//h3[text()="the important title "]/following-sibling::ul[1]//a'
)
a_tags.each do |tag|
puts tag.content
puts tag['href']
end
--output:--
first link
http://www.link.com/23232=.32434
second link
http://www.link.com/234234468=.059400
third link
http://www.link.com/287=.059400
fourth link
http://www.link.com/4234501=.059400
5th link
http://www.link.com/34517=.059400

Related

Split using multiple keywords using regex

Well I have a string containing (actually without line breaks)
<td class="coll-1 name">
<i class="flaticon-divx"></i>
SAME stuff here
<span class="comments"><i class="flaticon-message"></i>1</span>
</td>
and I want an array to store the string which is split using href=" and /"> specifically. How can I do that? I have tried this:
new_array=my_string.split(/ href=" , \/">/)
Edit:
.split(/href="/)
This works well, but not together with the other part.
.split(/\/">/)
Similarly, this works too, but I am unable to combine them into one line.
Given this string:
string = <<-HTML
<td class="coll-1 name">
<i class="flaticon-divx"></i>
SAME stuff here
<span class="comments"><i class="flaticon-message"></i>1</span>
</td>
HTML
and assuming that the correct link is the one without icon class, you could use the CSS selector a:not(.icon), for example via Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(string)
doc.at_css('a:not(.icon)')[:href]
#=> "/torrent/2349324/some-stuuf-here/"
You can take advantage of lookahead and lookbehind, like this:
my_string.scan(/(?<=href=").*(?=\/">)/)
#=> ["/torrent/2349324/some-stuuf-here"]
This will return an array with all occurrences of href=" ... /"> with only the ... part (which can be any string).
Or you can get everything that matches href=".../"> and then remove href=" and the trailing /">, something like this:
my_string.scan(/(?:href=".*\/">)/).map { |e| e.gsub(/(href="|\/">)/, "") }
#=> ["/torrent/2349324/some-stuuf-here"]
This will return an array of all instances that match /href=".*\/">/.
How do i split using 2 keywords using regex
You can use a | to denote an or in regex, like this:
my_string.split(/(?:href="|\/">)/)
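As a minimal sketch of the alternation approach: the one-line input below is hypothetical (the anchor tag is modelled on the href value shown in the answers above), and parts is just an illustrative name.

```ruby
# Hypothetical one-line input; the <a> tag is an assumption based on the
# href value that appears in the answers' output.
my_string = '<td><a href="/torrent/2349324/some-stuuf-here/">SAME stuff here</a></td>'

# Split on either delimiter; the slash has to be escaped inside /.../.
parts = my_string.split(/href="|\/">/)

p parts
```

The middle element of parts is the text between href=" and /">; the non-capturing group from the line above is optional here because the alternation already spans the whole pattern.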

What does <<CONSTANT do? [duplicate]

return <<-HTML
<li>
Link-Title
</li>
HTML
What are <<-HTML on the first line and HTML on the last line for?
It's a heredoc.
http://en.wikipedia.org/wiki/Here_document#Ruby
That's a here document. Basically, it's a multi-line string literal.
The lines after the <<-HTML line become the content of the string, newlines included, until the end marker is reached, which in this case is HTML.
To explicitly answer the question, this snippet returns the string:
<li>
Link-Title
</li>
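A short sketch of how the two common Ruby heredoc forms treat indentation; the method names are made up for illustration. <<- only allows the end marker to be indented, while <<~ (Ruby 2.3 and later) also strips the common leading indentation from the content.

```ruby
# <<- keeps the content's indentation exactly as written.
def item_html_dash
  <<-HTML
    <li>
      Link-Title
    </li>
  HTML
end

# <<~ removes the indentation shared by all content lines.
def item_html_squiggly
  <<~HTML
    <li>
      Link-Title
    </li>
  HTML
end

puts item_html_dash
puts item_html_squiggly
```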

preg_match_all skips one nested tag

if you look at this tag:
$text = '<div class="inner">
<div class="left">
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
</div>
<div class="right">
<span class="red">10:00 - 14:00</span>
</div>
</div>';
I use this to preg_match:
preg_match_all("'<div class=\"inner\">(.*?)</div>'si", $text, $match); // de ul tags
$match[1] = array_splice($match[0], 0);
foreach($match[1] as $val) // hele pagina
{
echo $val;
}
Well i tried many things, but i only get whats between and never what i need for , what am i doing wrong?
Are you trying to get everything between the beginning and ending div tags? If so, then you're really close. All you'd need to do is remove the question mark ? from your expression. The question mark makes the .* lazy: it matches as little as possible, so it stops at the first point where the next item in the REGEX can match. In this case, the next item is a closing div tag, so it stops at the first one it finds. If you leave the question mark out, .* is greedy and keeps matching until the last closing div tag it can find.
$text = '<div class="inner">
<div class="left">
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
</div>
<div class="right">
<span class="red">10:00 - 14:00</span>
</div>
</div>';
preg_match_all("'<div class=\"inner\">(.*)</div>'si", $text, $match);
print "<pre><font color=red>"; print_r($match); print "</font></pre>";
If you're trying to pull out each item in a div, then you'd probably want to consider using DOM instead of REGEX to tackle this problem. But since you used the preg-match tag, then here it is in REGEX:
preg_match_all('~<div class="(?!inner).*?>\K(.*?)(?=</div>)~ims', $text, $matches);
print "<PRE><FONT COLOR=BLUE>"; print_r($matches[1]); print "</FONT></PRE>";
That gives you this:
Array
(
[0] =>
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
[1] =>
<span class="red">10:00 - 14:00</span>
)
Explanation of the REGEX:
<div class=" (?!inner) .*? > \K (.*?) (?=</div>)
^ ^ ^ ^ ^ ^ ^
1 2 3 4 5 6 7
<div class=" Look for a literal opening div tag <div, followed by a space, followed by the word class, followed by an equal sign, followed by a quotation mark.
(?!inner) This is a negative lookahead (?!) that makes sure the word inner is not coming up next.
.*? Matches any one character ., zero or more times *, all the way up until it hits the next item in our regular expression ?. In this case, it will stop once it finds a closing HTML bracket.
> Find a closing HTML bracket.
\K This tells the expression to forget everything it has matched so far and start matching again from here. This basically makes sure that the first part of the expression is there, but does not store it for us to work with.
(.*?) Same as number 3, except we use parenthesis () around it so we can capture it and do something with it later.
(?=</div>) This is a positive lookahead (?=) that makes sure the closing div tag </div> is coming up at the end of the expression, but does not capture it.
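The same ideas can be sketched in Ruby, as an illustration rather than a port of the PHP above. The input is a shortened copy of the question's markup, the variable names are made up, and Ruby's /m flag plays the role of PHP's s modifier (it lets . match newlines).

```ruby
text = <<~HTML
  <div class="inner">
  <div class="left">
  <h4>text </h4>
  </div>
  <div class="right">
  <span class="red">10:00 - 14:00</span>
  </div>
  </div>
HTML

# Lazy: stops at the first </div>, so the "right" block is missed.
lazy = text[/<div class="inner">(.*?)<\/div>/m, 1]

# Greedy: runs to the last </div>, so both nested blocks are included.
greedy = text[/<div class="inner">(.*)<\/div>/m, 1]

# \K drops the opening tag from the match; the lookahead leaves </div> unconsumed.
inner_blocks = text.scan(/<div class="(?!inner).*?>\K.*?(?=<\/div>)/m).map(&:strip)

p inner_blocks
```

Since the pattern passed to scan has no capture groups, scan returns the whole match for each hit, which starts after \K.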

I need a regex to find a url which is not inside any html tag or an attribute value of any html tag

I have html contents in following text.
"This is my text to be parsed which contains url
http://someurl.com?param1=foo&params2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
Need a regex that will convert plain urls to hyperlink(without tampering existing hyperlink)
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo&params2=bar">
http://someurl.com?param1=foo&params2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span> "
Disclaimer: You shouldn't use regex for this task; use an HTML parser. This is a POC to demonstrate that it's possible if you expect well-formed HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean ?
( : group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : match optionally w., ww. or www.
[^\s]*? : match anything except whitespace zero or more times ungreedy
(?:\.[a-z]+)+) : match a dot followed by [a-z] character(s), repeat this one or more times
(?! : negative lookahead
[^<]*? : match anything except < zero or more times ungreedy
(?:<\/\w+>|\/?>) : match a closing tag or /> or >
) : end of lookahead
) : end of group 1
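A small sketch of how the pattern could be applied with gsub in Ruby. The sample text is a made-up, shortened variant of the question's input, and the constant and variable names are illustrative; the regex itself is copied unchanged from above (written with %r{} so the slashes need no escaping).

```ruby
# The pattern from the answer, unchanged apart from the literal syntax.
URL_OUTSIDE_TAGS = %r{(https?://(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:</\w+>|/?>))}

# Made-up sample: one bare URL, one URL in an attribute, one inside link text.
text = %(see http://someurl.com now <a href="http://kept.example">some http://inner.example text</a>)

# Wrap only the URLs the pattern accepts; \1 is the captured URL.
linked = text.gsub(URL_OUTSIDE_TAGS, '<a href="\1">\1</a>')

puts linked
```

Only the bare URL is wrapped. The other two are rejected by the negative lookahead because a > or a closing tag follows before the next <.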
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. But it might be tricky if you have nested elements of the same type.
Maybe try http://rubular.com/; it has some regex tips that can help you get the desired output.
I would do something like this:
require 'nokogiri'
require 'uri'
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF
doc.search('*').each{|n| n.replace "\n"}
URI.extract doc.text
#=> ["http://someurl.com"]

xpath expression to remove whitespace

I have this HTML:
<tr class="even expanded first">
<td class="score-time status">
<a href="/matches/2012/08/02/europe/uefa-cup/">
16 : 00
</a>
</td>
</tr>
I want to extract the (16 : 00) string without the extra whitespace. Is this possible?
I. Use this single XPath expression:
translate(normalize-space(/tr/td/a), ' ', '')
Explanation:
normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.
translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
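To make the two steps concrete, here is a plain-Ruby imitation of what normalize-space() and translate(..., ' ', '') do to the string value of the a element. This is only an emulation on a Ruby string, not an XPath evaluation, and the raw value is an assumption about how the text node is indented.

```ruby
# Assumed string value of the <a> element, including its surrounding whitespace.
raw = "\n      16 : 00\n    "

# normalize-space(): collapse runs of space, tab, CR and LF, then trim the ends.
normalized = raw.gsub(/[ \t\r\n]+/, " ").strip

# translate(normalized, ' ', ''): remove every remaining space.
compact = normalized.delete(" ")

puts normalized
puts compact
```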
II. Alternatively:
translate(/tr/td/a, ' &#9;&#10;&#13;', '')
Please try the below xpath expression :
//td[@class='score-time status']/a[normalize-space() = '16 : 00']
You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]
I came across this thread when I was having my own issue similar to above.
HTML
<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
<a href="/nsomar/OAStackView/releases/tag/1.0.1">
1.0.1
</a>
XPath start command
tree.xpath('//div[@class="d-flex"]/h4/a/text()')
However this grabbed random whitespace and gave me the output of:
['\n ', '\n 1.0.1\n ']
Using normalize-space, it removed the first blank space node and left me with just what I wanted
tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')
['\n 1.0.1\n ']
I could then grab the first element of the list, and use strip() to remove any further whitespace
XPath final command
tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()
Which left me with exactly what I required:
1.0.1
You can check whether text() nodes are empty:
/path/text()[not(.='')]
This may be useful with axes like following-sibling:: if these are not containers, or with child::.
You can use string() or, in XPath 2.0, the regex functions matches() and replace().
NOTE: some comments say that XPath cannot do string manipulation. Even if it's not really designed for that, you can do basic things: contains() and starts-with() in XPath 1.0, plus replace() in XPath 2.0.
If you want to check whitespace nodes it's much harder, as you will generally have a node-list result set, and most XPath functions, like matches() or replace(), only operate on one node.
You can separate node and string manipulation:
So you may use XPath to retrieve a container, or a list of text nodes, and then process it with another language (Java, PHP, Python, or Perl, for instance).
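A minimal sketch of that split, with the string handling done in Ruby. The text_nodes array stands in for whatever list of text-node strings an XPath query returned (the values mirror the output shown earlier in this thread); no XPath library is involved here.

```ruby
# Stand-in for the text nodes returned by an XPath query such as .../a/text().
text_nodes = ["\n    ", "\n      1.0.1\n    "]

# Trim each node, then drop the ones that were whitespace only.
values = text_nodes.map(&:strip).reject(&:empty?)

puts values.first
```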
