return <<-HTML
<li>
Link-Title
</li>
HTML
What are <<-HTML on the first line and HTML on the last line for?
It's a heredoc.
http://en.wikipedia.org/wiki/Here_document#Ruby
That's a here document. Basically, it's a multi-line string literal.
The lines after the one containing <<-HTML are taken literally as the string, newlines included, until the end marker is reached, which in this case is HTML. The dash in <<- allows the end marker to be indented.
To explicitly answer the question, this snippet returns the string:
<li>
Link-Title
</li>
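A minimal sketch of the same heredoc, wrapped in a method so the returned value can be inspected. The method name link_item is made up for illustration; it is not part of the original snippet.

```ruby
# link_item is a hypothetical name used only for this illustration.
def link_item
  # <<-HTML opens the heredoc; the dash lets the closing HTML be indented.
  return <<-HTML
<li>
Link-Title
</li>
  HTML
end

# The heredoc body is an ordinary String, including the trailing newline.
p link_item
```

Since the value is a plain String, anything that works on strings (concatenation, gsub, interpolation with #{...}) works on it too.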
Related
I have an email that has some html code that I'm looking to regex. I'm using a gmail gem to read my emails and using nokogiri fails when reading through gmail. Thus I'm looking for a regex solution
What I'd like to do is to scan for the section that is labeled important title and then look at the unordered list within that section, capturing the urls. The html code that is labeled important title is provided below.
I wasn't sure how to do this so I thought the proper way to do it, was to regex for the section called important title and capture everything up to the end of the unordered list. Then within this match, subsequently find the links.
To find the links, I used this regex which works fine: (?:")([^"]*)(?:" )
To capture the section called important title, however, I wanted to simply use the following regex: (?:important title).*(?:<\/ul>). From my understanding, that would look for important title, then as many characters as possible, followed by </ul>. However, on the HTML below it only captures up to </h3>. The newline character is causing it to stop. Which is one of my questions: why is ., which is supposed to match all characters, not matching a newline character? If that's by design, I don't need more than a simple 'it's by design'...
So assuming it's by design, I then tried (?:important title)((.|\s)*)(?:<\/ul>) and that's giving me 2 matches for some reason. The first matches the entire code that I need, stopping at </ul> and the second match is literally just a blank string. I don't get why that's the case...
Finally my last and most important question is, do I need to do 2 regexes to get the links? Or is there a way to combine both regexes so that my "link regex" only searches within my "section regex"?
<h3>the important title </h3>
<ul>
<li><a href="http://www.link.com/23232=
.32434" target="_blank">first link»</a></li>
<li><a href="http://www.link.com/234234468=
.059400" target="_blank">second link »</a></li>
<li><a href="http://www.link.com/287=
.059400" target="_blank">third link»</a></li>
<li><a href="http://www.link.com/4234501=
.059400" target="_blank">fourth link»</a></li>
<li><a href="http://www.link.com/34517=
.059400" target="_blank">5th link»</a></li>
</ul>
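On the two side questions about . and the extra blank result, here is a small sketch assuming Ruby's regex engine and a shortened copy of the HTML (the single list item is made up for brevity). It suggests that . skipping newlines is by design unless the /m flag is set, and that the "second match" is really the inner capturing group (.|\s), which holds only the last character it matched.

```ruby
# Shortened stand-in for the email HTML; the list item is illustrative only.
html = "<h3>the important title </h3>\n<ul>\n<li>one</li>\n</ul>\n"

# Without /m, `.` does not match a newline, so the match stops at the line end.
single_line = html[/important title.*/]

# With /m, `.` also matches newlines; the lazy `.*?` stops at the first </ul>.
section = html[/important title.*?<\/ul>/m]

# (.|\s) is itself a capturing group, so there are two captures:
# group 1 is the whole span, group 2 is the last single character matched.
m = html.match(/important title((.|\s)*)<\/ul>/)

p single_line
p section
p m[2]
```

Writing the inner group as non-capturing, (?:.|\s), or simply using /m, avoids the extra capture.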
An example with nokogiri:
# encoding: utf-8
require 'nokogiri'
html_doc = '''
<h3>the important title </h3>
<ul>
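<li><a href="http://www.link.com/23232=.32434">first link»</a></li>
<li><a href="http://www.link.com/234234468=.059400">second link »</a></li>
<li><a href="http://www.link.com/287=.059400">third link»</a></li>
<li><a href="http://www.link.com/4234501=.059400">fourth link»</a></li>
<li><a href="http://www.link.com/34517=.059400">5th link»</a></li>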
</ul>
'''
doc = Nokogiri::HTML.parse(html_doc)
doc.search('//h3[text()="the important title "]/following-sibling::ul[1]/li/a/@href').each do |link|
puts link.content
end
The regex way uses the anchor \G, which matches the position at the end of the previous match. Since this anchor is initialized to the start of the string at the beginning, you must add (?!\A) (not at the start of the string) to forbid this case, so that the first match can only be made through the second entry point.
To be more readable, the whole pattern uses the extended mode (also called verbose, comment, or free-spacing mode), which allows comments inside the pattern and ignores whitespace. This mode can be switched on or off inline with (?x) and (?-x).
pattern = Regexp.new('
# entry points
(?:
\G (?!\A) # contiguous to the previous match
|
<h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s*
)
<li>
<a \s+ href=" (?<url> [^"]* ) " [^>]* >
(?<txt> (?> [^<]+ | <(?!/a>) )* )
\s* </a> \s* </li> \s*', Regexp::EXTENDED | Regexp::IGNORECASE)
html_doc.scan(pattern) do |url, txt|
puts "\nurl: #{url}\ntxt: #{txt}"
end
The first match uses the second entry point: <h3> \s* (?-x)the important title(?x) \s* </h3> \s* <ul> \s* and all subsequent matches use the first: \G (?!\A)
After the last match, since there are no more contiguous li tags (there is only a closing ul tag), the pattern fails. To succeed again, the regex engine must find a new occurrence of the second entry point.
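The same two-entry-point idea can be sketched on a much smaller input. The sample text and the items: label below are made up for illustration; only the \G (?!\A) technique comes from the answer above.

```ruby
# Illustrative input: only the letters after "items:" should be collected.
text = "items: a,b,c; other: x,y"

pattern = /
  (?:
    \G (?!\A) ,      # continue right after the previous match
    |
    items: \s*       # or start at the section label
  )
  (\w)
/x

letters = text.scan(pattern).flatten
p letters  # => ["a", "b", "c"]
```

The letters x and y are skipped because they are neither contiguous to a previous match nor preceded by the items: label.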
I have html that I'm looking to regex.
Use the nokogiri gem: http://nokogiri.org/
It's the de facto standard for searching html. Ignore the requirements that are listed; they are out of date.
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::HTML(open('http://www.some_site.com'))
html_doc = Nokogiri::HTML(<<'END_OF_HTML')
<h3>not important</h3>
<ul>
<li>first link»</li>
<li>second link »</li>
</ul>
<h3>the important title </h3>
<ul>
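<li><a href="http://www.link.com/23232=.32434">first link</a></li>
<li><a href="http://www.link.com/234234468=.059400">second link</a></li>
<li><a href="http://www.link.com/287=.059400">third link</a></li>
<li><a href="http://www.link.com/4234501=.059400">fourth link</a></li>
<li><a href="http://www.link.com/34517=.059400">5th link</a></li>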
</ul>
END_OF_HTML
a_tags = html_doc.xpath(
'//h3[text()="the important title "]/following-sibling::ul[1]//a'
)
a_tags.each do |tag|
puts tag.content
puts tag['href']
end
--output:--
first link
http://www.link.com/23232=.32434
second link
http://www.link.com/234234468=.059400
third link
http://www.link.com/287=.059400
fourth link
http://www.link.com/4234501=.059400
5th link
http://www.link.com/34517=.059400
I have html contents in following text.
"This is my text to be parsed which contains url
http://someurl.com?param1=foo&params2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
Need a regex that will convert plain urls to hyperlink(without tampering existing hyperlink)
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo&params2=bar">
http://someurl.com?param1=foo&params2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span> "
Disclaimer: You shouldn't use regex for this task, use an html parser. This is a POC to demonstrate that it's possible if you expect a good formatted HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean ?
( : start of group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : match optionally w., ww. or www. (the unescaped . accepts any character)
[^\s]*? : match anything except whitespace zero or more times ungreedy
(?:\.[a-z]+)+ : match a dot followed by [a-z] character(s), repeat this one or more times
) : end of group 1
(?! : negative lookahead
[^<]*? : match anything except < zero or more times ungreedy
(?:<\/\w+>|\/?>) : match a closing tag or /> or >
) : end of lookahead
regex101 online demo
rubular online demo
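As a rough illustration of how that pattern could be applied in Ruby, here is a sketch using String#gsub on a shortened sample. The sample text and the variable names are assumptions for the demo, and the same caveat applies: this only holds up on well-formed input.

```ruby
# The pattern from the answer above, unchanged.
URL_PATTERN = /(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))/

# Shortened, made-up sample: one plain url and one existing hyperlink.
text = 'visit http://someurl.com today <a href="http://kept.com">kept</a>'

# \1 refers to group 1, the url itself.
linked = text.gsub(URL_PATTERN, '<a href="\1">\1</a>')

puts linked
```

The url inside the existing href is left alone because the negative lookahead sees the closing > of the tag right after it.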
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. But it might be tricky if you have nested elements of the same type.
Maybe try http://rubular.com/; it has some regex tips that can help you get the desired output.
I would do something like this:
require 'nokogiri'
require 'uri'
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF
doc.search('*').each{|n| n.replace "\n"}
URI.extract doc.text
#=> ["http://someurl.com"]
I have a field in my table like:
tank, troublesome, athletic, powerback
That's a single string.
I'd like to comma separate these values and place them as CSS classes for an element. Here's my attempt:
<a href="<%= player_path(player) %>" class="player-item <% players.roles.split(",").each do |role| print role end %>">
But I get:
<a href="/players/1" class="player-item ">
Any suggestions?
You can just replace the commas with nothing:
<a href="<%= player_path(player) %>" class="player-item <%= player.roles.gsub(',', '') %>">
But I think there is a better solution that would involve refactoring your database. Having fields of comma delimited values is almost never a good idea :) but that is a different story.
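If the roles need to stay in one field for now, splitting and re-joining is a slightly more defensive alternative to gsub, since it does not depend on a space following each comma. This is only a sketch; role_classes is a hypothetical helper name, not something from the question.

```ruby
# role_classes is a hypothetical helper: it turns "a, b,c" into "a b c".
def role_classes(roles)
  roles.to_s.split(',').map(&:strip).reject(&:empty?).join(' ')
end

puts role_classes("tank, troublesome, athletic, powerback")
# prints: tank troublesome athletic powerback
```

In the view it would be called inside an output tag, for example class="player-item <%= role_classes(player.roles) %>".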
I have a text similar to this:
<p>some text ...</p><p>The post text... appeared first on some another text.</p>
I need to remove everything from <p>The post, so the results would be:
<p>some text ...</p>
I am trying to do that this way:
text.sub!(/^<p>The post/, '')
But it returns just an empty string... how to fix that?
Your regex is incorrect. It only matches a <p>The post that sits at the beginning of a line. You want the opposite: match from where it occurs to the end of the string. Check this out.
s = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'
s.sub(/<p>The\spost.*$/, '') # => "<p>some text ...</p>"
You have specified ^, which in Ruby matches only at the beginning of a line. You should do
text.sub!(/<p>The post.*$/, '')
Play with this in http://rubular.com/r/c91EbHN0Af
'^' matches only at the beginning of a line (here, the beginning of the whole string). Try doing
text.sub!(/<p>The post/, '')
EDIT just read it more carefully...
text.sub!(/<p>The post.*$/, '')
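A short sketch comparing the anchored and unanchored patterns on the sample text from the question. The variable names are illustrative. Note that sub! returns nil when nothing is replaced, which is easy to mistake for an empty result.

```ruby
text = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'

# With ^ the pattern can only match at the start of a line, so nothing changes.
anchored = text.sub(/^<p>The post/, '')

# Without the anchor, everything from <p>The post to the end is removed.
trimmed = text.sub(/<p>The post.*$/, '')

# sub! modifies in place and returns nil when there was no match.
bang_result = text.dup.sub!(/^<p>The post/, '')

p anchored == text
p trimmed
p bang_result
```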
<div id="Dossuuu11Plus" style="display: block; ">
Text need
<br/>
Not need
<a class="bot_link" href="http://abc.com" target="_self">http://abc.com</a>
<br/>
</div>
This is html code. I use: //td[@class='textdetaildrgI
but it gets all the content in the element; I just need "Text need". Please help me. Thanks
You could use
//div[@id='Dossuuu11Plus']/text()[1][normalize-space()]
Explanation:
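It selects the first text node of the DIV, which in this case is Text need. Note that the [normalize-space()] predicate does not trim anything: it only keeps the node if it is not blank. To get the trimmed string itself, wrap the path instead: normalize-space(//div[@id='Dossuuu11Plus']/text()[1]).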