How to get the content between the first HTML tag and the second HTML tag in Ruby - ruby

Hi friend~
I want to get the content between the first HTML tag and the second HTML tag.
For example,
<p>Hello <bold>world</bold>!</p>
should return
Hello
How can I do this in Ruby?
Thank you~

The regex would be: <[^>]*>([^<]*)
<[^>]*> - matches the first tag "<...>"
([^<]*) - captures the text up to the opening of the next tag: "<...> some text <...>"
How to apply it in Ruby, I don't know.
See http://www.regular-expressions.info/ruby.html

A regular expression catching everything between the first and the second pair of angle brackets looks like
/<.*?>(.*?)</m
The result will be in the first capturing group of the first match.
Note that this will probably fail on HTML comments and JavaScript.
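As a minimal sketch of how either pattern might be applied in Ruby: String#[] accepts a regex plus a capture-group index and returns that group from the first match. The helper name first_text_between_tags is made up for illustration, and stripping the surrounding whitespace is an assumption about the desired output.

```ruby
# Hypothetical helper (the name is illustrative, not from the answers above).
# Returns the text between the first tag and the next "<", or nil if no tag is found.
def first_text_between_tags(html)
  # Group 1 holds everything after the first "<...>" up to the next "<".
  text = html[/<[^>]*>([^<]*)/, 1]
  # Assumption: surrounding whitespace is not wanted ("Hello " becomes "Hello").
  text&.strip
end

puts first_text_between_tags("<p>Hello <bold>world</bold>!</p>")  # prints "Hello"
```

The same caveat applies as for the patterns above: comments, scripts, or a ">" inside an attribute value can confuse it, so it is only suitable for simple, known input.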

Related

xpath (//a[contains(., "Test-Test-P24-FA")]) fails to select

This is what I have
1005523 Test-Test-P24-FAID-EUR
I want to write an XPath for the <a> element based on its contained text. I tried the one below, but it does not work:
xpath=(//a[contains(., "Test-Test-P24-FA")])
However, when I try this one, it works.
xpath=(//a[contains(., "Test-Test-P24-FAI")])
Do you have any suggestions?
Thanks!
Try any of the XPath expressions below.
//a[contains(., 'Test-Test-P24-FAID')]
OR
//a[contains(text(), 'Test-Test-P24-FAID-EUR')]
OR
//a[text()='1005523 Test-Test-P24-FAID-EUR']
Explanation: use the text() node test on the <a> tag.
OR
//a[@class='managed_account_link being_setup'][text()='1005523 Test-Test-P24-FAID-EUR']
Explanation: use the class attribute together with text() on the <a> tag.
OR
//a[@style='background-color: transparent;'][text()='1005523 Test-Test-P24-FAID-EUR']
Explanation: use the style attribute together with text() on the <a> tag.

Regex encapsulate full line and surround it

I can find examples of surrounding a line but not surrounding and replacing, and I'm a bit new to Regex.
I'm trying to ease up my markdown, so that I do not need to add in html just to get it to center images.
With pandoc, I apparently need to surround an image with DIV tags to get it centered, right-justified, or whatever.
Instead of typing that every time, I'd like to just preprocess my markdown with a Ruby script and have Ruby add in the DIVs for me.
So I can type:
center![](image.jpg)
and then run a ruby script that will change it to
<div class="center">
![](image.jpg)
</div>
I want the regex to find "center!" and get rid of the word "center" and surround the rest with DIV tags.
How would I accomplish this?
A little example using gsub:
s = "a\ncenter![](image.jpg)\nb\n"
puts s.gsub(/^center(.*)$/, "<div class=\"center\">\n\\1\n</div>")
Result is:
a
<div class="center">
![](image.jpg)
</div>
b
That should get you started. The (.*) captures the content after center, and \\1 adds it back into the replacement. In this example I assumed that the item was on a line by itself: ^ indicates the start of a line and $ indicates the end of a line. If that isn't the case, you'll need to determine what makes your pattern unique so that it doesn't replace any random usage of "center" in your text.
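One way to make the pattern more specific, sketched here under assumptions: require the keyword to be followed directly by markdown image syntax, and use the block form of gsub so the keyword becomes the div class. The method name wrap_aligned_images and the keyword list (center, left, right) are illustrative choices, not something pandoc defines.

```ruby
# Assumption: an alignment keyword at the start of a line, immediately followed
# by a markdown image and nothing else (apart from trailing spaces or tabs).
ALIGN_IMAGE = /^(center|left|right)(!\[[^\]]*\]\([^)]*\))[ \t]*$/

# Hypothetical helper: wraps each matching line in a div named after the keyword.
def wrap_aligned_images(markdown)
  markdown.gsub(ALIGN_IMAGE) do
    # $1 is the keyword, $2 is the image markup itself.
    %(<div class="#{$1}">\n#{$2}\n</div>)
  end
end

sample = "a\ncenter![](image.jpg)\nThe center of town\nb\n"
puts wrap_aligned_images(sample)
```

With this sample, only the second line is rewritten; "The center of town" is left alone because it does not start with a keyword followed by an image.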

scrapy: Remove elements from an xpath selector

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?
I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack Overflow has not preserved the indenting, so I do not know where the first div element ends, but I'm going to guess it ends before the text.
In that case you might want
//div[@id = 'another-easy-id']/following::node()
[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]

How do I find matching <pre> tags using a regular expression?

I am trying to create a simple blog that has code enclosed in <pre> tags.
I want to display "read more" after the first closing </pre> tag is encountered, thus showing only the first code segment.
I need to display all text, HTML, code up to the first closing </pre> tag.
What I've come up with so far is the following:
/^(.*<\/pre>).*$/m
However, this matches every closing </pre> tag up to the last one encountered.
I thought something like the following would work:
/^(.*<\/pre>{1}).*$/m
It of course does not.
I've been using Rubular.
My solution, thanks to everyone's help:
require 'nokogiri'
module PostsHelper
  def readMore(post)
    doc = Nokogiri::HTML(post.message)
    intro = doc.search("div[class='intro']")
    result = Nokogiri::XML::DocumentFragment.parse(intro)
    result << link_to("Read More", post_path(post))
    result.to_html
  end
end
Basically in my editor for the blog I wrap the blog preview in div class=intro
Thus, only the intro is displayed with read more added on to it.
This is not a job for regular expressions, but for an HTML/XML parser.
Using Nokogiri, this will return all <pre> blocks as HTML, making it easy for you to grab the one you want:
require 'nokogiri'
html = <<EOT
<html>
<head></head>
<body>
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
pre_blocks = doc.search('pre')
puts pre_blocks.map(&:to_html)
Which will output:
<pre><p>block 1</p></pre>
<pre><p>block 2</p></pre>
You can capture all text up to the first closing pre tag by modifying your regular expression to
/^(.*?<\/pre>{1}).*$/m
This way you can get the matched text with
text.match(regex)[1]
which will return only the text up to the first closing pre tag.
Reluctant matching might help in your case:
/^(.*?<\/pre>).*$/m
But it's probably not the best way to do this; consider using an HTML parser such as Nokogiri.
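To illustrate the difference between the greedy and the reluctant quantifier, here is a small sketch on a made-up sample string. It anchors with \A (start of string) rather than ^, which is an assumption about the intent, and it keeps only the capture group.

```ruby
# Made-up sample: two code blocks separated by ordinary markup.
html = "<p>intro</p><pre>code 1</pre><p>more</p><pre>code 2</pre>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST </pre>.
greedy = html[/\A(.*<\/pre>)/m, 1]

# Reluctant: .*? expands one character at a time and stops at the FIRST </pre>.
preview = html[/\A(.*?<\/pre>)/m, 1]

puts greedy   # the whole sample string
puts preview  # <p>intro</p><pre>code 1</pre>
```

The {1} quantifier in the earlier patterns only applies to the final ">" and means "exactly once", so it does not change what is matched.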

Put each text surrounded by HTML tags into an array?

Using Nokogiri,
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").to_s
does the job; however, it puts everything into one flat string.
I need to take each piece of text surrounded by HTML tags
<b> text</b>
<h1>text3</h1>
and put them into an array: ["text", "text3"]
What is the recommended approach?
I thought of doing
doc.xpath("*").text
but I don't know how to iterate through it all.
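doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").to_a
This returns an array of text nodes rather than strings; use doc.xpath("//text()").map(&:text) to get the strings themselves.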
