Ruby - remove part of the text - ruby

I have a text similar to this:
<p>some text ...</p><p>The post text... appeared first on some another text.</p>
I need to remove everything from <p>The post, so the results would be:
<p>some text ...</p>
I am trying to do it this way:
text.sub!(/^<p>The post/, '')
But it returns just an empty string... how to fix that?

Your regex is incorrect. Because of the ^ anchor, it matches <p>The post only when it appears at the beginning of a line. You want the opposite: match from wherever <p>The post occurs to the end of the string:
s = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'
s.sub(/<p>The\spost.*$/, '') # => "<p>some text ...</p>"
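To make the difference concrete, here is a minimal sketch comparing the anchored and unanchored patterns. The variable names are only illustrative, and the \z anchor with the /m flag is an assumption added in case the real text spans several lines.

```ruby
text = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'

# Anchored with ^: "<p>The post" is not at the start of a line, so nothing matches.
# sub returns an unchanged copy, and sub! returns nil when no substitution happens.
anchored    = text.sub(/^<p>The post/, '')
bang_result = text.dup.sub!(/^<p>The post/, '')

# Unanchored: remove everything from "<p>The post" to the end of the string.
# /m lets . match newlines, and \z is the end of the whole string.
trimmed = text.sub(/<p>The post.*\z/m, '')

puts trimmed
```

Note that sub! modifies the receiver in place and returns nil when nothing was replaced, so its return value should not be used as the result.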

You have specified ^, which anchors the match to the beginning of a line (here, the beginning of your string). Drop the anchor and match through to the end instead:
text.sub!(/<p>The post.*$/, '')
Play with this in http://rubular.com/r/c91EbHN0Af

'^' anchors the match to the beginning of the string, which is not where your text is. Try
text.sub!(/<p>The post/, '')
EDIT: I just read the question more carefully. To remove everything from that point on, use
text.sub!(/<p>The post.*$/, '')

Related

Replacing <a> tags that have two pairs of double quotes

I have asked a similar question before, but this one is slightly different.
I have content that contains links like this:
Professor Steve Jackson
[UPDATE]
And this is how I read it:
content = doc.xpath("/wcm:root/wcm:element[@name='Body']").inner_text
The link has two pairs of double quotes after the href=.
I am trying to strip out the tag and retrieve only the text like so:
Professor Steve Jackson
To do this I'm using the same method which works for this sort of link which has only a single pair of double quotes:
World
This returns World:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^="ssLINK"]')
.each{|a| a.replace("<>#{a.content}</>")}
=>World
When I try to do the same for the link that has two pairs of double quotes, it complains:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^=""ssLINK""]')
.each{|a| a.replace("<>#{a.content}</>")}
Error:
/var/lib/gems/1.9.1/gems/nokogiri-1.6.0/lib/nokogiri/css/parser_extras.rb:87:in
`on_error': unexpected 'ssLINK' after '[:prefix_match, "\"\""]' (Nokogiri::CSS::SyntaxError)
Does anyone know how I can overcome this issue?
I can suggest two ways to do it, depending on whether every <a> tag has an href enclosed in two pairs of double quotes ("") or only the ssLINK one does.
Assume
output = []
input_text = 'Professor Steve Jackson'
1) If only the ssLINK <a> tags have an href wrapped in "", then just do
Nokogiri::HTML(input_text).css('a[href=""]').each do |nokogiri_obj|
output << nokogiri_obj.text
end
# => output = ["Professor Steve Jackson"]
2) If all the <a> tags have an href wrapped in "", then you can try this
nokogiri_a_tag_obj = Nokogiri::HTML(input_text).css('a[href=""]')
nokogiri_a_tag_obj.each do |nokogiri_obj|
output << nokogiri_obj.text if nokogiri_obj.has_attribute?('sslink')
end
# => output = ["Professor Steve Jackson"]
With this second approach, if
input_text = 'Professor Steve Jackson Some other TextSecond link'
then the output will also be ["Professor Steve Jackson"]
Your content is not XML, so any attempt to solve the problem using XML tools such as XSLT and XPath is doomed to failure. Use a regex approach, e.g. awk or Perl. However, it's not immediately obvious to me how to match
<a href="" sometext"">
without also matching
<a href="" sometext="">
so we need to know a bit more about this syntax that you are trying to parse.
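As a rough sketch of the regex approach suggested above: the original markup was not preserved in the question, so the sample string and the DOUBLE_QUOTED_LINK name below are hypothetical. It assumes the href value is wrapped in exactly two pairs of double quotes and that links are not nested.

```ruby
# Hypothetical sample input; the real markup from the question is not shown.
content = 'See <a href=""ssLINK/steve-jackson"">Professor Steve Jackson</a> for details.'

# Match an <a> tag whose href value is wrapped in doubled quotes and
# capture only the link text between the opening and closing tags.
DOUBLE_QUOTED_LINK = /<a\s+href=""[^>]*?"">(.*?)<\/a>/m

# Replace each matched link with its captured text.
stripped = content.gsub(DOUBLE_QUOTED_LINK, '\1')

puts stripped
```

This sidesteps the parser entirely, which seems reasonable here because the doubled quotes make the markup invalid as both XML and a CSS attribute selector.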

I need a regex to find a url which is not inside any html tag or an attribute value of any html tag

I have HTML content in the following text.
"This is my text to be parsed which contains url
http://someurl.com?param1=foo&params2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
I need a regex that converts plain URLs to hyperlinks (without tampering with existing hyperlinks).
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo&params2=bar">
http://someurl.com?param1=foo&params2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span> "
Disclaimer: You shouldn't use regex for this task; use an HTML parser. This is a proof of concept to demonstrate that it's possible if you can expect well-formed HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean?
( : start of group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : optionally match w, ww or www followed by any one character (the dot is unescaped)
[^\s]*? : match anything except whitespace zero or more times, ungreedy
(?:\.[a-z]+)+ : match a dot followed by [a-z] character(s), repeated one or more times
) : end of group 1
(?! : negative lookahead
[^<]*? : match anything except < zero or more times, ungreedy
(?:<\/\w+>|\/?>) : match a closing tag, /> or >
) : end of lookahead
regex101 online demo
rubular online demo
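For illustration, this is how the pattern above could be applied from Ruby. The sample string is made up for this sketch, and it shares the limitation stated in the disclaimer: it only behaves sensibly on reasonably well-formed HTML.

```ruby
# The pattern from the answer above, written as a Ruby regex literal.
PLAIN_URL = /(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))/

# Made-up sample: one bare URL and one URL inside an href attribute.
html = 'visit http://someurl.com now <a href="http://kept.com">t</a>'

# scan returns one array per match (one entry per capture group), so flatten it.
found = html.scan(PLAIN_URL).flatten

# Wrap each bare URL in an anchor; the URL inside href is left alone.
linked = html.gsub(PLAIN_URL) { |url| %(<a href="#{url}">#{url}</a>) }

puts found.inspect
puts linked
```

Because the pattern stops at the last dot-plus-letters segment, a query string such as ?param1=foo would not be part of the match, which is worth keeping in mind for the URLs in the question.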
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. It might be tricky if you have nested elements of the same type, though.
Maybe try http://rubular.com/; it has some regex tips that can help you get the desired output.
I would do something like this:
require 'nokogiri'
require 'uri' # needed for URI.extract
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF
doc.search('*').each{|n| n.replace "\n"}
URI.extract doc.text
#=> ["http://someurl.com"]
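If pulling in a parser is not an option, a small tokenizer can get closer to the expected output in the question than a single regex. The following is only a sketch using the standard library: linkify_top_level_urls, URL_PATTERN and VOID_ELEMENTS are names invented here, and it assumes well-formed markup in which every non-void element is closed.

```ruby
# Elements that never have a closing tag, so they must not increase the depth.
VOID_ELEMENTS = %w[area base br col embed hr img input link meta source track wbr].freeze

# Deliberately simple URL pattern (an assumption, not a full URL grammar).
URL_PATTERN = %r{https?://[^\s<>"]+}

def linkify_top_level_urls(html)
  depth = 0
  # Split the input into tags and runs of text, keeping everything in order.
  html.scan(/<[^>]+>|[^<]+/).map do |token|
    if token.start_with?('</')
      depth -= 1 if depth > 0
      token
    elsif token.start_with?('<')
      name = token[/\A<\s*([a-zA-Z0-9]+)/, 1].to_s.downcase
      # Comments and doctypes have no element name and are ignored for depth.
      depth += 1 unless name.empty? || token.end_with?('/>') || VOID_ELEMENTS.include?(name)
      token
    elsif depth.zero?
      # Text outside every element: wrap each URL in an anchor.
      token.gsub(URL_PATTERN) { |url| %(<a href="#{url}">#{url}</a>) }
    else
      token
    end
  end.join
end

html = 'see http://someurl.com?param1=foo <a href="http://keep.example">text http://inner.example</a> ' \
       '<img src="http://img.example/a.jpeg"/> <span>http://span.example</span>'
result = linkify_top_level_urls(html)
puts result
```

URLs inside attribute values are never touched because they live inside tag tokens, and URLs inside any element (including the span) are skipped because the depth is non-zero there.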

xpath: Picking tag after text

How would one, via xpath, select the strong tag after baz text for example?
<p>
<br>foo<strong>this foo</strong>
<br>bar<strong>this bar</strong>
<br>baz<strong>this baz</strong>
<br>qux<strong>this qux</strong></p>
Obviously the following does not work....
//p[text() = 'baz']/following-sibling::select[1]
Try this
//p/text()[. = 'baz']/following-sibling::strong[1]
Demo here - http://www.xpathtester.com/obj/b67bad4d-4d38-4e2d-a3df-b7e5a2e9f286
This solution relies on there being no whitespace around your text nodes. If you start using indentation or other whitespace characters, switch to the following:
//p/text()[normalize-space(.) = 'baz']/following-sibling::strong[1]

how can I make xpath match whitespace?

I can match sometext and othertext in
<br>
sometext
<br>
othertext
using xpath selector '//br/following-sibling::text()'
but if there is only whitespace after the <br> element
<br>
<br>
othertext
only the second match occurs. Is it possible to match whitespace as well?
I tried
//br/following-sibling::matches(., "\s+")
to attempt to match whitespace without success.
'matches' tests a string against a regular expression; it does not select nodes, so it can't be used as a node test after an axis specifier. It is also an XPath 2.0 function. You could use it as a condition instead:
//br/following-sibling::text()[matches(., "\s+")]
Or, without regexes (which might be faster, depending on the implementation), check that the node is all whitespace and not the empty string:
//br/following-sibling::text()[(normalize-space(.) = "") and (. != "")]

To get text after the tag, containing another text

For example:
<p>
<b>Member Since:</b> Aug. 07, 2010<br><b>Time Played:</b> <span class="text_tooltip" title="Actual Time: 15.09:37:06">16 days</span><br><b>Last Game:</b>
<span class="text_tooltip" title="07/16/2011 23:41">1 minute ago</span>
<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br><b>Frags / Deaths:</b> 26,955 / 42,553<br><b>Hits / Shots:</b> 690,695 / 4,229,566<br><b>Accuracy:</b> 16%<br>
</p>
I want to get 1,017. It is the text after the tag that contains the text Wins:.
If I used regex, it would be [/<b>Wins:<\/b> ([^<]+)/,1], but how do I do it with Nokogiri and XPath?
Or would it be better to parse this part of the page with regex?
Here
doc = Nokogiri::HTML(html)
puts doc.at('b[text()="Wins:"]').next.text
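You can use this XPath: //*[*/text() = 'Wins:']/text(). Note that it returns every text node child of the matching element, not only 1,017, so you still have to pick out the right one.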
About regex: RegEx match open tags except XHTML self-contained tags
I would use pure XPath like:
"//b[.='Wins:']/following::node()[1]"
I've heard thousands of times (and from gurus) "never use regex to parse XML". Can you provide some "shocking" reference demonstrating that this sentence is not valid any more?
Use:
//*[. = 'Wins:']/following-sibling::node()[1]
In case this is ambiguous (selects more than one node), more strict expressions can be specified:
//*[. = 'Wins:']/following-sibling::node()[self::text()][1]
Or:
(//*[. = 'Wins:'])[1]/following-sibling::node()[1]
Or:
(//*[. = 'Wins:'])[1]/following-sibling::node()[self::text()][1]
