replacing html tag and its content using ruby gsub - ruby

I am trying to replace a <p>...</p> tag and its content in an HTML string with an empty string by doing the following.
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "
When I did
string.gsub!(/<p.*?>|<\/p>/, '')
It only replaced the <p> and </p> tags with an empty string, but the content between them remained. How can I remove both the tags and their content?

Apparently, your regex does not match <p>...</p> (<p> and its content). Try this:
string.gsub!(/<p>.*<\/p>/, '')
test = '\n <img alt=\"testing artice breaking news\" src=\"something.com" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "'
test.gsub(/<p>.*<\/p>/, '')
Returns:
"\\n <img alt=\\\"testing artice breaking news\\\" src=\\\"something.com\" />\\n \\n \""
Also, please consider @Tom Lord's comment: you can use Nokogiri to manipulate HTML.

First of all, consider using HTML parsers when parsing HTML, see How do I remove a node with Nokogiri?.
If you want to do it with a regex, you can use
string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
See the Rubular regex demo. This will work with tags that cannot be nested. Details:
<p(?:\s[^>]*)?> - <p, and an optional sequence of a whitespace and zero or more chars other than > (as many as possible), and then >
.*? - due to /m, any zero or more chars as few as possible
<\/p> - </p> string.
If the tags can be nested, you still can use a regex:
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
See the Rubular regex demo. Details:
<#{tagname} - < and tag name
(?:\s[^>]*)?> - an optional sequence of a whitespace and then zero or more chars other than >, and then >
(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)* - zero or more occurrences of
(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)* - zero or more chars other than <, then zero or more sequences of a < that is not followed by the tag name plus whitespace or >, nor by / + tag name + >, each such < itself followed by zero or more chars other than <
|
\g<0> - the whole regex pattern recursed
<\/#{tagname}> - </ + tag name + >.
See a Ruby demo:
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"
p string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/m
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"

Related

Xpath text between tags

Any idea how I would get the text between two tags using XPath? Specifically, the "3", " bd, ", "1", " ba" pieces.
<p class="MuiTypography-root RoofCard__RoofCardNameStyled-niegej-8 hukPZu MuiTypography-body1" xpath="1">
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md">$65,000</span></p>
"3" == $0
" bd, " == $0
"1" == $0
" ba | " == $0
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md" xpath="1">926</span>
tried:
In fact, from your sample, that's a simple text() node after the p:
//p/following-sibling::text()[1]
but of course you'll need to parse it. This will return almost exactly what you need:
values = response.xpath('//p/following-sibling::text()[1]').re(r'"([^"]+)"')

RegEx code works in theory but not when code is run

I'm trying to use this RegEx search in Ruby: <div class="ms3">(\n.*?)+<. However, as soon as I add the last character "<" it stops working altogether. I've tested it in Rubular and the RegEx works perfectly fine. I'm using RubyMine to write my code, but I also tested it using PowerShell and got the same results: no error message. When I run <div class="ms3">(\n.*?)+ it prints <div class="ms3">, which is exactly what I'm looking for, but as soon as I add the "<" it comes out with nothing.
my code:
#!/usr/bin/ruby
# encoding: utf-8
File.open('ms3.txt', 'w') do |fo|
  fo.puts File.foreach('input.txt').grep(/<div class="ms3">(\n.*?)+/)
end
some of what i'm searching through:
<div class="ms3">
<span xml:lang="zxx"><span xml:lang="zxx">Still the tone of the remainder of the chapter is bleak. The</span> <span class="See_In_Glossary" xml:lang="zxx">DAY OF THE <span class="Name_Of_God" xml:lang="zxx">LORD</span></span> <span xml:lang="zxx">holds no hope for deliverance (5.16–18); the futility of offering sacrifices unmatched by common justice is once more underlined, and exile seems certain (5.21–27).</span></span>
</div>
<div class="Paragraph">
<span class="Verse_Number" id="idAMO_5_1" xml:lang="zxx">1</span><span class="scrText">Listen, people of Israel, to this funeral song which I sing over you:</span>
</div>
<div class="Stanza_Break"></div>
The full RegEx I need is <div class="ms3">(\n.*?)+<\/div>; it picks up the first section and nothing else.
Your problem starts with using File.foreach('input.txt'), which cuts the input into lines. This means the pattern is matched against each line separately, so none of the lines match the pattern (by definition, no single line has a \n in the middle).
You should have better luck reading the whole text as a block, and using match on it:
File.read('input.txt').match(/<div class="ms3">(\n.*?)+<\/div>/)
# => #<MatchData "<div class=\"ms3\">\n <span xml:lang=\"zxx\">
# => <span xml:lang=\"zxx\">Still the tone of the remainder of the chapter is bleak. The</span>
# => <span class=\"See_In_Glossary\" xml:lang=\"zxx\">DAY OF THE
# => <span class=\"Name_Of_God\" xml:lang=\"zxx\">LORD</span></span>
# => <span xml:lang=\"zxx\">holds no hope for deliverance (5.16–18);
# => the futility of offering sacrifices unmatched by common justice is once more
# => underlined, and exile seems certain (5.21–27).</span></span>\n </div>" 1:"\n ">
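The same point can be shown on an in-memory string with String#scan and the /m flag, so each multi-line div block matches (the input below is a shortened, hypothetical stand-in for input.txt):

```ruby
text = <<~HTML
  <div class="ms3">
  first block
  </div>
  <div class="Paragraph">unrelated</div>
  <div class="ms3">
  second block
  </div>
HTML

# /m lets . match newlines; the lazy .*? stops at the first </div>,
# so one match never spans two "ms3" divs.
blocks = text.scan(/<div class="ms3">.*?<\/div>/m)
blocks.length  # => 2
```

Line-by-line grep can never find these matches, because every match spans several lines.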

Scrapy can't find XPath content

I'm writing a web crawler with Scrapy to download the text of talk-backs on a certain webpage.
Here is the relevant part of the code behind the webpage, for a specific talkback:
<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
<div class="talkback-topic">
<a class="show-comment" data-ajax-url="/comments/71339.js?counter=97&num=57" href="/comments/71339?counter=97&num=57">57. talk back title here </a>
</div>
<div class="talkback-message"> blah blah blah talk-back message here </div>
....etc etc etc ......
While writing an XPath to get the message:
titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
and later on:
item["title"] = titles.xpath("div[@class='talkback-message']text()").extract()
There's no bug, but it doesn't work. Any ideas why? I suppose I'm not writing the path correctly, but I can't find the error.
Thank you :)
The whole code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.xpath("div[@class='talkback-message']text()").extract()
            items.append(item)
        return items
Here's a snippet of the HTML page for #site_comment_74240
<div class="site_comment site_comment-even small normal-rank" id="site_comment_74240">
<div class="talkback-topic">
144. מדיניות
</div>
<div class="talkback-username">
<table><tr>
<td>קייזרמן פרדי </td>
<td>(01.11.2013)</td>
</tr></table>
</div>
The "talkback-message" div is not in the HTML page when you first fetch it, but rather is fetched asynchronously via some AJAX query when you click on a comment title, so you'll have to fetch it for each comment.
Comment blocks, titles in your code snippet, can be grabbed using an XPath like this: //div[starts-with(@id, "site_comment_")], i.e. all divs that have an "id" attribute beginning with the string "site_comment_".
You can also use CSS selectors with Selector.css(). In your case, you can grab comment blocks either via the "id" approach (as I've done above using XPath):
titles = sel.css("div[id^=site_comment_]")
or using the "site_comment" class without the other "site_comment-even", "site_comment-odd", "small", "normal-rank" or "high-rank" that vary:
titles = sel.css("div.site_comment")
Then you would issue a new Request using the URL that's in ./div[@class="talkback-topic"]/a[@class="show-comment"]/@data-ajax-url inside that comment div. Or, using CSS selectors, div.talkback-topic > a.show-comment::attr(data-ajax-url) (by the way, ::attr(...) is not standard CSS, but a Scrapy extension to CSS selectors using pseudo-element functions).
What you get from the AJAX call is some Javascript code, and you want to grab the content inside old.after(...)
var old = $("#site_comment_72765");
old.attr('id', old.attr('id') + '_small');
old.hide();
old.after("\n<div class=\"site_comment site_comment-odd large high-rank\" id=\"site_comment_72765\">\n <div class=\"talkback-topic\">\n <a href=\"/comments/72765?counter=42&num=109\" class=\"show-comment\" data-ajax-url=\"/comments/72765.js?counter=42&num=109\">109. ביבי - האדם הנכון בראש ממשלת ישראל(לת)<\/a>\n <\/div>\n \n <div class=\"talkback-message\">\n \n <\/div>\n \n <div class=\"talkback-username\">\n <table><tr>\n <td>ישראל <\/td>\n <td>(11.03.2012)<\/td>\n <\/tr><\/table>\n <\/div>\n <div class=\"rank-controllers\">\n <table><tr>\n \n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=up\"><img alt=\"\" src=\"/images/elements/thumbU.png?1376839523\" /><\/a><\/td>\n <td> | <\/td>\n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=down\"><img alt=\"\" src=\"/images/elements/thumbD.png?1376839523\" /><\/a><\/td>\n \n <td> | <\/td>\n <td>11<\/td>\n \n <\/tr><\/table>\n <\/div>\n \n <div class=\"talkback-links\">\n <a href=\"/comments/new?add_to_root=true&html_id=site_comment_72765&sibling_id=72765\">תגובה חדשה<\/a>\n \n <a href=\"/comments/72765/comments/new?html_id=site_comment_72765\">הגיבו לתגובה<\/a>\n \n <a href=\"/i/offensive?comment_id=72765\" data-noajax=\"true\">דיווח תוכן פוגעני<\/a>\n <\/div>\n \n<\/div>");
var new_comment = $("#site_comment_72765");
This is HTML data that you'll need to parse again, using something like Selector(text=this_ajax_html_data) and a .//div[@class="talkback-message"]//text() XPath or a div.talkback-message ::text CSS selector.
Here's a skeleton spider to get you going with these ideas:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem
import urlparse
import re

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        sel = Selector(response)
        comments = sel.css("div.site_comment")
        for comment in comments:
            item = CraigslistSampleItem()
            # this probably has to be fixed
            #item["title"] = comment.xpath("div[@class='talkback-message']text()").extract()
            # issue an additional request to fetch the Javascript
            # data containing the comment text,
            # and pass the incomplete item via the meta dict
            for url in comment.css('div.talkback-topic > a.show-comment::attr(data-ajax-url)').extract():
                yield Request(url=urlparse.urljoin(response.url, url),
                              callback=self.parse_javascript_comment,
                              meta={"item": item})
                break

    # the line we are looking for begins with "old.after"
    # and we want everything inside the parentheses
    _re_comment_html = re.compile(r'^old\.after\((?P<html>.+)\);$')

    def parse_javascript_comment(self, response):
        item = response.meta["item"]
        # loop over the Javascript content lines
        for line in response.body.split("\n"):
            matching = self._re_comment_html.search(line.strip())
            if matching:
                # what's inside the parentheses is a Javascript string
                # with escaped double-quotes;
                # a simple way to decode that into a Python string
                # is to use eval();
                # then there are these "<\/tag>" we want to remove
                html = eval(matching.group("html")).replace(r"<\/", "</")
                # once we have the HTML snippet, decode it using Selector()
                decoded = Selector(text=html, type="html")
                # and save the message text in the item
                item["message"] = u''.join(decoded.css('div.talkback-message ::text').extract()).strip()
                # and return it
                return item
You can try it out using scrapy runspider tbkspider.py.

clean/sanitize HTML, but preserve loose HTML chars with Ruby/Rails + Nokogiri + Sanitize (?)

We were using a combination of the Sanitize gem and HTMLEntities to do some clean up of user input HTML. The Sanitize gem used Hpricot, but now uses Nokogiri. I need to get Hpricot out of the app.
Here are two test strings, each followed by the output I'm expecting:
Test string 1:
"SOME TEXT < '<span style='background-image: url(\"http://evil.ru/webbug.png\")'>MORE' & TEXT!!!</span>"
expected_text = "SOME TEXT < 'MORE' & TEXT!!!"
Second test string (a slightly different path):
'Support <i>odd</i> chars like " < \' ‽'
expected_text = 'Support <i>odd</i> chars like " < \' ‽'
Is this something you've solved? What tools did you use?
You may want to try the Loofah gem:
Loofah.document("SOME TEXT < '<span style='background-image: url(\"http://evil.ru/webbug.png\")'>MORE' & TEXT!!!</span>").to_html
=> "SOME TEXT MORE' & TEXT!!!"
Loofah isn't handling the unicode character in the second example for some reason, but I'd be happy to look into it if you file a Github Issue on Loofah (full disclosure: I'm the author of Loofah and co-author of Nokogiri).
Some more links:
http://rubydoc.info/github/flavorjones/loofah/master/frames
https://github.com/flavorjones/loofah

How to filter certain words from selected text using XPath?

To select the text here:
Alpha Bravo Charlie Delta Echo Foxtrot
from this HTML structure:
<div id="entry-2" class="item-asset asset hentry">
<div class="asset-header">
<h2 class="asset-name entry-title">
<a rel="bookmark" href="http://blahblah.com/politics-democrat">Pelosi Q&A</a>
</h2>
</div>
<div class="asset-content entry-content">
<div class="asset-body">
<p>Alpha Bravo Charlie Delta Echo Foxtrot</p>
</div>
</div>
</div>
I apply following XPath expression to select the text inside asset-body:
//div[contains(
div/h2[
contains(concat(' ',#class,' '),' asset-name ')
and
contains(concat(' ',#class,' '),' entry-title ')
]/a[#rel='bookmark']/#href
,'democrat')
]/div/div[
contains(concat(' ',#class,' '),' asset-body ')
]//text()
How would I sanitize the following words from the text:
Alpha
Charlie
Echo
So that I end up with only the following text in this example:
Bravo Delta
With XPath 1.0, supposing unique NMTokens:
concat(substring-before(concat(' ',$Node,' '),' Alpha '),
substring-after(concat(' ',$Node,' '),' Alpha '))
As you can see, this becomes very verbose (and performs badly).
With XPath 2.0:
string-join(tokenize($Node,' ')[not(.=('Alpha','Charlie','Echo'))],' ')
How would I sanitize the following words from the text:
Alpha
Charlie
Echo
So that I end up with only the following text in this example:
Bravo Delta
This can't be done in XPath 1.0 alone -- you'll need to get the text in the host language and do the replacement there.
In XPath 2.0 one can use the replace() function:
replace(replace(replace($vText, ' Alpha ', ''), ' Charlie ', ''), ' Echo ', '')
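In the host language, the same filtering is a one-liner; here's a Ruby sketch matching the language of the surrounding answers (names are illustrative, and the text would come from the XPath selection, e.g. node.text with Nokogiri):

```ruby
text = "Alpha Bravo Charlie Delta Echo Foxtrot"  # e.g. the selected text node
banned = %w[Alpha Charlie Echo]

# split on whitespace, drop the banned words, rejoin
result = text.split.reject { |word| banned.include?(word) }.join(' ')
# => "Bravo Delta Foxtrot"
```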
