Delete regex phrase from text - ruby

I have news text that contains HTML attributes I don't need. How can I delete phrases in Ruby such as
img width="750" alt="4.jg" c="/unload/medialiy/df6/4.jg" height="499"
title=4.jg"
img width="770" alt="5.jg" c="/unload/medialiy/ty6/5.jg"
height="499" title=5.jg"
So I need a regex, something like news.sub(/img*jg"/, ''), but it doesn't work.

I would use:
img .*\.jg"
If you want to say "any characters, in any quantity" in a regex, use .* The dot means any character (except a newline, by default) and the star means any quantity, including zero.
But are you sure you don't want to include the angle brackets?
<img .*\.jg">
As an aside, what if the order of the attributes changes? Then the pattern above will fail to match the img tag. What we really need is an img tag with the .jg" substring anywhere inside it:
<img [^>]*\.jg"[^>]*>
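A minimal sketch of how the last pattern could be applied in Ruby. It assumes the real text does contain the angle brackets; the helper name strip_jg_images and the sample string are made up for illustration. It also shows why the greedy .* version is risky when two tags share a line.

```ruby
# Hypothetical helper: removes every <img ...> tag that contains .jg" somewhere inside it.
# [^>]* cannot cross a closing bracket, so each match stays within a single tag.
def strip_jg_images(text)
  text.gsub(/<img [^>]*\.jg"[^>]*>/, '')
end

# Made-up sample with two tags on one line.
news = 'A <img alt="4.jg" title="4.jg"> B <img alt="5.jg" title="5.jg"> C'

puts strip_jg_images(news)            # only the two tags are removed
puts news.gsub(/<img .*\.jg">/, '')   # greedy .* also swallows the " B " between the tags
```

Use gsub rather than sub here: sub replaces only the first match.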

In your particular case you can do this:
element = '<img width="750" alt="4.jg" c="/unload/medialiy/df6/4.jg" height="499" title="4.jg">'
puts element.gsub(/(width|alt)=\"[^ ]+\" ?/, '')
You can also play around with this regex here.
But if you need a more robust solution, take a look at the Nokogiri gem. This SO question can help.

Related

CSS selector for a single attribute out of multiple attributes

How do you select a single attribute within an element?
<img src="xyz.jpg" title="xyz" alt="xyz">
I just need the img src.
This seems to be an overlooked question, as every implementation I assumed would work still yields the entire tag.
Assuming your element is actually something like
<img src="xyz.jpg" title="xyz" alt="xyz">yo!</img>
the XPath expression
//img/@src
should get you xyz.jpg.
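As a rough illustration, the attribute can be selected with REXML, the XML library that ships with Ruby. This is only a sketch under the answer's assumption that the markup is well-formed XML (with the closing tag); real-world HTML would more likely call for a parser such as Nokogiri.

```ruby
require 'rexml/document'

# The markup from the answer; REXML needs well-formed XML, so the closing tag matters.
xml = '<img src="xyz.jpg" title="xyz" alt="xyz">yo!</img>'
doc = REXML::Document.new(xml)

# //img/@src selects the src attribute node itself, not the whole element.
src = REXML::XPath.first(doc, '//img/@src')
puts src.value
```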

Regex encapsulate full line and surround it

I can find examples of surrounding a line but not surrounding and replacing, and I'm a bit new to Regex.
I'm trying to ease up my markdown, so that I do not need to add in html just to get it to center images.
With pandoc, I apparently need to surround an image with DIV tags to get it centered, right-justified, or whatever.
Instead of typing that every time, I'd like to preprocess my markdown with a Ruby script and have Ruby add the DIVs for me.
So I can type:
center![](image.jpg)
and then run a ruby script that will change it to
<div class="center">
![](image.jpg)
</div>
I want the regex to find "center!" and get rid of the word "center" and surround the rest with DIV tags.
How would I accomplish this?
A little example using gsub:
s = "a\ncenter![](image.jpg)\nb\n"
puts s.gsub(/^center(.*)$/, "<div class=\"center\">\n\\1\n</div>")
Result is:
a
<div class="center">
![](image.jpg)
</div>
b
That should get you started. The (.*) captures the content after center, and \\1 inserts it back into the replacement. In this example I assumed that the item is on a line by itself: ^ indicates the start of a line and $ indicates the end of a line. If that isn't the case, you'll need to determine what makes your pattern unique so that it doesn't replace any random usage of "center" in your text.
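One way to make the pattern more specific, sketched under the assumption that the keyword is always immediately followed by markdown image syntax. The list of keywords (center, right, left) and the method name wrap_aligned_images are illustrative choices, not something pandoc prescribes.

```ruby
# Hypothetical set of keywords that double as CSS class names.
ALIGNMENTS = %w[center right left].freeze

# A keyword at the start of a line, immediately followed by ![...](...) up to the end of the line.
ALIGNED_IMAGE = /^(#{ALIGNMENTS.join('|')})(!\[.*\]\(.*\))$/

def wrap_aligned_images(markdown)
  # \\1 is the keyword, \\2 is the image markup without the keyword.
  markdown.gsub(ALIGNED_IMAGE, "<div class=\"\\1\">\n\\2\n</div>")
end

sample = "a\ncenter![](image.jpg)\ncenter of town\nright![alt](b.png)\n"
puts wrap_aligned_images(sample)
```

Because the pattern requires the ![ right after the keyword, a prose line such as "center of town" is left alone.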

scrapy: Remove elements from an xpath selector

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?
I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack overflow has not preserved the indenting, so I do not know where the end of the first div element is, but I'm going to guess it ends before the text.
In that case you might want
//div[@id = 'another-easy-id']/following::node()
[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]

how to get the content between the first html tag and the second html tag in ruby

Hi friend~
I want to get the content between the first HTML tag and the second HTML tag.
For example,
<p>Hello <bold>world</bold>!</p>
should return
Hello
What should I do in Ruby?
Thank you~
The regex would be: <[^>]*>([^<]*)
<[^>]*> - matches the first tag "<...>"
([^<]*) - captures the text up to the opening of the next tag "<...> some text <...>"
I don't know how to apply it in Ruby;
see http://www.regular-expressions.info/ruby.html
A regular expression catching everything between the first and the second pair of angle brackets looks like
/<.*?>(.*?)</m
The result will be in the first capturing group of the first match.
Note that this will probably fail on HTML comments and JavaScript.
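Applying such a pattern in Ruby could look like the sketch below. The helper name text_after_first_tag is hypothetical; String#[] with a regex and a group index returns that capture, or nil when nothing matches.

```ruby
# Hypothetical helper: returns the text between the first tag and the next "<",
# or nil when the string contains no tag at all.
def text_after_first_tag(html)
  html[/<[^>]*>([^<]*)/, 1]
end

html = '<p>Hello <bold>world</bold>!</p>'

puts text_after_first_tag(html).inspect   # the capture keeps the trailing space
puts text_after_first_tag(html).strip     # strip it to get just the word
```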

put each text surrounded via html tag, into an array?

Using Nokogiri,
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").to_s
This does the job, but it puts everything into one flat string.
I need to take each piece of text surrounded by HTML tags
<b> text</b>
<h1>text3</h1>
and put them into an array: ["text", "text3"]
What is the recommended approach?
I thought of doing
doc.xpath("*").text
but I don't know how to iterate through it all.
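require 'nokogiri'
doc = Nokogiri::HTML(your_html)
# to_a alone returns text nodes; map them to stripped strings and drop whitespace-only ones
doc.xpath("//text()").map { |node| node.text.strip }.reject(&:empty?)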
