Split using multiple keywords using regex - ruby

I have a string containing the following (in reality it has no line breaks):
<td class="coll-1 name">
<i class="flaticon-divx"></i>
SAME stuff here
<span class="comments"><i class="flaticon-message"></i>1</span>
</td>
I want to split this string on href=" and /"> specifically and store the pieces in an array. How can I do that? I have tried this:
new_array=my_string.split(/ href=" , \/">/)
Edit:
.split(/href="/)
This works well, but it does not handle the other part.
.split(/\/">/)
Similarly, this works too, but I am unable to combine the two into one line.

Given this string:
string = <<-HTML
<td class="coll-1 name">
<i class="flaticon-divx"></i>
SAME stuff here
<span class="comments"><i class="flaticon-message"></i>1</span>
</td>
HTML
and assuming that the correct link is the one without icon class, you could use the CSS selector a:not(.icon), for example via Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(string)
doc.at_css('a:not(.icon)')[:href]
#=> "/torrent/2349324/some-stuuf-here/"

You can take advantage of lookahead and lookbehind, like this:
my_string.scan(/(?<=href=").*(?=\/">)/)
#=> ["/torrent/2349324/some-stuuf-here"]
This will return an array with all occurrences of href=" ... /"> with only the ... part (which can be any string).
Or you can get everything that matches href=".../"> and then remove href=" and the trailing /">, something like this:
my_string.scan(/(?:href=".*\/">)/).map { |e| e.gsub(/(href="|\/">)/, "") }
#=> ["/torrent/2349324/some-stuuf-here"]
This will return an array of all instances that match /href=".*\/">/.
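One caveat is worth sketching: .* is greedy, so if a single line holds more than one href=" ... /"> pair, the first pattern may run from the first href=" to the last /">. The sample line below is hypothetical (it is not the string from the question), and the lazy quantifier .*? is a suggested variation rather than part of the answer above.

```ruby
# Hypothetical one-line input with two links; not the string from the question.
line = '<a href="/a/1/">one</a> <a href="/b/2/">two</a>'

# Greedy: .* runs to the last /"> on the line, so both links collapse into one match.
greedy = line.scan(/(?<=href=").*(?=\/">)/)

# Lazy: .*? stops at the first /"> after each href=", giving one match per link.
lazy = line.scan(/(?<=href=").*?(?=\/">)/)

p greedy.length  # 1
p lazy           # ["/a/1", "/b/2"]
```

If the real input only ever contains one such link per string, the greedy and lazy forms should behave the same.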
How do I split using 2 keywords using regex?
You can use a | to denote an or in regex. Note that the / inside the regex literal must be escaped:
my_string.split(/href="|\/">/)
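As a minimal sketch of splitting on either delimiter, assuming a made-up one-line string (the real markup in the question may differ):

```ruby
# Hypothetical one-line input; not the string from the question.
my_string = '<a href="/torrent/123/some-name/">Some name</a>'

# Either delimiter ends a piece; the slash inside the regex literal is escaped.
parts = my_string.split(/href="|\/">/)

p parts  # ["<a ", "/torrent/123/some-name", "Some name</a>"]
```

Writing the pattern as %r{href="|/">} is an equivalent form that avoids escaping the slash.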

Related

How to split by HTML tags using a regex

I have a string like this:
"Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39"
and I want to split it by its HTML tags, which are always <span>. I tried something like:
my_string.split(/<span(.*)span>/)
but it didn't work, it only matched the first element correctly.
Does anyone know what is wrong with my regex? In this example, I expected the returned value to be:
["Energia Elétrica kWh", "10.942", "0,74999294" ,"8.206,39"]
I would like something like strip_tags, but instead of returning the string sanitized, get the array split by the tags removed.
Don't use a pattern to manipulate HTML. It's a path destined to make you insane.
Instead, use an HTML parser. The standard for Ruby is Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39")
You could use text to extract all the text, but if you are after structured data, that often makes the fields hard to extract: adjacent text nodes can be concatenated into run-on words, so be careful there:
doc.text # => "Energia Elétrica kWh 10.942 0,74999294 8.206,39"
Instead we typically extract the data from individual nodes:
doc.search('span')[1].next_sibling.text # => " 0,74999294 "
doc.search('span').last.next_sibling.text # => " 8.206,39"
Or, we iterate over the nodes, then use map to grab the node's text:
doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["10.942", "0,74999294", "8.206,39"]
I'd go about the problem like this:
data = [doc.at('span').previous_sibling.text.strip] # => ["Energia Elétrica kWh"]
data += doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
Or:
spans = doc.search('span')
data = [
spans.first.previous_sibling.text,
*spans.map{ |span| span.next_sibling.text }
].map(&:strip)
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
While a regular expression can often work on an initial attempt, a change in the format of the HTML can break the pattern, forcing an additional change, then another change, and then another, until the pattern is too convoluted, whereas a properly written parser approach will typically be very resilient and immune to the problem.
If you really need to use regex to do this, you pretty much had it already.
irb(main):010:0> string.split(/<span.+?span>/)
=> ["Energia Elétrica kWh", " 10.942 ", " 0,74999294 ", " 8.206,39"]
You just needed the ? to tell it to match as little as possible.
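To make the greedy versus non-greedy difference concrete, here is a small sketch using the string from the question. The trailing map(&:strip) is an addition, used here to get the trimmed values the question asked for.

```ruby
string = "Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39"

# Greedy: .* swallows everything from the first <span to the last span>.
greedy = string.split(/<span.*span>/)

# Non-greedy: .+? stops at the nearest span>, so each tag pair is one delimiter.
fields = string.split(/<span.+?span>/).map(&:strip)

p greedy  # ["Energia Elétrica kWh", " 8.206,39"]
p fields  # ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
```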

xpath getting the name in a certain pattern

I want to get a class name like the following:
class="hostHostGrid0_body"
The integer in between hostHostGrid and _body can change, but everything else I want it just like that in the order.
How can I achieve this?
In XPath 1.0 you can use this:
//*[starts-with(@class,'hostHostGrid') and substring-after(@class,'_') = 'body']
to select any element whose class attribute holds a single class of that form. It matches elements in any context, including all three elements below:
<div class="hostHostGrid0_body">
<span class="hostHostGrid123_body"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
Limitations: it doesn't restrict what is between them to a number. It can be anything, including spaces (ex: it will also match this: class="hostHostGrid xyz abc_body")
This one allows for the class occurring among other classes:
//*[contains(substring-before(@class,'_body'),'hostHostGrid')]
It will match:
<div class="other-class hostHostGrid0_body">
<span class="hostHostGrid123_body other-class"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
(it also has the same limitations - will match anything between 'hostHostGrid' and '_body')
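If the "anything between" limitation matters, one option is to select candidates with XPath and then check the class value in the host language. The sketch below is plain Ruby and only covers that second step; the helper name grid_body_class? and the constant GRID_CLASS are hypothetical, and it assumes the class attribute has already been read as a string.

```ruby
# Matches a single class token: hostHostGrid, one or more digits, then _body.
GRID_CLASS = /\AhostHostGrid\d+_body\z/

# Hypothetical helper: true if any whitespace-separated class token matches.
def grid_body_class?(class_attr)
  class_attr.to_s.split.any? { |token| token.match?(GRID_CLASS) }
end

p grid_body_class?("hostHostGrid0_body")                # true
p grid_body_class?("other-class hostHostGrid123_body")  # true
p grid_body_class?("hostHostGrid xyz abc_body")         # false
```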

I need a regex to find a url which is not inside any html tag or an attribute value of any html tag

I have html contents in following text.
"This is my text to be parsed which contains url
http://someurl.com?param1=foo&params2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
I need a regex that converts plain URLs to hyperlinks without tampering with existing hyperlinks.
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo&params2=bar">
http://someurl.com?param1=foo&params2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span> "
Disclaimer: You shouldn't use regex for this task; use an HTML parser. This is a proof of concept to demonstrate that it is possible if you can expect well-formed HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean ?
( : group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : optionally match w, ww or www followed by any one character (the dot is unescaped, so it is not limited to a literal .)
[^\s]*? : match anything except whitespace zero or more times ungreedy
(?:\.[a-z]+)+ : match a dot followed by [a-z] character(s), repeat this one or more times
) : end of group 1
(?! : negative lookahead
[^<]*? : match anything except < zero or more times ungreedy
(?:<\/\w+>|\/?>) : match a closing tag or /> or >
) : end of lookahead
regex101 online demo
rubular online demo
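A rough sketch of how the pattern might be applied in Ruby with gsub. The sample input is a shortened, hypothetical string (not the text from the question), and the constant name PLAIN_URL is made up for illustration.

```ruby
# The pattern from the answer above, unchanged.
PLAIN_URL = /(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))/

# Hypothetical, well-formed sample input.
html = 'see http://someurl.com here <a href="http://keep.com">x http://inner.com y</a>'

# Wrap each match in an anchor; \1 is the URL captured by group 1.
linked = html.gsub(PLAIN_URL, '<a href="\1">\1</a>')

puts linked
# see <a href="http://someurl.com">http://someurl.com</a> here <a href="http://keep.com">x http://inner.com y</a>
```

Only the bare URL is wrapped; the one inside the href attribute and the one inside the existing anchor are left alone. Because group 1 ends at the last .[a-z]+ segment, a query string such as ?param1=foo would not be captured by this pattern.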
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. It might be tricky if you have nested elements of the same type.
Maybe try http://rubular.com/; it has some regex tips that can help you get the desired output.
I would do something like this:
require 'nokogiri'
require 'uri'
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF
doc.search('*').each{|n| n.replace "\n"}
URI.extract doc.text
#=> ["http://someurl.com"]

xpath expression to remove whitespace

I have this HTML:
<tr class="even expanded first">
<td class="score-time status">
<a href="/matches/2012/08/02/europe/uefa-cup/">
16 : 00
</a>
</td>
</tr>
I want to extract the (16 : 00) string without the extra whitespace. Is this possible?
I. Use this single XPath expression:
translate(normalize-space(/tr/td/a), ' ', '')
Explanation:
normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.
translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
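For readers who end up doing this step outside XPath, the same two stages can be approximated in plain Ruby. This is only a sketch: the indentation in the sample text is an assumption, and Ruby's String#split treats a slightly wider set of characters as whitespace than XPath does.

```ruby
# Text content of the <a> element, whitespace included (indentation assumed).
raw = "\n      16 : 00\n    "

# Roughly what normalize-space() does: trim, then collapse runs of whitespace.
normalized = raw.split.join(' ')

# Roughly what translate(..., ' ', '') does: drop every remaining space.
compact = normalized.delete(' ')

p normalized  # "16 : 00"
p compact     # "16:00"
```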
II. Alternatively:
translate(/tr/td/a, ' &#9;&#10;&#13;', '')
Try the XPath expression below:
//td[@class='score-time status']/a[normalize-space() = '16 : 00']
You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]
I came across this thread when I had a similar issue of my own.
HTML
<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
<a href="/nsomar/OAStackView/releases/tag/1.0.1">
1.0.1
</a>
XPath start command
tree.xpath('//div[@class="d-flex"]/h4/a/text()')
However this grabbed random whitespace and gave me the output of:
['\n ', '\n 1.0.1\n ']
Adding a normalize-space() predicate removed the first whitespace-only node and left me with just what I wanted:
tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')
['\n 1.0.1\n ']
I could then grab the first element of the list, and use strip() to remove any further whitespace
XPath final command
tree.xpath('//div[@class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()
Which left me with exactly what I required:
1.0.1
You can check whether text() nodes are empty:
/path/text()[not(.='')]
This can be useful with axes such as following-sibling:: when there are no container elements, or with child::.
You can use string() or the regular-expression functions of XPath 2.0, such as matches().
NOTE: some comments say that XPath cannot do string manipulation. Even though it is not really designed for that, you can do basic things: contains(), starts-with() and, in XPath 2.0, replace().
If you want to check for whitespace-only nodes it is much harder, as you will generally have a node-list result set, and most XPath functions, like matches() or replace(), operate on only one node.
You can separate node manipulation from string manipulation.
So you may use XPath to retrieve a container, or a list of text nodes, and then process it in another language (Java, PHP, Python or Perl, for instance).
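A minimal sketch of that second, host-language step in Ruby. The list of strings is hypothetical and stands in for whatever text nodes the XPath query returned.

```ruby
# Hypothetical list of text-node strings, as an XPath query might return them.
text_nodes = ["\n    ", "\n    1.0.1\n    "]

# Trim each node, then discard the ones that were whitespace only.
values = text_nodes.map(&:strip).reject(&:empty?)

p values  # ["1.0.1"]
```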

How to parse only part of a string-value from an element using Nokogiri? RUBY, Mechanize

How do I extract numbers from a string?
The XPath is 'td[5]/p/@title'.
HTML :
<td valign="top" align="center">
<p title="6 en su sucursal" style="margin-top: 0px; margin-bottom:0px; cursor:hand">
<b>10</b>
</p>
</td>
I need to extract only the number 6 from the title attribute's string value, "6 en su sucursal".
Given some HTML in a variable named html, you'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(html)
# Collect the digits from every <p> that has a title attribute.
numbers = doc.xpath('//p[@title]').collect { |p| p[:title].gsub(/[^\d]/, '') }
Then you'll have the numbers in the numbers array. You'll have to adjust the XPath and regular expression to match your real data of course but the basic technique should be clear.
A bit of time with the Nokogiri documentation and tutorials might be fruitful.
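On the regular-expression side, it may help to see how a few plain-Ruby options behave on the title string itself. The second title, "6 de 10", is a hypothetical value used only to show what happens when a title contains more than one number.

```ruby
title = "6 en su sucursal"

# String#[] with a regex returns the first match, or nil if there is none.
first_number = title[/\d+/]

# Deleting every non-digit merges all digits in the string into one token.
all_digits = "6 de 10".gsub(/[^\d]/, '')

# scan keeps separate numbers separate.
numbers = "6 de 10".scan(/\d+/).map(&:to_i)

p first_number  # "6"
p all_digits    # "610"
p numbers       # [6, 10]
```

Which form fits depends on whether the real titles can contain more than one number.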
