How to split by HTML tags using a regex - ruby

I have a string like this:
"Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39"
and I want to split it by its HTML tags, which are always <span>. I tried something like:
my_string.split(/<span(.*)span>/)
but it didn't work; it only matched the first element correctly.
Does anyone know what is wrong with my regex? In this example, I expected the returned value to be:
["Energia Elétrica kWh", "10.942", "0,74999294" ,"8.206,39"]
I would like something like strip_tags, but instead of returning the string sanitized, get the array split by the tags removed.

Don't use a pattern to manipulate HTML. It's a path destined to make you insane.
Instead, use an HTML parser. The standard for Ruby is Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse("Energia Elétrica kWh<span class=\"_ _3\"> </span> 10.942 <span class=\"_ _4\"> </span> 0,74999294 <span class=\"_ _5\"> </span> 8.206,39")
You could use text to extract all the text, but if it's structured data you're after that often makes it hard to pull the fields back apart, because adjacent text nodes get concatenated into run-on words, so be careful there:
doc.text # => "Energia Elétrica kWh 10.942 0,74999294 8.206,39"
Instead we typically extract the data from individual nodes:
doc.search('span')[1].next_sibling.text # => " 0,74999294 "
doc.search('span').last.next_sibling.text # => " 8.206,39"
Or, we iterate over the nodes, then use map to grab the node's text:
doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["10.942", "0,74999294", "8.206,39"]
I'd go about the problem like this:
data = [doc.at('span').previous_sibling.text.strip] # => ["Energia Elétrica kWh"]
data += doc.search('span').map{ |span| span.next_sibling.text.strip }
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
Or:
spans = doc.search('span')
data = [
spans.first.previous_sibling.text,
*spans.map{ |span| span.next_sibling.text }
].map(&:strip)
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]
While a regular expression can often work on a first attempt, a change in the format of the HTML can break the pattern, forcing another change, then another, and another, until the pattern is too convoluted to maintain. A properly written parser approach, on the other hand, is typically resilient and immune to that problem.

If you really need to use regex to do this, you pretty much had it already.
irb(main):010:0> string.split(/<span.+?span>/)
=> ["Energia Eltrica kWh", " 10.942 ", " 0,74999294 ", " 8.206,39"]
You just needed the ? to make the .+ non-greedy, so it matches as little as possible.
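If you also want the surrounding whitespace gone, so the result matches the array from the question, stripping each piece after the split should do it:
string.split(/<span.+?span>/).map(&:strip)
# => ["Energia Elétrica kWh", "10.942", "0,74999294", "8.206,39"]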

Related

How to retrieve string using XPath without returning null errors

I'm trying to write "Private Equity Group; USA" to a file.
"Private Equity Group" prints fine, but I get an error for the "USA" portion
TypeError: null is not an object (evaluating 'style.display')"
HTML code:
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
The XPath for "USA" is:
//*[#id="addrDiv-Id"]/div/div[3]/text()
I get the error when I print the XPath or have it in an if statement:
if internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]/text()')
  file.puts "#{internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]/text()')}"
end
Capybara is not a general-purpose XPath library; it is a library aimed at testing, and is therefore element-centric. The XPaths used need to refer to elements, not text nodes.
if internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]')
  file.puts internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]').text
end
although using XPath at all for this is just a bad idea. Whenever possible, default to CSS: it's easier to read and faster for the browser to process - something like
if internet.has_css?('#addrDiv-Id > div > div:nth-of-type(3)')
  file.puts internet.find('#addrDiv-Id > div > div:nth-of-type(3)').text
end
or if the HTML allows it (I don't know without seeing more of the HTML)
if internet.has_css?('#addrDiv-Id .cl.profile-xsmall')
  file.puts internet.find('#addrDiv-Id .cl.profile-xsmall').text
end
or even cleaner if it works for your use case
file.puts internet.first('#addrDiv-Id .cl.profile-xsmall')&.text
Another way to do it:
xml = %{<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA</div>}
require 'rexml/document'
doc = REXML::Document.new xml
print(REXML::XPath.match(doc, 'normalize-space(string(//div[@class="cl profile-xsmall"]))'))
Output:
["Private Equity Group USA"]
I'd say the HTML isn't well structured; using a span for "USA" would have been better, but this works:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
EOT
div = doc.at('.profile-small-bold')
[div.text.strip, div.next_sibling.text.strip].join(' ')
# => "Private Equity Group USA"
which can be reduced to:
[div, div.next_sibling].map { |n| n.text.strip }.join(' ')
# => "Private Equity Group USA"
The problem is that you have two nested divs, with "USA" trailing, so it's important to point to the inner node which has the main text you want. Then "USA" is in the following text node, which is accessible using next_sibling:
div.next_sibling.class # => Nokogiri::XML::Text
div.next_sibling # => #<Nokogiri::XML::Text:0x3c "\n USA\n">
Note that I'm using CSS selectors; they're easier to read, which is echoed by the Nokogiri documentation. I have no proof they're faster, and, because Nokogiri uses libxml to process both, there's probably no real difference worth worrying about, so use whatever makes more sense and run benchmarks if you're curious.
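If you are curious, a rough sketch using Ruby's Benchmark module against the same doc parsed above would look something like this (the iteration count is arbitrary and absolute numbers will vary):
require 'benchmark'

Benchmark.bm(7) do |b|
  b.report('CSS')   { 20_000.times { doc.at_css('.profile-small-bold') } }
  b.report('XPath') { 20_000.times { doc.at_xpath('//div[@class="cl profile-small-bold"]') } }
end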
You might be tempted to use text against the div class="cl profile-xsmall" node, but don't be sucked into that, as it's a trap:
doc.at('.profile-xsmall').text # => "\n Private Equity Group\n USA\n"
doc.at('.profile-xsmall').text.gsub(/\s+/, ' ').strip # => "Private Equity Group USA"
text will return a string of the text nodes after they're concatenated together. In this particular, rare case it produces a somewhat usable result; usually, however, you'll get something like this:
doc = Nokogiri::HTML('<div><p>foo</p><p>bar</p></div>')
doc.at('div').text # => "foobar"
doc.search('p').text # => "foobar"
Once those text nodes have been concatenated it's really difficult to take them apart again. Nokogiri's documentation talks about this:
Note: This joins the text of all Node objects in the NodeSet:
doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"
Instead, if you want to return the text of all nodes in the NodeSet:
doc.css('d').map(&:text) # => ["foo", "bar"]
The XPath for "USA" is:
//*[#id="addrDiv-Id"]/div/div[3]/text()
Um, no, not according to the HTML you gave us. But, let's pretend.
Using an absolute path to a node is a good way to write fragile selectors. It takes only a small change in the HTML to break your access to the node. Instead, find way-points to skip through the HTML to find the node you want, taking advantage of CSS and XPath to search downward through the DOM.
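For example, against the doc parsed above, a way-point based lookup stays short and keeps working when the layout shifts, while the absolute path is brittle (and doesn't even resolve against the posted fragment):
doc.at_xpath('//*[@id="addrDiv-Id"]/div/div[3]')           # => nil against this fragment
doc.at_css('.profile-small-bold').next_sibling.text.strip  # => "USA"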
Typically, a selector like yours is generated by a browser, which isn't a good source to trust. Browsers often do fixups on malformed HTML, changing it from what Nokogiri or another parser would see and leaving you with a non-existent target, or they present the HTML after JavaScript has had a chance to run, which can move nodes, hide them, add new ones, etc.
Instead of trusting the browser, use curl, wget or nokogiri at the command-line to dump the file and look at it using a text editor. Then you'll be seeing it just as Nokogiri sees it, prior to any fixups or mangling.
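If you'd rather stay in Ruby than shell out to curl or wget, a minimal sketch using open-uri (the URL here is just a placeholder) would be:
require 'open-uri'

# Save the raw HTML exactly as Nokogiri would receive it, before any browser fixups
File.write('page.html', URI.open('http://example.com/the-page').read)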

Result of xpath is object text error - how do I get around this in Ruby on a site built around hiding everything?

My company hides most of the data on their website, and I'm trying to create a driver that will scan closed jobs to populate an array for creating new jobs, so that no user input or database access is required from users.
I did some research, and it seems this can't be done the way I'm doing it:
# Scan page and place 4 different Users into an array
name = [nil, nil, nil, nil]
compare_name = nil
c = 0
tr = 1
while c < 4
  compare_name = driver.find_element(:xpath,
    '//*[@id="job_list"]/tbody/tr[' + tr.to_s + ']/td[2]/span[1]/a/span/text()[2]').text
  if compare_name != name[c]
    name[c] = compare_name
    c += 1
    tr += 1
  elsif compare_name == name[c]
    tr += 1
  end
end
Also, I'm a newb learning as I go, so this might not be optimal or whatever; it's just how I've learned to do what I want.
Now the website code for the item i want on the screen:
<span ng-if="job.customer.company_name != null &&
job.customer.company_name != ''" class="pointer capitalize ng-scope" data-
toggle="tooltip" data-placement="top" title="" data-original-title="406-962-
5835">
<a href="/#/edit_customer/903519"class="capitalize notranslate">
<span class="ng-binding">Name Stuff<br>
<!-- ngIf: ::job.customer.is_cip_user --
<i ng-if="::job.customer.is_cip_user" class="fa fa-user-circle-o ng-scope">
::before == $0
</i>
> Diago Stein</span>
</a>
</span>
XPath can find the Diago Stein area, but because it is a text object it doesn't work. Also note that the class names, button names, etc. are the same as everywhere else on the page; they always do that, which makes it even harder to scan, because those same classes are likely used elsewhere on parts of the site that have nothing to do with this area.
Is there any way to grab this text without knowing in advance what the text will be, based on the HTML? Note that "Name Stuff" is the name of a company; I replaced it with this generic one for privacy.
Thanks for any ideas or suggestions and help.
EDIT: For clarification, I will NOT know the name of the company or the user name (in this case Diago Stein); the entire purpose of this part of the code is to populate an array with the customer names from this table on the closed-jobs page.
You can back your XPath up one level to
//*[#id="job_list"]/tbody/tr[' + tr.to_s + ']/td[2]/span[1]/a/span
then grab the innerText. The SPAN is
<span class="ng-binding">Name Stuff<br>
<!-- ngIf: ::job.customer.is_cip_user --
<i ng-if="::job.customer.is_cip_user" class="fa fa-user-circle-o ng-scope">
::before == $0
</i>
> Diago Stein</span>
The problem is that this HTML has some conditionals in it, which makes it hard to read and hard to figure out what's actually there. If we strip out the conditional, we are left with
<span class="ng-binding">Name Stuff<br>Diago Stein</span>
If we take the innerText of this, we get
Name Stuff
Diago Stein
What this means is that you can split the string on the newline: part 0 is 'Name Stuff' and part 1 is 'Diago Stein'. So you use your locator to find the SPAN, get its innerText, split it on the newline, and take the second part - that's your desired string.
This code isn't tested but it should be something like
name = driver.find_element(:xpath => "//*[#id="job_list"]/tbody/tr[' + tr.to_s + ']/td[2]/span[1]/a/span").get_text.split("\n")[1]

Split using multiple keywords using regex

Well I have a string containing (actually without line breaks)
<td class="coll-1 name">
<i class="flaticon-divx"></i>
SAME stuff here
<span class="comments"><i class="flaticon-message"></i>1</span>
</td>
and I want an array that stores the string split using href=" and /"> specifically. How can I do that? I have tried this:
new_array=my_string.split(/ href=" , \/">/)
Edit:
.split(/href="/)
This works well on its own, but not together with the other part.
.split(/\/">/)
Similarly, this works too, but I am unable to combine them into one line.
Given this string:
string = <<-HTML
<td class="coll-1 name">
<i class="flaticon-divx"></i>
SAME stuff here
<span class="comments"><i class="flaticon-message"></i>1</span>
</td>
HTML
and assuming that the correct link is the one without icon class, you could use the CSS selector a:not(.icon), for example via Nokogiri:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(string)
doc.at_css('a:not(.icon)')[:href]
#=> "/torrent/2349324/some-stuuf-here/"
You can take advantage of lookahead and lookbehind, like this:
my_string.scan(/(?<=href=").*(?=\/">)/)
#=> ["/torrent/2349324/some-stuuf-here"]
This will return an array with all occurrences of href=" ... /"> with only the ... part (which can be any string).
Or you can get everything that matches href=".../"> and then remove href=" and the trailing /">, something like this:
my_string.scan(/(?:href=".*\/">)/).map { |e| e.gsub(/(href="|\/">)/, "") }
#=> ["/torrent/2349324/some-stuuf-here"]
This will return an array of all instances that match /href=".*\/">/.
How do I split using 2 keywords using regex
You can use | to denote an "or" in a regex, like this:
my_string.split(/(?:href="|\/">)/)

I need to verify whether a label "Annuitant" and its value "RPD" are present or not, using the parent class

<div class="col-sm-3">
<span>Annuitant:</span>
</div>
<div class="col-sm-3">
<span id="annuitant">
RPD
</span>
</div>
XPath code that I used previously:
findXpath=page.find('label', text: workbook.cell(j,k), :match => :prefer_exact).path
splitXpath=(findXpath.split("/")) #splitting xpath
##Xpath manipulation to get the xpath of "RPD"
count1=splitXpath.count
value1=splitXpath.at(count1-3)
value=splitXpath.at(count1-2)
labelNum=value1.match(/(\d+)/)
i=0
elementNum=labelNum[1].to_i+1
for maxnum in 1..splitXpath.count-4
elementXpath=elementXpath + "/" + splitXpath[maxnum]
end
elementXpath=elementXpath + "/div[" + elementNum.to_s + "]" + "/"+ value
elementXpath=elementXpath + "/" + splitXpath.at(count1-1)
finalElementXpath=elementXpath.sub("label","span")# obtained the xpath of RPD
if (workbook.cell(j+1,k) == (find(:xpath, finalElementXpath).native.text)) # verifying the value RPD is present
Can I use the parent class to verify whether "Annuitant" is present, and also check whether the Annuitant value is "RPD"? Please help me write code for this in Ruby with Capybara.
Use assert_selector to check if the selector has the text you want. See below:
page.assert_selector('#annuitant', :text => 'RPD', :visible => true)
You can scope Capybara's finders/matchers to any element by either calling them on an element or using within(element) ...
In this case you'd want to scope to at least one level higher in your HTML document, so that both elements you are interested in are contained by the element you're scoping to. Also, the class 'col-sm-3' would be a bad choice because it is not going to be unique to these elements. Another thing this comes down to is how rigorous your check needs to be: do you actually need to check the structure of the elements, or do you just need to verify that the text appears next to each other on the page? If the latter, something like
element = find('<selector for parent/grandparent of both elements>') # could also just be `page` if the text is unique
expect(element).to have_text('Annuitant: RPD')
if you do actually need to verify the structure things get more complicated and you would need to use XPath
expect(element).to have_selector(:xpath, './/div[./span[text()="Annuitant:"]]/following-sibling::div[1][./span[normalize-space(text())="RPD"]]')
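If you prefer scoping with within, a minimal sketch would be the following (the '#annuitant-row' container selector is hypothetical; use whatever element actually wraps both divs):
within('#annuitant-row') do
  expect(page).to have_css('span', text: 'Annuitant:')
  expect(page).to have_css('#annuitant', text: 'RPD')
end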

I need a regex to find a url which is not inside any html tag or an attribute value of any html tag

I have HTML content in the following text.
"This is my text to be parsed which contains url
http://someurl.com?param1=foo&params2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
I need a regex that will convert the plain URLs to hyperlinks (without tampering with the existing hyperlinks).
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo&params2=bar">
http://someurl.com?param1=foo&params2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span> "
Disclaimer: You shouldn't use a regex for this task; use an HTML parser. This is a POC to demonstrate that it's possible if you expect well-formatted HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean?
( : start of group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : optionally match w., ww. or www.
[^\s]*? : match anything except whitespace, zero or more times, ungreedy
(?:\.[a-z]+)+ : match a dot followed by [a-z] character(s), repeated one or more times
) : end of group 1
(?! : start of negative lookahead
[^<]*? : match anything except < zero or more times, ungreedy
(?:<\/\w+>|\/?>) : match a closing tag, or /> or >
) : end of lookahead
regex101 online demo
rubular online demo
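To actually perform the replacement with this pattern, a gsub sketch might look like the following, where html stands for the original string from the question:
pattern = %r{(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))}

# Wrap every bare URL the pattern finds in an anchor tag, leaving existing markup alone
linked = html.gsub(pattern) { |url| %(<a href="#{url}">#{url}</a>) }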
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?<\/\1>/. But it might be tricky if you have nested elements of the same type.
Maybe try http://rubular.com/ - there are some regex tips there that will help you get the desired output.
I would do something like this:
require 'nokogiri'
require 'uri'
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF
doc.search('*').each{|n| n.replace "\n"}
URI.extract doc.text
#=> ["http://someurl.com"]
