RegEx code works in theory but not when code is run - ruby

I'm trying to use this RegEx search in Ruby: <div class="ms3">(\n.*?)+< However, as soon as I add the last character, "<", it stops working altogether. I've tested it in Rubular and the RegEx works perfectly fine. I'm using RubyMine to write my code, but I also tested it from PowerShell and got the same result, with no error message. When I run <div class="ms3">(\n.*?)+ it prints <div class="ms3">, which is exactly what I'm looking for, but as soon as I add the "<" it prints nothing.
My code:
#!/usr/bin/ruby
# encoding: utf-8
File.open('ms3.txt', 'w') do |fo|
  fo.puts File.foreach('input.txt').grep(/<div class="ms3">(\n.*?)+/)
end
Some of what I'm searching through:
<div class="ms3">
<span xml:lang="zxx"><span xml:lang="zxx">Still the tone of the remainder of the chapter is bleak. The</span> <span class="See_In_Glossary" xml:lang="zxx">DAY OF THE <span class="Name_Of_God" xml:lang="zxx">LORD</span></span> <span xml:lang="zxx">holds no hope for deliverance (5.16–18); the futility of offering sacrifices unmatched by common justice is once more underlined, and exile seems certain (5.21–27).</span></span>
</div>
<div class="Paragraph">
<span class="Verse_Number" id="idAMO_5_1" xml:lang="zxx">1</span><span class="scrText">Listen, people of Israel, to this funeral song which I sing over you:</span>
</div>
<div class="Stanza_Break"></div>
The full RegEx I need is <div class="ms3">(\n.*?)+<\/div>, but it picks up the first section and nothing else.

Your problem starts with File.foreach('input.txt'), which splits the input into lines. The pattern is then matched against each line separately, so no line can match: a single line never contains \n followed by more text.
You should have better luck reading the whole file as one string and calling match on it:
File.read('input.txt').match(/<div class="ms3">(\n.*?)+<\/div>/)
# => #<MatchData "<div class=\"ms3\">\n <span xml:lang=\"zxx\">
# => <span xml:lang=\"zxx\">Still the tone of the remainder of the chapter is bleak. The</span>
# => <span class=\"See_In_Glossary\" xml:lang=\"zxx\">DAY OF THE
# => <span class=\"Name_Of_God\" xml:lang=\"zxx\">LORD</span></span>
# => <span xml:lang=\"zxx\">holds no hope for deliverance (5.16–18);
# => the futility of offering sacrifices unmatched by common justice is once more
# => underlined, and exile seems certain (5.21–27).</span></span>\n </div>" 1:"\n ">
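The difference between matching line by line and matching the whole string can be reproduced with an in-memory sample. This is a minimal sketch, not the original poster's code: the sample HTML and the variable names are made up for illustration, and the group is written as non-capturing, (?:...), so that scan returns whole matches rather than the contents of the group.

```ruby
# Hypothetical sample standing in for the contents of input.txt.
html = <<~HTML
  <div class="ms3">
  <span>first note</span>
  </div>
  <div class="Paragraph">
  <span>verse text</span>
  </div>
  <div class="ms3">
  <span>second note</span>
  </div>
HTML

pattern = %r{<div class="ms3">(?:\n.*?)+</div>}

# Line by line, as File.foreach does: no single line holds a whole block.
line_matches = html.each_line.grep(pattern)

# Whole string at once: scan collects every non-overlapping match.
blocks = html.scan(pattern)

puts line_matches.length
puts blocks.length
puts blocks.first
```

match stops at the first block; scan is one way to collect all of them, for example with File.read('input.txt').scan(pattern).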

Related

How to get text which has no HTML tag

Following is the HTML:
<div class="ajaxcourseindentfix">
<h3>CPSC 353 - Introduction to Computer Security (3) </h3>
<hr>Security goals, security systems, access controls, networks and security, integrity, cryptography fundamentals, authentication. Attacks: software, network, website; management considerations, security standards in government and industry; security issues in requirements, architecture, design, implementation, testing, operation, maintenance, acquisition, and services.
<br>
<br>Prerequisite: CPSC 253U
<span style="display: none !important"> </span> or CPSC 254
<span style="display: none !important"> </span> and CPSC 351
<span style="display: none !important"> </span>
, declared major/minor in CPSC, CPEN, or CPEI
<br>
</div>
I need to fetch the following text from this HTML:
From Line 6 - or
From Line 7 - and
, declared major/minor in CPSC, CPEN, or CPEI
I am able to get the href [Course number: CPSC 254 etc...] with the following XPath:
# This xpath gives me all the tags followed by h3 and then I iterate through them in my script.
//div[@class='ajaxcourseindentfix']/h3/following-sibling::text()[2]/following-sibling::*
Update
And, then the text with the following XPath:
# This xpath gives me all the text after the h3 tag.
//div[@class='ajaxcourseindentfix']/h3/following-sibling::text()[2]/following-sibling::text()
I need to have these course name/prerequisite in the same way they are at URL 1.
In this approach I am getting all the HREF first, then all text. Is there a better way to achieve this? I don't want to iterate over 2 XPaths to get the HREF first, then Text and after that club them to form the prerequisite string.
1 http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=99648&show
Try the code below to get the required output:
# soup is assumed to be a BeautifulSoup object built from the page HTML
div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]
The output is
'CPSC 253U or CPSC 254 and CPSC 351 , declared major/minor in CPSC, CPEN, or CPEI'
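For comparison, the same "flatten to text, then split on the label" idea can be sketched in Ruby with only the standard library. Note the swap in technique: instead of an HTML parser, this strips tags with a regular expression, which is only reasonable for simple, trusted markup like this snippet. The shortened HTML and the variable names are assumptions for illustration.

```ruby
# Shortened, hypothetical copy of the course block from the question.
html = <<~HTML
  <div class="ajaxcourseindentfix">
  <h3>CPSC 353 - Introduction to Computer Security (3) </h3>
  <hr>Security goals, security systems, access controls.
  <br>
  <br>Prerequisite: CPSC 253U
  <span style="display: none !important"> </span> or CPSC 254
  <span style="display: none !important"> </span> and CPSC 351
  <span style="display: none !important"> </span>
  , declared major/minor in CPSC, CPEN, or CPEI
  <br>
  </div>
HTML

# Replace every tag with a space, then collapse all runs of whitespace.
text = html.gsub(/<[^>]+>/, " ").split.join(" ")

# Keep only what follows the "Prerequisite: " label.
prerequisite = text.split("Prerequisite: ").last

puts prerequisite
```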

XPath to extract text within br

<div id="t_info" class="tab-pane fade active in tab">
<br><strong>Delivery</strong> <br>
<br><br><br><strong>Model Name</strong> : BP250
<br>
<br>Full HD up-scaling dramatically improves the resolution of any original content to Full HD.
<br>
<br><strong>Barcode</strong> : 8806087225921
<br>
<br><strong>Product Type</strong> : Blu-ray Player<br>
<br>Blu-Ray Disc <br>External <br></div>
I need an XPath to capture the barcode value. The location of the barcode varies depending on the description.
I have tried //*[text()='Barcode'], but I can't capture the value.
In your case you can use the following XPath:
(//div[@id="t_info"]/text())[./preceding::strong[text()='Barcode']][1]
Please note that it is mauvais ton (bad manners)
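If the raw HTML is available as a string, a regular expression is a possible alternative to the XPath. This is only a sketch under assumptions: the shortened HTML and the variable names are made up for illustration, and the pattern assumes the value is a run of digits that follows the Barcode label and a colon.

```ruby
# Shortened, hypothetical copy of the product description from the question.
html = <<~HTML
  <div id="t_info" class="tab-pane fade active in tab">
  <br><strong>Model Name</strong> : BP250
  <br>
  <br><strong>Barcode</strong> : 8806087225921
  <br>
  <br><strong>Product Type</strong> : Blu-ray Player<br>
  </div>
HTML

# Capture the digits after "<strong>Barcode</strong> :".
# String#[] with a capture index returns nil when the label is absent.
barcode = html[%r{<strong>Barcode</strong>\s*:\s*(\d+)}, 1]

puts barcode
```

Because the pattern anchors on the label rather than on position, it does not depend on where the barcode line sits in the description.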

retrieve text from <p> on landing page using ruby watir

I have to retrieve text from a web page and print it to the console.
I am not able to get the text from the HTML below. Can anyone please help me with this?
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row">
</div>
I tried b.div(:class => 'twelve columns').exist? in irb and it returns true.
I tried b.div(:class => 'twelve columns').text, and it returns the header text, not the paragraph text.
I tried b.div(:class => 'twelve columns').p.text, and it raised an error: unable to locate element, using {:tag_name=>"p"}
Simply running this against the example you posted worked for me:
browser.div(:class => 'twelve columns').p.text
Your best bet is to check that the page actually has the element structure you posted, and that the elements are nested properly.
I slightly fixed your HTML:
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row"></div>
</div>
Let's do a tiny example:
div = b.div(:class => 'twelve columns')
Enumerating its child elements like this:
div.elements.each do |e|
p e
end
will print something like this:
<Watir::HTMLElement ... # <h1>Your product</h1>
<Watir::HTMLElement ... # <p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<Watir::HTMLElement ... #<div class="row">
If you want to select the child P element of the DIV, do this:
p = div.p
or
p = div.element( :tag_name => 'p' )
Then get the text of the P:
p.text # >> 21598: DECLINE: Decline - Property Type not acceptable under this contract
Or even do it in a single line:
b.div(:class => 'twelve columns').p.text
=> "21598: DECLINE: Decline - Property Type not acceptable under this contract"

WebDriver Capture Text by XPath

I am attempting to capture a line of text for an automated WebDriver test to use it in a comparison later on. However, I cannot find an XPath that will work with WebDriver. I have used the text() function before to capture text that is not in a tag, but in this instance that is not working. Here is the HTML, note that this text will never be the same, so I cannot use contains or similar functions.
<div id="content" class="center ui-content" data-role="content" role="main">
<div data-iscroll="scroller">
<div class="ui-corner-all ui-controlgroup ui-controlgroup-vertical" data-role="controlgroup">
<a class="ui-btn ui-corner-top ui-btn-hover-c" style="text-align: left" data-role="button" onclick="onDocumentClicked(21228772, "document.php?loan=********&folderseq=0&itemnum=21228772&pageCount=3&imageTypeName=1003 Application - Final&firstInitial=&lastName=")" href="#" data-corners="true" data-shadow="true" data-iconshadow="true" data-wrapperels="span" data-theme="c">
<span class="ui-btn-inner ui-corner-top">
<span class="ui-btn-text">
<img class="checkMark checkMark21228772 notViewedCompletely" width="15" height="15" title="You have not yet viewed this document." src="../images/white_dot.gif"/>
1003 Application - Final. (Jan 11 2012 5:04PM)
</span>
</span>
</a>
In this example, the text I am attempting to capture is: 1003 Application - Final. (Jan 11 2012 5:04PM)
I have inspected the element with Firebug and I have tried the following XPaths with no success.
html/body/div[1]/div[2]/div/div/a[1]/span/span
html/body/div[1]/div[2]/div/div/a[1]/span/span/text()
The WebDriver test is being written in C#.
You can either use this:
driver.FindElement(By.XPath("//div[@id='content']//span[@class='ui-btn-text']"))
or
var elem = driver.FindElement(By.Id("content"));
string text = string.Empty;
if (elem != null) {
    // The span is nested inside the div, so search its descendants rather than its siblings.
    var textElem = elem.FindElement(By.XPath(".//span[@class='ui-btn-text']"));
    if (textElem != null) text = textElem.Text;
}
I was able to solve this issue by removing the span tags from the XPath.
GetText("html/body/div[3]/div[2]/div/div/a[1]", SelectorType.XPath);
Python WebDriver code looks something like this:
driver.find_element_by_xpath("//span[@class='ui-btn-text']").text
But the locator may not be unique, because I can't see all of the page source.
PS: Try never to use locators like html/body/div[1]/div[2]/div/div/a[1]/span/span
Approach:
Find the CSS selector from the given DOM.
Derived CSS: css=#content div.ui-controlgroup > a[onclick*='onDocumentClicked'] > span > span
Use the C# library method to get the text.

Get Text between two tags using nokogiri

My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank"" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class=line></div> elements. For each div I need to extract the Phone and Fax numbers into an array along with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
div.css("div.line").each do |line|
line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
end
end
It returns nothing and then fails with a timeout error.
If I try
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
it returns the same phone and fax for every div. Please suggest any other solution that will help me.
The easy way:
phone, fax = line.text.scan(/\d{3}-\d{3}-\d{4}/)
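The scan approach can be tried without Nokogiri by standing in a plain string for line.text. This is a minimal sketch: line_text and the labeled variant are assumptions for illustration, not part of the original answer. The labeled variant keys each number by the word in front of it, which may help when a div has only a phone or only a fax number.

```ruby
# Hypothetical stand-in for line.text of one <div class="line">.
line_text = <<~TEXT
  Header
  Mailing Address
  2349 Glorem ipsun lorem ipsum CA 95833
  Phone: 111-111-2111 Fax: 111-511-1111
  some text
  some address
TEXT

# Positional: the first number is taken as the phone, the second as the fax.
phone, fax = line_text.scan(/\d{3}-\d{3}-\d{4}/)

# Labeled: each number is paired with the label that precedes it.
numbers = line_text.scan(/(Phone|Fax):\s*(\d{3}-\d{3}-\d{4})/).to_h

puts phone
puts fax
puts numbers.inspect
```

As for the original attempt, an XPath that begins with // is evaluated from the document root even when it is called on a node, which would explain every div returning the same values; a leading .// keeps the search inside the current div.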
