retrieve text from <p> on landing page using ruby watir - ruby

I have to retrieve text from a web page and print it to the console.
I am not able to get the text from the HTML below. Can anyone please help me with this?
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row">
</div>
I tried b.div(:class => 'twelve columns').exist? in irb and it returns true.
I tried b.div(:class => 'twelve columns').text, and it returns the text of the header, not the paragraph.
I tried b.div(:class => 'twelve columns').p.text, and it returned an error - unable to locate element, using {:tag_name=>"p"}

Simply running this against the example you posted worked for me:
browser.div(:class => 'twelve columns').p.text
Your best bet would be to check that the page actually contains the element structure you provided, and that the elements are nested properly.
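If you are not sure what the div actually contains at the moment you query it, you can dump its HTML first. A minimal sketch, assuming b is your Watir browser from the question:
puts b.div(:class => 'twelve columns').html
If the <p> is missing from that output, it is probably added later by Javascript, and you would need to wait for it before reading its text.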

I slightly fixed your HTML:
<div class="twelve columns">
<h1>Your product</h1>
<p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<div class="row"></div>
</div>
Let's do a tiny example:
div = b.div(:class => 'twelve columns')
Enumerating its elements as follows:
div.elements.each do |e|
  p e
end
will output something like this:
<Watir::HTMLElement ... # <h1>Your product</h1>
<Watir::HTMLElement ... # <p>21598: DECLINE: Decline - Property Type not acceptable under this contract</p>
<Watir::HTMLElement ... #<div class="row">
If you want to select the child p element of that div, do this:
p = div.p
or
p = div.element( :tag_name => 'p' )
And then get the text of the p:
p.text # >> 21598: DECLINE: Decline - Property Type not acceptable under this contract
Or even do it all with your single line:
b.div(:class => 'twelve columns').p.text
=> "21598: DECLINE: Decline - Property Type not acceptable under this contract"

Related

output XML nodes out into individual files

I am trying to create individual files from the nodes of an XML file. My issue is that no matter which way I try it, I seem to get stuck in a nested loop: either I keep rewriting each file until they all contain the same node data over and over, or I write all of the nodes on every loop iteration. I'm sure this should be pretty easy, but I'm getting hung up somewhere.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  item.xpath("//div[@class='meeting-date']/span/@content").each do |date|
    date = date.to_s
    split_date = date.split('T00')
    split_date = split_date[0].gsub("-", "_")
    split_date = split_date + ".pcf"
    File.open(split_date, 'w') { |f| f.write(item) }
  end
end
This is another attempt; I don't understand why it fails to create all the pages. It only creates one page, but if I use a "puts" the count does iterate through all 101 nodes.
doc = Nokogiri::XML(open("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  date = item.xpath("//no-name/div[@class='meeting-date']/span/@content").to_s
  split_date = date.split('T00')
  split_date = split_date[0].gsub("-", "_")
  split_date = split_date + ".pcf"
  File.open(split_date, 'w') { |f| f.write(item) }
end
For further clarification, here is an example of the nodes that I'm trying to turn into pages.
<?xml version="1.0" encoding="UTF-8" ?>
<nodes>
<node>
<no-name><div class="meeting-title">Meeting-a</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="a-2021-11-29T00:00:00-06:00">Monday, November 29, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-b</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="e-2021-09-10T00:00:00-05:00">Friday, September 10, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
<node>
<no-name><div class="meeting-title">Meeting-c</div>
<div class="meeting-date"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="f-2021-08-13T00:00:00-05:00">Friday, August 13, 2021</span></div>
</no-name>
<no-name><div class="past-mtg-icons">
<div>
<span><img src="agenda-icon.svg"/></span>
<span>Agenda</span>
</div>
<div>
<span><img src="webcast-icon.svg"/></span>
<span>11/29</span>
</div>
</div>
<div class="meeting-body"></div></no-name>
</node>
</nodes>
date = item.xpath("//no-name/div[@class='meeting-date']/span/@content").to_s
By using // you are breaking out of the scope of the node you are iterating over. By removing the leading slashes you preserve the scope of the node:
date = item.xpath("no-name/div[@class='meeting-date']/span/@content").to_s
When you use the w mode it always overwrites the file. What you need is to create or append to the file, which is done with the a mode. So you can try this:
File.open(split_date,'a'){ |f| f << item }
PS. Make sure that split_date, the name of the file, is unique for each node, since you want a separate file per node.
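Putting both fixes together, here is a minimal sketch of the whole loop (assuming the same original_copy_mod.xml as above): a node-scoped XPath without the leading slashes, and append mode when writing.
require 'nokogiri'

doc = Nokogiri::XML(File.read("original_copy_mod.xml"))
doc.xpath("//nodes/node").each do |item|
  # scoped to this node, so each node yields its own date
  date = item.xpath("no-name/div[@class='meeting-date']/span/@content").to_s
  split_date = date.split('T00')[0].gsub("-", "_") + ".pcf"
  # append so a later iteration cannot wipe an earlier file
  File.open(split_date, 'a') { |f| f << item }
end
This writes one .pcf file per node, named after that node's meeting date.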

RegEx code works in theory but not when code is run

I'm trying to use this RegEx search in Ruby: <div class="ms3">(\n.*?)+< However, as soon as I get to the last character "<" it stops working altogether. I've tested it in Rubular and the RegEx works perfectly fine. I'm using RubyMine to write my code, but I also tested it using PowerShell and it comes up with the same results, with no error message. When I run <div class="ms3">(\n.*?)+ it prints <div class="ms3">, which is exactly what I'm looking for, but as soon as I add the "<" it comes out with nothing.
my code:
#!/usr/bin/ruby
# encoding: utf-8
File.open('ms3.txt', 'w') do |fo|
  fo.puts File.foreach('input.txt').grep(/<div class="ms3">(\n.*?)+/)
end
some of what i'm searching through:
<div class="ms3">
<span xml:lang="zxx"><span xml:lang="zxx">Still the tone of the remainder of the chapter is bleak. The</span> <span class="See_In_Glossary" xml:lang="zxx">DAY OF THE <span class="Name_Of_God" xml:lang="zxx">LORD</span></span> <span xml:lang="zxx">holds no hope for deliverance (5.16–18); the futility of offering sacrifices unmatched by common justice is once more underlined, and exile seems certain (5.21–27).</span></span>
</div>
<div class="Paragraph">
<span class="Verse_Number" id="idAMO_5_1" xml:lang="zxx">1</span><span class="scrText">Listen, people of Israel, to this funeral song which I sing over you:</span>
</div>
<div class="Stanza_Break"></div>
The full RegEx I need is <div class="ms3">(\n.*?)+<\/div>, which picks up the first section and nothing else.
Your problem starts with using File.foreach('input.txt'), which cuts the input into lines. This means the pattern is matched against each line separately, so none of the lines match the pattern (by definition, no single line has \n in the middle of it).
You should have better luck reading the whole text as a block, and using match on it:
File.read('input.txt').match(/<div class="ms3">(\n.*?)+<\/div>/)
# => #<MatchData "<div class=\"ms3\">\n <span xml:lang=\"zxx\">
# => <span xml:lang=\"zxx\">Still the tone of the remainder of the chapter is bleak. The</span>
# => <span class=\"See_In_Glossary\" xml:lang=\"zxx\">DAY OF THE
# => <span class=\"Name_Of_God\" xml:lang=\"zxx\">LORD</span></span>
# => <span xml:lang=\"zxx\">holds no hope for deliverance (5.16–18);
# => the futility of offering sacrifices unmatched by common justice is once more
# => underlined, and exile seems certain (5.21–27).</span></span>\n </div>" 1:"\n ">
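If the end goal is still to copy every matching block into ms3.txt, here is a minimal sketch along the same lines (same file names as in the question; the group is made non-capturing so scan returns the full <div>...</div> blocks):
# read the whole file so the pattern can span line breaks
text = File.read('input.txt')
File.open('ms3.txt', 'w') do |fo|
  # write each <div class="ms3">...</div> block to the output file
  text.scan(/<div class="ms3">(?:\n.*?)+<\/div>/) { |block| fo.puts block }
end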

Scrapy can't find XPath content

I'm writing a web crawler with Scrapy to download the text of talk-backs on a certain webpage.
Here is the relevant part of the code behind the webpage, for a specific talkback:
<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
<div class="talkback-topic">
<a class="show-comment" data-ajax-url="/comments/71339.js?counter=97&num=57" href="/comments/71339?counter=97&num=57">57. talk back title here </a>
</div>
<div class="talkback-message"> blah blah blah talk-back message here </div>
....etc etc etc ......
While writing an XPath to get the message:
titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
and later on:
item["title"] = titles.xpath("div[#class='talkback-message']text()").extract()
There's no error, but it doesn't work. Any ideas why? I suppose I'm not writing the path correctly, but I can't find the mistake.
Thank you :)
The whole code:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem
class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.xpath("div[@class='talkback-message']text()").extract()
            items.append(item)
        return items
Here's a snippet of the HTML page for #site_comment_74240
<div class="site_comment site_comment-even small normal-rank" id="site_comment_74240">
<div class="talkback-topic">
144. מדיניות
</div>
<div class="talkback-username">
<table><tr>
<td>קייזרמן פרדי </td>
<td>(01.11.2013)</td>
</tr></table>
</div>
The "talkback-message" div is not in the HTML page when you first fetch it, but rather is fetched asynchronously via some AJAX query when you click on a comment title, so you'll have to fetch it for each comment.
Comment blocks, titles in your code snippet, can be grabbed using an XPath like this: //div[starts-with(@id, "site_comment_")], i.e. all divs that have an "id" attribute beginning with the string "site_comment_".
You can also use CSS selectors with Selector.css(). In your case, you can grab comment blocks using either the "id" approach (as I've done above using XPath), so:
titles = sel.css("div[id^=site_comment_]")
or using the "site_comment" class without the other "site_comment-even", "site_comment-odd", "small", "normal-rank" or "high-rank" that vary:
titles = sel.css("div.site_comment")
Then you would issue a new Request using the URL that's in ./div[@class="talkback-topic"]/a[@class="show-comment"]/@data-ajax-url inside that comment div. Or using CSS selectors, div.talkback-topic > a.show-comment::attr(data-ajax-url) (by the way, ::attr(...) is not standard, but is a Scrapy extension to CSS selectors using pseudo-element functions).
What you get from the AJAX call is some Javascript code, and you want to grab the content inside old.after(...)
var old = $("#site_comment_72765");
old.attr('id', old.attr('id') + '_small');
old.hide();
old.after("\n<div class=\"site_comment site_comment-odd large high-rank\" id=\"site_comment_72765\">\n <div class=\"talkback-topic\">\n <a href=\"/comments/72765?counter=42&num=109\" class=\"show-comment\" data-ajax-url=\"/comments/72765.js?counter=42&num=109\">109. ביבי - האדם הנכון בראש ממשלת ישראל(לת)<\/a>\n <\/div>\n \n <div class=\"talkback-message\">\n \n <\/div>\n \n <div class=\"talkback-username\">\n <table><tr>\n <td>ישראל <\/td>\n <td>(11.03.2012)<\/td>\n <\/tr><\/table>\n <\/div>\n <div class=\"rank-controllers\">\n <table><tr>\n \n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=up\"><img alt=\"\" src=\"/images/elements/thumbU.png?1376839523\" /><\/a><\/td>\n <td> | <\/td>\n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=down\"><img alt=\"\" src=\"/images/elements/thumbD.png?1376839523\" /><\/a><\/td>\n \n <td> | <\/td>\n <td>11<\/td>\n \n <\/tr><\/table>\n <\/div>\n \n <div class=\"talkback-links\">\n <a href=\"/comments/new?add_to_root=true&html_id=site_comment_72765&sibling_id=72765\">תגובה חדשה<\/a>\n \n <a href=\"/comments/72765/comments/new?html_id=site_comment_72765\">הגיבו לתגובה<\/a>\n \n <a href=\"/i/offensive?comment_id=72765\" data-noajax=\"true\">דיווח תוכן פוגעני<\/a>\n <\/div>\n \n<\/div>");
var new_comment = $("#site_comment_72765");
This is HTML data that you'll need to parse again, using something like Selector(text=this_ajax_html_data) and a .//div[@class="talkback-message"]//text() XPath or a div.talkback-message ::text CSS selector.
Here's a skeleton spider to get you going with these ideas:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem
import urlparse
import re
class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        sel = Selector(response)
        comments = sel.css("div.site_comment")
        for comment in comments:
            item = CraigslistSampleItem()
            # this probably has to be fixed
            #item["title"] = comment.xpath("div[@class='talkback-message']text()").extract()
            # issue an additional request to fetch the Javascript
            # data containing the comment text
            # and pass the incomplete item via the meta dict
            for url in comment.css('div.talkback-topic > a.show-comment::attr(data-ajax-url)').extract():
                yield Request(url=urlparse.urljoin(response.url, url),
                              callback=self.parse_javascript_comment,
                              meta={"item": item})
                break

    # the line we are looking for begins with "old.after"
    # and we want everything inside the parentheses
    _re_comment_html = re.compile(r'^old\.after\((?P<html>.+)\);$')

    def parse_javascript_comment(self, response):
        item = response.meta["item"]
        # loop over the lines of the Javascript content
        for line in response.body.split("\n"):
            matching = self._re_comment_html.search(line.strip())
            if matching:
                # what's inside the parentheses is a Javascript string
                # with escaped double quotes;
                # a simple way to decode that into a Python string
                # is to use eval()
                # then there are these "<\/tag>" we want to remove
                html = eval(matching.group("html")).replace(r"<\/", "</")
                # once we have the HTML snippet, decode it using Selector()
                decoded = Selector(text=html, type="html")
                # and save the message text in the item
                item["message"] = u''.join(decoded.css('div.talkback-message ::text').extract()).strip()
                # and return it
                return item
You can try it out using scrapy runspider tbkspider.py.

Get Text between two tags using nokogiri

My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank"" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class="line"></div> elements. For each div I need to extract the Phone and Fax into an array along with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
div.css("div.line").each do |line|
line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
end
end
It returns nothing and then times out.
If I try
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
it returns the same Phone and Fax for all the other divs. Please suggest any other solution that will help me.
The easy way:
phone, fax = line.text.scan /\d{3}-\d{3}-\d{4}/
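Applied to the whole page, a minimal sketch (assuming doc is the parsed Nokogiri document from the question):
doc.css("div.line").map do |line|
  # scan runs on this div's own text, so the scoping problem with // goes away
  phone, fax = line.text.scan(/\d{3}-\d{3}-\d{4}/)
  { phone: phone, fax: fax }
end
This yields one hash per div.line, with nil values when a div has no matching numbers.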

Locating element in same paragraph of another element in watir-webdriver

Given the following HTML code snippet, after finding the link by ID, how would you select the checkbox in the same paragraph?
For example, I want to select the checkbox associated with the link with ID="inst_17901-1746-1747".
The order of the paragraphs in the DIV is not consistent between sessions, so I cannot select the checkbox by index or by its ID.
<div id="inst-results">
<p>
<input id="inst-results0-check" type="checkbox">
<a class="ws-rendered" id="inst_17901-1746-1747" title="!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
<p>
<input id="inst-results1-check" type="checkbox"><a class="ws-rendered" id="inst_17882-1746-1747" title="!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
</div>
I figured out this solution working off the text of the link, but Zeljko's solution is much better.
$browser.div(:id, "inst-results").ps.each { |para|
  if para.link.text == "!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range" then
    para.checkbox.set
    break
  end
}
If there is only one checkbox in the paragraph with the link:
browser.link(:id => "inst_17901-1746-1747").parent.checkbox.set
Works with watir-webdriver, not sure if it would work with other Watir gems.
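As a quick usage example of that one-liner (assuming the HTML from the question), you can also read the checkbox state back after setting it:
link = browser.link(:id => "inst_17901-1746-1747")
link.parent.checkbox.set
puts link.parent.checkbox.set?   # => true once the box is ticked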
