Where is this text coming from with poltergeist? - ruby

I'm scraping my library's website with Poltergeist, in my first experience with that gem (or with Capybara, for that matter). It's working great. Super great.
def self.scrape_book_list(url)
  session = Capybara::Session.new(:poltergeist)
  session.visit(url)
  books = session.all('.js-titleCard')
  books_hash = books.map { |book|
    # getting info from the session
  }
  books_hash
end
However, after the session.visit(url) line, before it even does anything else, it prints this:
Hi there! This site is powered by OverDrive and our vision is a world enlightened by reading. Maybe a curious cat like you can help https://company.overdrive.com/company/careers/open-positions/
I've tried inspecting the page in Chrome, and even peeking at a few js sources, but I can't seem to figure out where this text is coming from!
I imagine the question is "Why/how is poltergeist doing this?" and I figured that searching the html or js code would turn the text up in some tag from the header that poltergeist perhaps always prints when it visits a page or something (maybe there's a different method to pass the url to besides visit that won't do this). But no luck!
I'm so curious (like the cat they mention)! Any ideas?

That text will be coming from a console.log(...) statement somewhere in the site's JS. By default, Poltergeist outputs all JS console logs to stdout.
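If the message is just noise, one option is to send PhantomJS's output somewhere other than stdout. Below is a minimal sketch, assuming the `phantomjs_logger` driver option described in the Poltergeist README (an object responding to write/puts). The `ConsoleCapture` class, the `register_quiet_poltergeist` helper and the `:quiet_poltergeist` driver name are made up for illustration.

```ruby
# Hypothetical helper: collects PhantomJS output instead of printing it.
class ConsoleCapture
  attr_reader :lines

  def initialize
    @lines = []
  end

  # Stores the raw text and returns the byte count, like IO#write.
  def write(message)
    @lines << message.to_s
    message.to_s.bytesize
  end

  # Stores the text with a trailing newline, like IO#puts.
  def puts(message = '')
    write("#{message}\n")
    nil
  end
end

# Registers a driver (the name :quiet_poltergeist is an assumption) whose
# PhantomJS output goes to `logger` rather than STDOUT.
def register_quiet_poltergeist(logger)
  require 'capybara/poltergeist' # needs the capybara and poltergeist gems
  Capybara.register_driver(:quiet_poltergeist) do |app|
    Capybara::Poltergeist::Driver.new(app, phantomjs_logger: logger)
  end
end
```

After calling register_quiet_poltergeist(ConsoleCapture.new), a session created with Capybara::Session.new(:quiet_poltergeist) should scrape as before, with the console.log text collected in the logger's lines array instead of being printed.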

Related

Confused about scrapy and Xpath

I am trying to scrape some data from the following website: https://xrpcharts.ripple.com/
The data I am interested in is Total XRP, which you can see immediately below or to the side (depending on your browser) of the circle diagram. So what I first did was inspect the element I am interested in. I see that it is inside <div class="stat">, inside <span ng-bind="totalXRP | number:2" class="ng-binding">99,993,056,930.18</span>.
The number 99,993,056,930.18 is what I am interested in.
So I started in a scrapy shell and wrote:
fetch("https://xrpcharts.ripple.com")
I then used chrome to copy the Xpath by right clicking on that place of HTML code, the result chrome gave me was:
/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span
Then I used the Xpath command to extract the text:
response.xpath('/html/body/div[5]/div[3]/div/div/div[2]/div[3]/ul/li[1]/div/span/text()').extract()
but this gave me an empty list []. I really do not understand what I am doing wrong here. I think I am making an obvious mistake but I don't see it. Thanks in advance!
The bottom line is: you cannot expect the page you see in the browser to be the same page Scrapy would download and have available to work with. Scrapy is not a browser.
This page is quite dynamic and complex and is constructed with the help of multiple asynchronous requests bringing in both the logic and the data. There is also JavaScript executed in the browser that plays an important role in forming and supporting the HTML document object tree.
Scrapy does not have all these things. What you get when you do fetch() is just the initial "bare bones" HTML page, without any of the "dynamic content".

CKEditor with HTML content stores, displays but cannot display for edit

I have used CKEditor for a few years without really understanding it. I now want to use it to display text which will include HTML, CSS, JavaScript and PHP example code. None of that needs to execute it is just to show the code to others.
Currently I used the textarea replace method to edit content and I need to carry on that way. When I add the content first time it is sanitised (mysqli_real_escape_string) and stored in a MySQL database correctly. It also then displays correctly with the CKEditor markup working as markup and the HTML/PHP showing as a code example. However, when I edit the content a second time the HTML examples become "real" HTML and are no longer visible as examples.
For example this:
<?php echo "hello"; ?>
<p>Hello</p>
is correctly (?) stored as:
<p>&lt;?php echo "me"; ?&gt;</p>
<p>&lt;p&gt;Hello&lt;/p&gt;</p>
and displays on the page as shown in the first code snippet (which is what I want). When I then hit edit again the code examples vanish into the background as real HTML (part of the page). If I put the code examples in as code snippets (which I would rather not have to do because of the intended users) the result in the editor (second edit) looks like this:
<!--?php echo "me"; ?-->
Hello
I am sure I am missing a basic understanding of what is going on behind the scenes, but can anyone explain how to allow users to type in text which includes HTML, CSS, JavaScript, PHP and MySQL code examples, which must then appear as examples and not markup (and be editable as examples)?
I have played with config.entities and config.protectedSource after some research but they do not seem to be relevant (or to work). Weirdly a couple of times it seemed to work fine and I thought I had cracked it but then stopped with no further changes to the config. That means I now have less idea what I am doing than when I started!
You don't mention which version you are using, but if it's relatively new (4.4+) you can use the Code Snippets plugin, which was designed exactly for this. See the demo at http://ckeditor.com/demo#widgets. It might help with the encoding issues too. There are docs on it as well.
To help with the current encoding issue, it would help a LOT if you showed us how you output the data and load it into CKEditor. For example, "When I then hit edit again" doesn't really describe anything without context. Do you use setData() with AJAX? Do you use an inline editor? Code examples would be best.

How to select from frames with the Watir Ruby Gem

When trying to select a list element's option I attempted to do:
myvar=ie.select_list(:id, 'myid').option(:text, 'mytext').select
But for some reason while I'm using Watir in irb to access the website and attempting to manipulate any of the items I get this exception.
Watir::Exception::UnknownObjectException: Unable to locate element...etc
I'm looking at the page in the browser, but .html isn't showing the full page. It looks like the rest of the page is hidden and I'm not sure how to get into/around this.
irb(main):011:0> ie.html
=> "<HTML><HEAD><TITLE>My Title</TITLE>\r\n
<SCRIPT language=JavaScript type=text/javascript src=\"../../script.js\"></SCRIPT>\r\n</HEAD><FRAMESET id=mainFrameSet name=mainFrameSet rows=100%,0%><FRAME id=frmMain src=\"DefaultT.cfm?ID=2197024\" name=frmMain><FRAME id=frmHidden src=\"Dummy.html\" name=frmHidden scrolling=no></FRAMESET></HTML>"
EDIT:
Looking at this in retrospect, I have changed the title so it more accurately addresses the issue I was having. It was difficult for a new Watir user to find information like this on Watir and frames. The original title was something like "Using Watir On An Encrypted Site". I have severely edited the question to get to the essence of what I was asking. I can't thank those enough who attempted to answer the ramblings of a new Ruby user with minimal knowledge of the Web and programming in general. Please see previous revisions if necessary.
Based on the html you added, your webpage is using frames. Unlike other elements, you have to explicitly specify the frames you want to use.
You probably want the frame with id 'frmMain', so try:
myvar=ie.frame(:id, 'frmMain').select_list(:id, 'myid').option(:text, 'mytext').select
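To see which frames a page exposes before picking one, the frameset markup returned by ie.html can be scanned for ids. This is a rough sketch in plain Ruby (no Watir needed), using the markup from the question. The regex is an assumption that only fits this flat, unquoted-id attribute layout.

```ruby
# Frameset markup copied from the ie.html output in the question.
frameset_html = '<FRAMESET id=mainFrameSet name=mainFrameSet rows=100%,0%>' \
  '<FRAME id=frmMain src="DefaultT.cfm?ID=2197024" name=frmMain>' \
  '<FRAME id=frmHidden src="Dummy.html" name=frmHidden scrolling=no></FRAMESET>'

# Collect [id, src] pairs for every FRAME tag (FRAMESET itself is skipped
# because the pattern requires a space right after "FRAME").
frames = frameset_html.scan(/<FRAME id=(\w+) src="([^"]+)"/)

frames.each { |id, src| puts "#{id} -> #{src}" }
```

The frmMain id found this way is the one passed to ie.frame(:id, 'frmMain') above.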
My guess is that the element is not on the page when you try to access it.
Try this (please notice when_present):
myvar=ie.select_list(:id, 'myid').when_present.option(:text, 'mytext').select
More information: http://watirwebdriver.com/waiting/

click on xpath link with Mechanize

I want to click a link with Mechanize that I select with xpath (nokogiri).
How is that possible?
next_page = page.search "//div[@class='grid-dataset-pager']/span[@class='currentPage']/following-sibling::a[starts-with(@class, 'page')][1]"
next_page.click
The problem is that nokogiri element doesn't have click function.
I can't read the href (URL) and send get request because the link has onclick function defined (no href attribute).
If that's not possible, what are the alternatives?
Use page.at instead of page.search when you're trying to find only one element.
You can make your selector simpler (shorter) by using CSS selector syntax:
next_page = page.at('div.grid-dataset-pager > span.currentPage + a[class^="page"]')
You can construct your own Link instance if you have the Nokogiri element, page, and mechanize object to feed the constructor:
next_link = Mechanize::Page::Link.new( next_page, mech, page )
next_link.click
However, you might not need that, because Mechanize#click lets you supply a string with the text of the anchor/button to click on.
# Assuming this link text is unique on the page, which I suspect it is
mech.click next_page.text
Edit after re-reading the question completely: However, none of this is going to help you, because Mechanize is not a web browser! It does not have a JavaScript engine, and thus won't (can't) execute your onclick for you. For this you'll need to use Ruby to control a real web browser, e.g. using Watir or Selenium or Celerity or the like.
In general you would do:
page.link_with(:node => next_link).click
However like Phrogz says, this won't really do what you want.
Why don't you use a hpricot element instead? Mechanize can click on a hpricot element as long as the link has a 'src' or 'href' attribute. Try something along these lines:
page = agent.get("http://www.example.com")
next_page = agent.click((page/"//your/xpath/a"))
Edit: After reading Phrogz's answer I also realized that this won't really do it. Mechanize doesn't support JavaScript yet. With this in mind you have 3 options.
Use a library that controls a real web browser. See @Phrogz's answer.
Use Capybara, which is an integration testing library but can also be used as a standalone crawler. I've done this successfully with HTMLUnit, which is also an integration testing library, in Java. Capybara comes with Selenium support by default, though it also supports WebKit via an external gem. Capybara interprets JavaScript out of the box. This blog post might help.
Grok the page that you intend to crawl and use something like HTTPFox to monitor what the onclick Javascript function does and replicate this in your Mechanize script.
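The third option can be sketched roughly as follows. Everything about the site here is an assumption: the onclick format `goToPage(3)`, the `page` query parameter, and the helper name `next_page_url` are all hypothetical, so substitute whatever your monitoring of the real onclick handler reveals.

```ruby
require 'uri'

# Hypothetical: assumes the pager's onclick looks like "goToPage(3)" and that
# the site accepts the page number as a ?page= query parameter.
def next_page_url(base_url, onclick)
  page_number = onclick.to_s[/goToPage\((\d+)\)/, 1]
  return nil unless page_number # onclick did not match the assumed format

  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form(page: page_number)
  uri.to_s
end

puts next_page_url('http://www.example.com/list', 'goToPage(3)')
```

In a Mechanize script, the resulting URL could then be fetched with something like mech.get(next_page_url(page.uri.to_s, next_page['onclick'])), where next_page is the Nokogiri element found earlier.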
Good luck.

How to implement Watir classes (e.g. PageContainer)?

I'm writing a sample test with Watir where I navigate around a site with the IE class, issue queries, etc.
That works perfectly.
I want to continue by using PageContainer's methods on the last page I landed on.
For instance, using its HTML method on that page.
Now I'm new to Ruby and just started learning it for Watir.
I tried asking this question on OpenQA, but for some reason the Watir section is restricted to normal members.
Thanks for looking at my question.
edit: here is a simple example
require "rubygems"
require "watir"
test_site = "http://wiki.openqa.org/"
browser = Watir::IE.new
browser.goto(test_site)
# now if I want to get the HTML source of this page, I can't use the IE class
# because it doesn't have a method which supports that
# the PageContainer class, does have a method that supports that
# I'll continue what I want to do in pseudo code (kept as comments so the script still parses)
# Store HTML source in text file
# I know how to write to a file, so that's not a problem;
# retrieving the HTML is the problem.
# more specifically, using another Watir class is the problem.
# Close browser
# end
Currently, the best place to get answers to your Watir questions is the Watir-General email list.
For this question, it would be nice to see more code. Is the application under test (AUT) opening a new window/tab that you were having trouble getting to and therefore wanted to try the PageContainer, or is it just navigating to a second page?
If it is the first one, you want to look at #attach, if it is the second, then I would recommend reading the quick start tutorial.
Edit after code added above:
What I think you missed is that Watir::IE includes the Watir::PageContainer module. So you can call browser.html to get the html displayed on the page to which you've navigated.
I agree. It seems to me that browser.html is what you want.
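The mixin point can be illustrated without Watir installed. In this sketch the names FakePageContainer and FakeBrowser are hypothetical stand-ins that only mimic how Watir::IE includes Watir::PageContainer. It is not Watir's actual implementation.

```ruby
# Stand-in for Watir::PageContainer: a module that defines #html.
module FakePageContainer
  def html
    "<html><body>#{@body}</body></html>"
  end
end

# Stand-in for Watir::IE: including the module makes #html an
# instance method of the class itself.
class FakeBrowser
  include FakePageContainer

  def initialize(body)
    @body = body
  end
end

browser = FakeBrowser.new('Hello')
puts browser.html                            # method comes from the module
puts FakeBrowser.include?(FakePageContainer) # true
```

With real Watir, the same idea means the pseudocode in the question could become something like File.write('page.html', browser.html) followed by browser.close, assuming a Ruby version that provides File.write.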
