Scroll and read slickgrid table into memory - ruby

I have a slickgrid table that I am trying to read into memory using watir-webdriver. Because the full data often cannot be seen without scrolling down, I want to make a function that can scroll through the table and also be able to tally a count of all the rows as well as access any row that might or might not be hidden within it. Here's what I have so far:
class SlickGridTable
def initialize(element)
@element = element
end
...
def scroll_down
location_y = 23
while true
location_y += 1
$browser.execute_script("arguments[0].scrollBy(0, #{location_y});", @element)
end
end
end
However I am regularly getting this error:
Selenium::WebDriver::Error::UnknownError: unknown error: undefined is not a function

I am also working with slickgrid, and considered a similar approach. Instead, I extended the Watir::Div class with a scroll_until_present method. Now we can scroll until the element is present and then work with the data in the grid. I have not needed to collect all the data since implementing this. It does not solve your problem of tallying rows, but it does help find the records you are expecting to see.
# Extends the Watir::Div class to support slick grids
module Watir
class Div
# scrolls until it finds the item you are looking for
# can be used like wait_until_present
def scroll_until_present
scroll_height = browser.execute_script('return document.getElementsByClassName("slick-viewport")[0].scrollHeight')
(0..scroll_height).step(20).each { |item|
browser.execute_script('document.getElementsByClassName("slick-viewport")[0].scrollTop = ' + item.to_s)
if present?
# scroll a little more once the record is found
item += 30
browser.execute_script('document.getElementsByClassName("slick-viewport")[0].scrollTop = ' + item.to_s)
break
end
}
end
end
end
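For the row-tallying part of the question, one possible approach is to scroll in fixed steps and record an identifier for every row rendered at each offset, since a virtualized grid only shows a window of rows at a time. The sketch below is a browser-free illustration of that bookkeeping only. The names tally_rows, step and fake_viewport are hypothetical; in real use the block would set scrollTop through execute_script and read back whatever uniquely identifies the rendered rows.

```ruby
require 'set'

# Hypothetical helper: walks a scrollable viewport in fixed steps and counts
# the distinct rows seen along the way. The block receives a scrollTop offset
# and must return the ids of the rows rendered at that offset.
def tally_rows(scroll_height, step: 20)
  seen = Set.new
  (0..scroll_height).step(step) do |offset|
    yield(offset).each { |row_id| seen << row_id }
  end
  seen.size
end

# Stand-in for the browser (an assumption for this sketch):
# 10 rows of 25px each, with 3 rows rendered at any offset.
fake_viewport = lambda do |offset|
  first = offset / 25
  (first..first + 2).select { |i| i < 10 }
end

total = tally_rows(250) { |offset| fake_viewport.call(offset) }
puts total
```

Using a Set means rows that stay visible across several scroll steps are only counted once.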

Related

Looking for a cleaner way to scrape from website by avoiding repeating

Hi I am just doing a bit of refactoring on a small cli web scraping project I did in Ruby and I was simply wondering if there was cleaner way to write a particular section without repeating the code too much.
Basically with the code below, I pulled data from a website but I had to do this per page. You will notice that both methods are only different by their name and the source.
def self.scrape_first_page
html = open("https://www.texasblackpages.com/united-states/san-antonio")
doc = Nokogiri::HTML(html)
doc.css('div.grid_element').each do |business|
biz = Business.new
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
def self.scrape_second_page
html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
doc = Nokogiri::HTML(html)
doc.css('div.grid_element').each do |business|
biz = Business.new
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
Is there a way for me to streamline this process with a single method pulling from one source, while still being able to access different pages within the same site, or is this pretty much the best and only way? The owners of the website do not have a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule and avoids the dreaded two. In other words, write methods that take no arguments, a fixed argument (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
def scrape(page)
doc = Nokogiri::HTML(URI.open(page)) # URI.open (from open-uri); Kernel#open no longer accepts URLs in Ruby 3
# Use map here to return an array of Business objects
doc.css('div.grid_element').map do |business|
Business.new.tap do |biz|
# Use tap to modify this object before returning it
biz.name = business.css('a b').text
biz.type = business.css('span.hidden-xs').text
biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
end
end
end
Note that apart from the extraction code, there's nothing specific about this. Takes a URL, returns Business objects in an Array.
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
page = start
Enumerator.new do |y|
loop do
y << base_url % page
page += 1
end
end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all business from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
# ...by scraping the page...
scrape(page)
end.take_while do |results|
# ...and iterating until there's no results, as in Array#any? is false.
results.any?
end.to_a.flatten
The .lazy part means "evaluate each part of the chain sequentially" as opposed to the default behaviour of trying to evaluate each stage to completion. This is important or else it will try and download an infinite number of pages before moving to the next test.
The .to_a on the end forces that chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
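To watch the lazy chain stop on its own without hitting a real site, here is a small self-contained sketch. The pages method is repeated from above so the block runs by itself; fake_site, fake_scrape and the example.com URLs are hypothetical stand-ins for the real scrape method and site.

```ruby
# Same infinite page generator as above, repeated so the sketch is self-contained.
def pages(base_url, start: 1)
  page = start
  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end

# Hypothetical stand-in for the site: only pages 1..3 have results.
fake_site = {
  'https://example.com/list?page=1' => %w[a b],
  'https://example.com/list?page=2' => %w[c],
  'https://example.com/list?page=3' => %w[d e]
}

# Hypothetical stand-in for scrape: records each URL it is asked for.
requested = []
fake_scrape = lambda do |url|
  requested << url
  fake_site.fetch(url, [])
end

results = pages('https://example.com/list?page=%d')
          .lazy
          .map { |url| fake_scrape.call(url) }
          .take_while(&:any?)
          .to_a
          .flatten

puts results.inspect
puts requested.size
```

Only one page past the last populated one is requested: the empty result for page 4 is what makes take_while stop.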
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs
def self.scrape(url)
  html = open(url)
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
The downside is that, with no public API, I have to invoke the method once for every page I need, since the URLs represent different pages within the website. That is fine, though, because I was able to get rid of the repeated methods.
def make_listings
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
I once had the same problem as you, and I solved it with a loop. Usually, if a site supports pagination, the first page can also be requested with the page query parameter.
def self.scrape
page = 1
loop do
url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
# do nokogiri parse
# do data scrapping
page += 1
end
end
You can break out of the loop when a certain page condition is met.
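As a rough sketch of what such a break condition could look like, the loop below stops either when a page comes back empty or when a page cap is reached. The names scrape_all, max_pages and fake_pages are hypothetical; the block stands in for the Nokogiri parse-and-scrape step and is assumed to return an array of records for a given page number.

```ruby
# Hypothetical sketch: the block plays the role of "parse and scrape page N"
# and returns an array of records (empty once the site runs out of pages).
def scrape_all(max_pages: 50)
  records = []
  page = 1
  loop do
    break if page > max_pages   # safety cap so the loop cannot run forever
    batch = yield(page)
    break if batch.empty?       # no more data on this page: stop
    records.concat(batch)
    page += 1
  end
  records
end

# Stand-in data instead of a real site (an assumption for this sketch).
fake_pages = { 1 => %w[alpha beta], 2 => %w[gamma] }
all = scrape_all { |page| fake_pages.fetch(page, []) }
puts all.inspect
```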

How to write a while loop properly

I'm trying to scrape a website however I cannot seem to get my while-loop to break out once it hits a page with no more information:
def scrape_verse_items(keyword)
pg = 1
while pg < 1000
puts "page #{pg}"
url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
doc = Nokogiri::HTML(open(url))
items = doc.css("ul.search-result li.reference")
error = doc.css('div#noresults')
until error.any? do
if keyword != ''
item_hash = {}
items.each do |item|
title = item.css("h3").text.strip
content = item.css("p").text.strip
item_hash[title] = content
end
else
puts "Please enter a valid search"
end
if error.any?
break
end
end
pg += 1
end
item_hash
end
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work :
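require 'nokogiri'
require 'open-uri'
class MyScrapper
  def initialize; end
  # Entry point: returns a hash of title => content across all result pages.
  def call(keyword)
    # Reject nil or empty keywords before making any request.
    if keyword.nil? || keyword.empty?
      puts "Please enter a valid search"
      return
    end
    scrape({}, keyword, 1)
  end
  private
  # Recurses page by page until the "no results" marker appears.
  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page + 1))
  end
  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    # URI.open comes from open-uri; Kernel#open no longer accepts URLs in Ruby 3.
    Nokogiri::HTML(URI.open(url))
  end
  # Builds a title => content hash from one page of results.
  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end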
You call it with MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or to make these class methods, to avoid the need to instantiate the class.)
This calls a method named scrape, passing it the starting results, the keyword, and the page. It loads the page and, if there are no results, returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then calls itself and merges the results with the new hash it just built. It does this until there are no more results.
If you want to limit the page results you can just change this line:
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.
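One way to double-check the recursion without making any requests is to run the same merge logic against stubbed pages. This is only a sketch under stated assumptions: scrape_pages and fake_pages are hypothetical names, each page is represented as a title => content hash, and a missing page plays the role of the div#noresults page.

```ruby
# Hypothetical, network-free version of the recursive scrape:
# pages maps a page number to that page's { title => content } hash.
def scrape_pages(pages, results, page, max_page = 999)
  current = pages[page]
  # Base case: no such page (the "no results" page) or the cap was passed.
  return results if current.nil? || page > max_page
  # Recursive case: this page's items merged with everything after it.
  current.merge(scrape_pages(pages, results, page + 1, max_page))
end

fake_pages = {
  1 => { 'Verse A' => 'first' },
  2 => { 'Verse B' => 'second', 'Verse A' => 'repeat' }
}

merged = scrape_pages(fake_pages, {}, 1)
puts merged.inspect
```

Because Hash#merge lets the argument win on duplicate keys, a title that appears on a later page overwrites the same title from an earlier page. Each page also adds a stack frame, so a loop may be the safer choice if the page count can grow very large.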

Ruby PageObject Design for Similar Page Sections

I'm using the Cheezy Page Object gem (which also means I'm using Watir, which also means I'm using Selenium). I also have the watir gem explicitly loaded.
Anyway I have a site I am modeling with the UI written in angular where there is 1 page whose contents change based on dropdown selection. The page has several sections but it is visibly the same for each dropdown choice. The only difference is the xpath locators I am using to get there (there's no unique ID on the sections).
So for example I have an xpath like html/body/div[1]/div/div[1]/div/**green**/div/div[1]
and another like
html/body/div[1]/div/div[1]/div/**red**/div/div[1]
The elements on the sections strangely all have the same ID attribute and same class name. So I've been using xpath for the elements since that appears to make it a unique locator.
Problem is there are currently seven dropdown choices each with several sections like this. And they have visibly same elements and structure (from end user perspective) but when you look at html the only difference is the locator so like this for the elements:
html/body/div[1]/div/div[1]/div/green/div/div[1]/**<element>**
and another like
html/body/div[1]/div/div[1]/div/red/div/div[1]/**<element>**
In my current design I have created one page and a page section for each section on that page. Multiply the number of page sections by the number of dropdown choices and you can see it is a lot. Some of the choices do generate extra elements, but there are still common elements between all sections. I also have to duplicate all of these elements across the seven different pages because the xpath is different. Is there some way for me to pass an initializer to the PageObject page_section, like the type-a or type-b string, and then choose the correct xpath for all elements based on that?
So like if I have text field like so in like a base page object page_section:
text_field(:team, xpath: "...#{type_variable}")
Can I do something like section = SomePageObject.page_section_name(type_variable)?
EDIT: Adding Page Object code per request
class BasePO
include PageObject
#Option S1 Cards
page_section(:options_red_card, OptionRedCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red/div[2]/div[2]/div/div/div/div")
page_section(:options_green_card, OptionGreenCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/green/div[2]/div[2]/div/div/div/div")
page_section(:options_yellow_card, OptionYellowCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/yellow/div[2]/div[2]/div/div/div/div")
#Detail S2 Cards
page_section(:detail_red_card, DetailRedCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red/div[1]/div/div/div")
page_section(:detail_green_card, DetailGreenCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/green/div[1]/div/div/div")
page_section(:detail_yellow_card, DetailYellowCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/yellow/div[1]/div/div/div")
end
EDIT2: Adding page_section content per request. All Option Cards share these elements at a minimum. Different elements in the Detail Cards but same structure as Option Cards.
class OptionRedCard
include PageObject
def field1_limit
text_field_element(xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red-unit/div[2]/div[2]/div/div/div/red/form/div/div/div/table/tbody/tr[2]/td[2]/div/div[1]/div/currency/div/input")
end
def field1_agg
text_field_element(xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red-unit/div[2]/div[2]/div/div/div/red/form/div/div/div/table/tbody/tr[2]/td[2]/div/div[2]/div/currency/div/input")
end
def field2_limit
text_field_element(xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red-unit/div[2]/div[2]/div/div/div/red/form/div/div/div/table/tbody/tr[3]/td[2]/div/div[1]/div/currency/div/input")
end
def field2_agg
text_field_element(xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red-unit/div[2]/div[2]/div/div/div/red/form/div/div/div/table/tbody/tr[3]/td[2]/div/div[2]/div/currency/div/input")
end
def field3_limit
text_field_element(xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red-unit/div[2]/div[2]/div/div/div/red/form/div/div/div/table/tbody/tr[4]/td[2]/div/div[1]/div/currency/div/input")
end
def field3_agg
text_field_element(xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red-unit/div[2]/div[2]/div/div/div/red/form/div/div/div/table/tbody/tr[4]/td[2]/div/div[2]/div/currency/div/input")
end
def field1_agg_value
field1_agg.attribute_value('data-value')
end
def field2_agg_value
field2_agg.attribute_value('data-value')
end
def field3_agg_value
field3_agg.attribute_value('data-value')
end
end
I think the short answer to your question, is no, there is no built-in support for passing a value to the page sections. However, here are some alternatives I can think of.
Option 1 - Use initialize_accessors
Usually the accessors are executed when the class is defined. However, you could use the #initialize_accessors method to defer the execution until the initialization of the page object (or section). This would let you define your accessors in a base class that, at initialization, inserts the color type into the paths:
class BaseCard
include PageObject
def initialize_accessors
# Accessors defined with placeholder for the color type
self.class.text_field(:field1_limit, xpath: "/html/body/some/path/#{color_type}/more/path/input")
end
end
# Each card class would define its color for substitution into the accessors
class OptionRedCard < BaseCard
def color_type
'red'
end
end
class OptionGreenCard < BaseCard
def color_type
'green'
end
end
class BasePO
include PageObject
page_section(:options_red_card, OptionRedCard, xpath: '/html/body/path')
page_section(:options_green_card, OptionGreenCard, xpath: '/html/body/path')
end
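The deferral idea behind Option 1 can be seen in isolation with plain Ruby. The sketch below does not use the page-object gem at all; BaseCardSketch, its subclasses and field1_limit_xpath are hypothetical names, and the path is the same shortened placeholder used above. It only illustrates that a method defined at initialization time can pick up a value supplied by the subclass.

```ruby
# Plain-Ruby illustration only; none of this is the page-object gem's API.
class BaseCardSketch
  def initialize
    color = color_type # supplied by the subclass
    # Define the accessor now, at initialization, rather than when the
    # class body is evaluated, so the subclass's color can be interpolated.
    self.class.send(:define_method, :field1_limit_xpath) do
      "/html/body/some/path/#{color}/more/path/input"
    end
  end
end

class RedCardSketch < BaseCardSketch
  def color_type
    'red'
  end
end

class GreenCardSketch < BaseCardSketch
  def color_type
    'green'
  end
end

red = RedCardSketch.new
green = GreenCardSketch.new
puts red.field1_limit_xpath
puts green.field1_limit_xpath
```

Since self.class is the subclass, each card class ends up with its own copy of the method and the two colors do not interfere with each other.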
Option 2 - Using relative paths
My suggested approach would be to use relative paths such that the color can be removed from the path of the page section. From the objects provided, you might be able to do something like:
class OptionCard
include PageObject
element(:unit) { following_sibling(tag_name: "#{root.tag_name}-unit") }
div(:field1_limit) { unit_element.tr(index: 1).text_field(index: 0) }
div(:field1_agg) { unit_element.tr(index: 1).text_field(index: 1) }
div(:field2_limit) { unit_element.tr(index: 2).text_field(index: 0) }
div(:field2_agg) { unit_element.tr(index: 2).text_field(index: 1) }
end
class BasePO
include PageObject
# Page sections only defined to the top most element of the section (the color element)
page_section(:options_red_card, OptionCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/red")
page_section(:options_green_card, OptionCard, xpath: "/html/body/app-component/app-page/div[2]/div/div/div[1]/div/div/div/ngb-tabset/div/div/green")
end

Refresh Capybara's session DOM contents

I am trying to parse a website using Selenium/Capybara. Right now it looks like this:
session = Capybara::Session.new(:selenium)
session.visit "https://somesite.com/page1"
element = session.all( :css, '.table .row a' ).each do |el|
el.click
# get some element's data
session.evaluate_script('window.history.back()')
end
# repeat
The problem is that when I try to get the data from the second page, Capybara tells me "Either the element is no longer attached to the DOM or the page has been refreshed." That absolutely makes sense, but I'm struggling to find in the documentation how to create a new DOM variable and parse it.
Same thing happens if I navigate back trying to repeat the actions and click on the second link in a row. I assume I need to re-create the session or is there a better way?
To make your current approach work, you need to keep a counter and find the elements again on each pass through the loop, something along the lines of
counter = 0
while (el = session.all( :css, '.table .row a', minimum: 1 )[counter]) do
el.click
# get some element's data
counter += 1
session.go_back
end
or if the links are just standard you could gather the hrefs and then just visit them
session.all( :css, '.table .row a', minimum: 1 ).map { |a| a['href'] }.each do |url|
session.visit(url)
# get some element's data
end

pageobject - when_visible for all elements

I am using a combination of cucumber and pageobject to test my web application. Sometimes, the script tries to click an element even before the page that contains the element starts loading. (I confirmed this by capturing the screenshots of failing scenarios)
This inconsistency is not wide-spread and it happens repeatedly only for a few elements. Instead of directly accessing those elements, if I do example_element.when_visible.click, the test suite always passes.
As of now, I click a link using link_name (generated by the pageobject module on calling link(:name, identifier: {index: 0}, &block)).
I would like to leave the above-mentioned snippet unedited but have it behave as if I had called link_name_element.when_visible.click. The reason is that the test suite is pretty large and it would be tedious to change all the occurrences. I also believe that the functionality is already present and I somehow just don't see it anywhere. Can anybody help me out?
This solution seems quite hacky and may not consider some edge cases. However, I will share it since there are no other answers yet.
You can add the following monkey patch assuming that you are using watir-webdriver. This would be added after you require page-object.
require 'watir-webdriver'
require 'page-object'
module PageObject
module Platforms
module WatirWebDriver
class PageObject
def find_watir_element(the_call, type, identifier, tag_name=nil)
identifier, frame_identifiers, wait = parse_identifiers(identifier, type, tag_name)
the_call, identifier = move_element_to_css_selector(the_call, identifier)
if wait
element = @browser.instance_eval "#{nested_frames(frame_identifiers)}#{the_call}.when_present"
else
element = @browser.instance_eval "#{nested_frames(frame_identifiers)}#{the_call}"
end
switch_to_default_content(frame_identifiers)
type.new(element, :platform => :watir_webdriver)
end
def process_watir_call(the_call, type, identifier, value=nil, tag_name=nil)
identifier, frame_identifiers, wait = parse_identifiers(identifier, type, tag_name)
the_call, identifier = move_element_to_css_selector(the_call, identifier)
if wait
modified_call = the_call.dup.insert(the_call.rindex('.'), '.when_present')
value = @browser.instance_eval "#{nested_frames(frame_identifiers)}#{modified_call}"
else
value = @browser.instance_eval "#{nested_frames(frame_identifiers)}#{the_call}"
end
switch_to_default_content(frame_identifiers)
value
end
def parse_identifiers(identifier, element, tag_name=nil)
wait = identifier.has_key?(:wait) ? false : true
identifier.delete(:wait)
frame_identifiers = identifier.delete(:frame)
identifier = add_tagname_if_needed identifier, tag_name if tag_name
identifier = element.watir_identifier_for identifier
return identifier, frame_identifiers, wait
end
end
end
end
end
Basically, the intent of this patch is that the Watir when_present method is always called. For example, your page object call will get translated to Watir as browser.link.when_present.click. In theory, it should get called for any method called on a page object element.
Unfortunately, there is a catch. There are some situations where you probably do not want to wait for the element to become present. For example, when doing page.link_element.when_not_visible, you would not want to wait for the element to appear before checking that it does not appear. In these cases, you can force the standard behaviour of not waiting by including :wait => false in the element locator:
page.link_element(:wait => false).when_not_visible
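The :wait handling in parse_identifiers can be tried out on plain hashes. The sketch below mirrors that logic outside the gem; split_wait_flag is a hypothetical name, and unlike the patch it works on a copy so the caller's hash is left alone.

```ruby
# Plain-Ruby sketch of the :wait handling in parse_identifiers above.
# As in the patch, it is the presence of the :wait key, not its value,
# that switches waiting off.
def split_wait_flag(identifier)
  identifier = identifier.dup      # work on a copy (the patch mutates in place)
  wait = !identifier.key?(:wait)
  identifier.delete(:wait)         # the locator itself must not contain :wait
  [identifier, wait]
end

p split_wait_flag({ id: 'save' })
p split_wait_flag({ id: 'save', wait: false })
p split_wait_flag({ id: 'save', wait: true })
```

Read this way, passing :wait => true would also disable waiting, so if you adopt the patch you may prefer to check the value of the key instead of its presence.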

Resources