I am working on a JavaScript-capable screen scraper using capybara/dsl, Selenium WebDriver, and the spreadsheet gem. I am very close to the desired output; however, two major problems arise:
I have not been able to figure out the exact XPath selector to filter out only the elements I'm looking for. To ensure that none are missing, I am using a broad selector that I know will produce duplicate elements. I was planning on just calling .uniq on the result, but this throws an undefined method error for 'uniq'. What is the proper way to get the desired filtering? Maybe I'm not using it properly: results = all("//a[contains(@onclick, 'analyticsLog')]").uniq. I know that the XPath I have chosen to extract hrefs, //a[contains(@onclick, 'analyticsLog')], matches more nodes than I intended, because using find to inspect the page elements shows 144 rather than the 72 that make up the page results. I have looked for a more specific selector, but I haven't been able to find one that doesn't filter out some desired links, due to the business logic used on the site.
My save_item method has two selectors that are not always found within the info results. I would like the script to skip those that aren't found and save only the ones that are; however, my current iteration throws a Capybara::ElementNotFound and exits. How can I configure this to work in the intended way?
Code below:
require "capybara/dsl"
require "spreadsheet"
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
  include Capybara::DSL

  def initialize
    @excel = Spreadsheet::Workbook.new
    @work_list = @excel.create_worksheet
    @row = 0
  end

  def go
    visit_main_link
  end

  def visit_main_link
    visit "http://www.some.com/clothing-accessories?dir=asc&limit=72&order=position"
    results = all("//a[contains(@onclick, 'analyticsLog')]") # I would like to use .uniq here to filter out the duplicates that I know will be delivered by this selector
    item = []
    results.each do |a|
      item << a[:href]
    end
    item.each do |link|
      visit link
      save_item
    end
    @excel.write "inventory.csv"
  end

  def save_item
    data = all("//*[@id='content-wrapper']/div[2]/div/div")
    data.each do |info|
      @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
      @work_list[@row, 1] = info.find("//div[contains(@class, 'price font left')]").text
      @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
      @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
      @work_list[@row, 4] = info.find("//select[contains(@name, 'options[747]')]//*[@price='0']").text # I'm aware that this will not always be found depending on the item in question, but how do I ensure that it doesn't crash the program?
      @work_list[@row, 5] = info.find("//select[contains(@name, 'options[748]')]//*[@price='0']").text # I'm aware that this will not always be found depending on the item in question, but how do I ensure that it doesn't crash the program?
      @row = @row + 1
    end
  end
end
tomtop = Tomtop.new
tomtop.go
For Question 1: Get unique elements
All of the elements returned by all are unique. Therefore, I assume by "unique" elements, you mean that the "onclick" attribute is unique.
The collection of elements returned by Capybara is an enumerable. Therefore, you can convert it to an array and then take the unique elements based on their onclick attribute:
results = all("//a[contains(@onclick, 'analyticsLog')]")
  .to_a.uniq { |e| e[:onclick] }
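As a minimal sketch of how uniq with a block behaves, the snippet below uses a plain Ruby class in place of Capybara nodes. The FakeLink class and its data are hypothetical stand-ins, not part of Capybara; only the [] attribute lookup is imitated. uniq keeps the first element seen for each distinct block value.

```ruby
# Hypothetical stand-in for a Capybara node; only attribute lookup via [] is imitated.
class FakeLink
  def initialize(attrs)
    @attrs = attrs
  end

  def [](name)
    @attrs[name]
  end
end

links = [
  FakeLink.new(onclick: "analyticsLog(1)", href: "/item/1"), # e.g. the image link
  FakeLink.new(onclick: "analyticsLog(1)", href: "/item/1"), # e.g. the text link, same target
  FakeLink.new(onclick: "analyticsLog(2)", href: "/item/2")
]

# uniq with a block compares the block's return value, not the objects themselves.
unique_links = links.uniq { |e| e[:onclick] }
unique_hrefs = unique_links.map { |e| e[:href] }
```

If the duplicates differ in their onclick value but share a target, the same call could key on e[:href] instead.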
Note that it looks like the duplicate links are due to one for the image and one for the text below the image. You could scope your search to just one or the other and then you would not need to do the uniq check. To scope to just the text link, use the fact that the link is a child of an h5:
results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
For Question 2: Capture text if element present
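To solve your second problem, you could use first to locate the element. This will return the matching element if one exists and nil if one does not. (In Capybara 3 and later, first raises Capybara::ElementNotFound by default when nothing matches; passing minimum: 0 restores the nil behaviour.) You could then save the text only if the element is found.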
For example:
el = info.first(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = el.text if el
If you want the text of all matching elements, then use all:
options = info.all(".//select[contains(@name, 'options[747]')]//*[@price='0']")
@work_list[@row, 4] = options.collect(&:text).join(', ')
When there are multiple matching options, you will get something like "Green, Pink". If there are no matching options, you will get "".
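The "skip it when it is missing" pattern can be wrapped in a small helper. The sketch below is an illustration under assumptions: FakeNode and FakeOption are hypothetical stand-ins for Capybara objects, optional_text is an invented helper name, and the selector strings are only used as hash keys here. It relies solely on all returning an enumerable whose first is nil when nothing matched.

```ruby
# Hypothetical stand-in for a Capybara node: `all` returns the matches for a selector.
class FakeNode
  def initialize(matches)
    @matches = matches
  end

  def all(selector)
    @matches.fetch(selector, [])
  end
end

# Hypothetical stand-in for a matched element that exposes its text.
FakeOption = Struct.new(:text)

# Returns the text of the first match, or nil when nothing matches.
def optional_text(node, selector)
  match = node.all(selector).first
  match && match.text
end

selector = ".//option[@price='0']"
node = FakeNode.new(selector => [FakeOption.new("Green"), FakeOption.new("Pink")])

found   = optional_text(node, selector)                 # first match's text
missing = optional_text(node, ".//option[@price='9']")  # nothing matches
joined  = node.all(selector).map(&:text).join(', ')     # all matches in one cell
```

Assigning a nil result to a worksheet cell simply leaves that cell empty, so the row is still written for items that lack the option.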
Related
Hi I am just doing a bit of refactoring on a small cli web scraping project I did in Ruby and I was simply wondering if there was cleaner way to write a particular section without repeating the code too much.
Basically with the code below, I pulled data from a website but I had to do this per page. You will notice that both methods are only different by their name and the source.
def self.scrape_first_page
  html = open("https://www.texasblackpages.com/united-states/san-antonio")
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end

def self.scrape_second_page
  html = open('https://www.texasblackpages.com/united-states/san-antonio?page=2')
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
Is there a way for me to streamline this process with just one method pulling from one source, but with the ability to access different pages within the same site, or is this pretty much the best and only way? The owners of the website do not have a public API for me to pull from, in case anyone is wondering.
Remember that in programming you want to steer towards code that follows the Zero, One or Infinity Rule: avoid the dreaded two. In other words, write methods that take no arguments, fixed arguments (one), or an array of unspecified size (infinity).
So the first step is to clean up the scraping function to make it as generic as possible:
require 'nokogiri'
require 'open-uri'

def scrape(page)
  # URI.open needs Ruby 2.5+; older versions used Kernel#open for URLs
  doc = Nokogiri::HTML(URI.open(page))
  # Use map here to return an array of Business objects
  doc.css('div.grid_element').map do |business|
    Business.new.tap do |biz|
      # Use tap to modify this object before returning it
      biz.name = business.css('a b').text
      biz.type = business.css('span.hidden-xs').text
      biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
    end
  end
end
Note that apart from the extraction code, there's nothing specific about this. Takes a URL, returns Business objects in an Array.
In order to generate pages 1..N, consider this:
def pages(base_url, start: 1)
  page = start
  Enumerator.new do |y|
    loop do
      y << base_url % page
      page += 1
    end
  end
end
Now that's an infinite series, but you can always cap it to whatever you want with take(n) or by instead looping until you get an empty list:
# Collect all businesses from each of the pages...
businesses = pages('https://www.texasblackpages.com/united-states/san-antonio?page=%d').lazy.map do |page|
  # ...by scraping the page...
  scrape(page)
end.take_while do |results|
  # ...and iterating until there are no results, as in Array#any? is false.
  results.any?
end.to_a.flatten
The .lazy part means "evaluate each part of the chain sequentially" as opposed to the default behaviour of trying to evaluate each stage to completion. This is important or else it will try and download an infinite number of pages before moving to the next test.
The .to_a on the end forces that chain to run to completion. The .flatten squishes all the page-wise results into a single result set.
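To see the lazy chain stop on its own without touching the network, here is a minimal sketch that swaps the real scraper for an in-memory lookup. FAKE_SITE, fetched, and fake_scrape are hypothetical names introduced only for this illustration.

```ruby
# Hypothetical in-memory "site": page number => results. Page 4 and beyond are empty.
FAKE_SITE = { 1 => %w[a b], 2 => %w[c], 3 => %w[d e] }.freeze

fetched = [] # records which pages were actually requested

fake_scrape = lambda do |page|
  fetched << page
  FAKE_SITE.fetch(page, [])
end

# An infinite series of page numbers, evaluated one element at a time.
results = (1..Float::INFINITY).lazy
  .map { |page| fake_scrape.call(page) }
  .take_while { |items| items.any? }
  .to_a
  .flatten
```

Only the first empty page is requested before the chain stops; dropping the .lazy call would make map try to consume the infinite range and never return.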
Of course if you want to scrape the first N pages, it's a lot easier:
pages('https://www.texasblackpages.com/.../san-antonio?page=%d').take(n).flat_map do |page|
  scrape(page)
end
It's almost no code!
This was suggested by @Todd A. Jacobs
def self.scrape(url)
  html = open(url)
  doc = Nokogiri::HTML(html)
  doc.css('div.grid_element').each do |business|
    biz = Business.new
    biz.name = business.css('a b').text
    biz.type = business.css('span.hidden-xs').text
    biz.number = business.css('span.sm-block.lmargin.sm-nomargin').text.gsub("\r\n","").strip
  end
end
The downside is that, with no public API, I have to invoke the method as many times as I need it, since the URLs represent different pages within the website. This is fine, though, because I was able to get rid of the repeated methods.
def make_listings
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=2")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=3")
  Scraper.scrape("https://www.texasblackpages.com/united-states/san-antonio?page=4")
end
I once had the same problem as you; I used a loop, though. Usually, if the site supports pagination, there is a good chance the first page can also be requested with the page query param.
def self.scrape
  page = 1
  loop do
    url = "https://www.texasblackpages.com/united-states/san-antonio?page=#{page}"
    # do nokogiri parse
    # do data scraping
    page += 1
  end
end
You can break out of the loop on a certain page condition.
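One possible shape for that break condition is sketched below, assuming "the page returned no items" is the stop signal. scrape_all, fetch_page, and the max_pages cap are hypothetical names; fetch_page stands for whatever does the download and parsing and returns an array of items.

```ruby
# Walks pages 1, 2, 3, ... and stops at the first page that yields no items.
# `fetch_page` is a hypothetical callable: page number in, array of items out.
def scrape_all(fetch_page, max_pages: 1000)
  items = []
  page = 1
  loop do
    break if page > max_pages       # safety cap in case the site never runs dry
    batch = fetch_page.call(page)
    break if batch.empty?           # no more results: stop paging
    items.concat(batch)
    page += 1
  end
  items
end

# In-memory stand-in for the site, so the loop can be exercised offline.
fake_pages = { 1 => ["Biz A", "Biz B"], 2 => ["Biz C"] }
fetcher = ->(page) { fake_pages.fetch(page, []) }

all_items    = scrape_all(fetcher)
capped_items = scrape_all(fetcher, max_pages: 1)
```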
I'm trying to scrape a website however I cannot seem to get my while-loop to break out once it hits a page with no more information:
def scrape_verse_items(keyword)
  pg = 1
  while pg < 1000
    puts "page #{pg}"
    url = "https://www.bible.com/search/bible?page=#{pg}&q=#{keyword}&version_id=1"
    doc = Nokogiri::HTML(open(url))
    items = doc.css("ul.search-result li.reference")
    error = doc.css('div#noresults')
    until error.any? do
      if keyword != ''
        item_hash = {}
        items.each do |item|
          title = item.css("h3").text.strip
          content = item.css("p").text.strip
          item_hash[title] = content
        end
      else
        puts "Please enter a valid search"
      end
      if error.any?
        break
      end
    end
    pg += 1
  end
  item_hash
end
puts scrape_verse_items('joy')
I know this doesn't exactly answer your question, but perhaps you might consider using a different approach altogether.
Using while and until loops can get a bit confusing, and usually isn't the most performant way of doing things.
Maybe you would consider using recursion instead.
I've written a small script that seems to work :
require "nokogiri"
require "open-uri"

class MyScrapper
  def initialize; end

  def call(keyword)
    # Guard against a missing or empty keyword before hitting the site
    if keyword.to_s.empty?
      puts "Please enter a valid search"
      return
    end
    scrape({}, keyword, 1)
  end

  private

  def scrape(results, keyword, page)
    doc = load_page(keyword, page)
    return results if doc.css('div#noresults').any?
    build_new_items(doc).merge(scrape(results, keyword, page + 1))
  end

  def load_page(keyword, page)
    url = "https://www.bible.com/search/bible?page=#{page}&q=#{keyword}&version_id=1"
    # URI.open needs Ruby 2.5+; older versions used Kernel#open for URLs
    Nokogiri::HTML(URI.open(url))
  end

  def build_new_items(doc)
    items = doc.css("ul.search-result li.reference")
    items.reduce({}) do |list, item|
      title = item.css("h3").text.strip
      content = item.css("p").text.strip
      list[title] = content
      list
    end
  end
end
You call it by doing MyScrapper.new.call("Keyword"). (It might make more sense to have this as a module you include, or even to make these class methods, to avoid the need to instantiate the class.)
What this does is call a method named scrape, giving it the starting results, the keyword, and the page. It loads the page, and if there are no results it returns the results it has found so far.
Otherwise it builds a hash from the page it loaded, then the method calls itself and merges those results with the new hash it just built. It does this until there are no more results.
If you want to limit the page results, you can just change this line:
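The recursion can be tried offline with in-memory pages in place of parsed documents. This is only a sketch under that assumption: FAKE_RESULTS, collect, and the placeholder strings are hypothetical, and an empty page plays the role of the "no results" marker.

```ruby
# Hypothetical in-memory pages standing in for parsed search results.
FAKE_RESULTS = {
  1 => { "Reference A" => "placeholder text one" },
  2 => { "Reference B" => "placeholder text two" }
}.freeze

# Recursively collects results until a page comes back empty or the cap is passed.
def collect(page = 1, max_page = 999)
  current = FAKE_RESULTS.fetch(page, {})
  return {} if current.empty? || page > max_page
  # Merge this page's hash with everything gathered from the pages after it.
  current.merge(collect(page + 1, max_page))
end

verses     = collect
first_only = collect(1, 1)
```

Each page adds one stack frame, so a page cap also bounds the recursion depth. With Hash#merge, a title that appears on a later page replaces the earlier entry with the same title.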
return results if doc.css('div#noresults').any?
to this:
return results if doc.css('div#noresults').any? || page > 999
Note: You might want to double-check the results that are being returned are correct. I think they should be but I wrote this quite quickly, so there could always be a small bug hiding somewhere in there.
I am trying to parse a website using Selenium/Capybara. Right now it looks like this:
session = Capybara::Session.new(:selenium)
session.visit "https://somesite.com/page1"
element = session.all( :css, '.table .row a' ).each do |el|
  el.click
  # get some element's data
  session.evaluate_script('window.history.back()')
end
# repeat
The problem is that when I try to get the data from the second page, Capybara tells me: "Either the element is no longer attached to the DOM or the page has been refreshed." That absolutely makes sense; however, I'm struggling to find in the documentation a way to create a new DOM variable and parse it.
Same thing happens if I navigate back trying to repeat the actions and click on the second link in a row. I assume I need to re-create the session or is there a better way?
To work like you're trying you're going to need to keep a counter and find the elements each time through your loop - something along the lines of
counter = 0
while (el = session.all( :css, '.table .row a', minimum: 1 )[counter]) do
  el.click
  # get some element's data
  counter += 1
  session.go_back
end
or if the links are just standard you could gather the hrefs and then just visit them
session.all( :css, '.table .row a', minimum: 1 ).map { |a| a['href'] }.each do |url|
  session.visit(url)
  # get some element's data
end
I've got an array of certain <a> elements like this:
array = browser.as(:class => 'foo')
I try to go to the links like this:
$i = 0
while $i < num do
  browser.goto array[$i].href
  .
  .
  .
  $i += 1
end
This works for the first loop, but not for the second. Why is this happening? If I do
puts array[1].href
puts array[2].href
puts array[-1].href
before
browser.goto array[$i].href
it shows all the links on the first loop in terminal window.
Someone who knows this better than I do will need to verify / clarify. My understanding is that elements are given references when the page is loaded. On loading a new page you are building all new references, even if it looks like the same link on each page.
You are storing a reference to that element and then asking it to go ahead and pull the href attribute from it. It works the first time, as the element still exists. Once the new page has loaded it no longer exists. The request to pull the attribute fails.
If all you care about are the hrefs, there's a shorter, more idiomatic Ruby way of doing it.
array = []
browser.as(:class => 'foo').each { |link|
  array << link.href
}
array.each { |link|
  browser.goto link
}
By cramming the links into the array we don't have to worry about stale references.
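The difference between holding element references and holding plain strings can be simulated without a browser. The classes below are hypothetical and are not the Watir or Selenium API; they only imitate the idea that every page load invalidates references handed out earlier.

```ruby
# Simulation only: these classes are hypothetical, not the Watir or Selenium API.
class StaleReferenceError < StandardError; end

class FakeBrowser
  attr_reader :generation

  def initialize(hrefs)
    @hrefs = hrefs
    @generation = 0
  end

  # Each call hands out references tied to the currently loaded page.
  def links
    @hrefs.map { |href| FakeLink.new(self, href) }
  end

  # Loading a page invalidates every previously issued reference.
  def goto(_url)
    @generation += 1
  end
end

class FakeLink
  def initialize(browser, href)
    @browser = browser
    @href = href
    @born = browser.generation
  end

  def href
    raise StaleReferenceError, "page changed" unless @born == @browser.generation
    @href
  end
end

browser = FakeBrowser.new(["/a", "/b"])

# Holding element references across a navigation fails on the second read.
stale = browser.links
browser.goto(stale[0].href)
second_read = begin
  stale[1].href
rescue StaleReferenceError
  :stale
end

# Copying the plain strings first keeps them valid after any number of navigations.
urls = browser.links.map(&:href)
urls.each { |url| browser.goto(url) }
```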
Yes, unlike with individual elements that store the locator and will re-look them up on page reloads, an element collection stores only the Selenium Element which can't be re-looked up.
Another solution for when you need this pattern and more than just an href value is not to create a collection and to use individual elements with an index:
num.times do |i|
  link = browser.link(class: 'foo', index: i)
  browser.goto link.href
end
In my Rails controller code I would like to randomly retrieve three of each content:
@content = Content.includes(:author).find(params[:id])
content_sub_categories = @content.subcategories
related_content = []
content_sub_categories.each do |sub_cat|
  related_content << sub_cat.contents
end
@related_content = related_content.rand.limit(3)
rand.limit(3) isn't working, and the errors include:
undefined method `limit' for #<Array:0x007f9e19806bf0>
I'm familiar with Rails but still in the process of learning Ruby. Any help would be incredibly appreciated.
Perhaps it is because I am also rendering the content this way: <%= @related_content %>?
I'm using:
Rails 3.2.14
Ruby 1.9.3
limit is a method on ActiveRecord relations (it adds LIMIT X to the generated SQL). However, you have an array, not a relation, hence the error.
The equivalent array method is take. You can of course combine both the shuffling and the limit into one step by using the sample method.
If you want to pick 3 random elements, use Array#sample:
related_content.sample(3)
This should work:
related_content = []
content_sub_categories.each do |sub_cat|
  related_content << sub_cat.contents.sample(3) # add 3 random elements
end
@related_content = related_content
Or without temporary variables using map:
@related_content = @content.subcategories.map { |cat| cat.contents.sample(3) }
Note that @related_content is an array of (3-element) arrays.
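If three items overall are wanted rather than three per subcategory, one option is to flatten first and sample once. The sketch below uses plain arrays of strings as hypothetical stand-ins for each subcategory's contents; no ActiveRecord is involved.

```ruby
# Hypothetical data: each inner array plays the role of one subcategory's contents.
contents_by_subcategory = [
  %w[post1 post2 post3 post4],
  %w[post3 post5],
  %w[post6]
]

# Three per subcategory: an array of arrays (shorter lists yield fewer items).
per_category = contents_by_subcategory.map { |contents| contents.sample(3) }

# Three overall: flatten, drop items shared between subcategories, then sample once.
overall = contents_by_subcategory.flatten.uniq.sample(3)
```

Array#sample never returns the same position twice, so the uniq call is only needed because an item can belong to more than one subcategory.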
How about this?
a = (1..10).to_a
p a.sample(3)
# >> [4, 10, 7]
Here's the final answer for finding the content's subcategories, collecting all of those subcategories' contents, and displaying the content without repeats:
def show
  @content = Content.includes(:author).find(params[:id])
  related_content = @content.subcategories.pluck(:id)
  @related_content = Content.joins(:subcategories).published.order('random()').limit(3).where(subcategories: { id: related_content}).where('"contents"."id" <> ?', @content.id)
end