Watir-webdriver is progressing through script before Nokogiri finishes scraping - ruby

There are three forms on the page. All forms default to "today" for their date ranges. Each form is iteratively submitted with a date from a range (1/1/2013 - 1/3/2013, for example) and the resulting table is scraped.
The script then submits the date to the next form in line and, again, the table is scraped. However, the scraping occurs before the dates are submitted.
I tried adding sleep 2 between scrapes, to no avail.
The script is here: https://gist.github.com/hnanon/de4801e460a31d93bbdc

The script appears to assume that Nokogiri and Watir will always be in sync. This is not correct.
When you do:
page = Nokogiri::HTML.parse(browser.html)
Nokogiri parses the browser's HTML at that one specific point in time. If Watir later changes the browser (i.e., changes the HTML), Nokogiri will not know about it.
Each time you want to parse the HTML with Nokogiri, you need to create a new Nokogiri object from the browser's latest HTML.
An example to illustrate:
require 'watir-webdriver'
require 'nokogiri'
b = Watir::Browser.new
b.goto 'www.google.ca'
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will be the Google page
b.goto 'www.yahoo.ca'
p page
#=> This will still be the Google page
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will now be the Yahoo page
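Applied to the original script, that means creating a fresh Nokogiri object after every form submission, once the new table has rendered. A rough sketch only, since the gist isn't reproduced here - the page URL and the field, button, and table identifiers below are all made up:
require 'watir-webdriver'
require 'nokogiri'

browser = Watir::Browser.new
browser.goto 'http://example.com/reports'   # hypothetical page holding the three forms

dates = ['1/1/2013', '1/2/2013', '1/3/2013']

dates.each do |date|
  browser.text_field(id: 'start_date').set(date)    # hypothetical field id
  browser.button(id: 'submit').click                # hypothetical button id
  browser.table(id: 'results').wait_until_present   # wait for the new table to render

  # Re-parse the browser's *current* html before scraping
  page = Nokogiri::HTML.parse(browser.html)
  puts page.css('table#results tr').length
end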

Related

Trying to scrape an image using Nokogiri but it returns a link that I was not expecting

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
title: title,
overview: overview,
poster_url: poster,
}
It has nothing to do with your Ruby code.
If you run in your terminal something like
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
You can see that the output HTML does not have the images you are looking for. You can see them in your browser because, after that initial load, some JavaScript runs and loads more resources.
The ajax call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.
Nokogiri does not execute JavaScript; however, the link has to be there, or at least there has to be a link to some API that returns it.
The first place I would look is the data attributes of the image element or its parent; in this case, however, it was hidden in an inline script along with some other interesting data about the movie.
First download the web page with curl or wget and open the file in a text editor to see what Nokogiri sees. Search for something you know about the page - I searched for the ce7ed2a83f part of the image URL and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
# Grab the raw JSON-LD <script> tag and extract just the {...} JSON object from it
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n",'').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image']
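Alternatively, you could request the AJAX fragment mentioned above and pull the poster's src out of the returned markup. A minimal sketch, assuming the ?k= token observed in the network inspector is still accepted by the endpoint:
require 'nokogiri'
require 'open-uri'

# Hypothetical reuse of the poster endpoint seen in the network inspector
ajax_url = 'https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c'
fragment = Nokogiri::HTML.parse(URI.open(ajax_url).read)
puts fragment.at_css('img')['src']   # the resized film-poster URL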

How can I click on a specific link with Nokogiri or Mechanize?

I know how to find an element using Nokogiri. I know how to click a link using Mechanize. But I can't figure out how to find a specific link and click it. This seems like it should be really easy, but for some reason I can't find a solution.
Let's say I'm just trying to click on the first result on a Google search. I can't just click the first link with Mechanize, because the Google page has a bunch of other links, like Settings. The search result links themselves don't seem to have class names, but they're enveloped in <h3 class="r"></h3>.
I could just use Nokogiri to follow the href value of the link like so:
document = open("https://www.google.com/search?q=stackoverflow")
parsed_content = Nokogiri::HTML(document.read)
href = parsed_content.css('.r').children.first['href']
new_document = open(href)
# href is equal to "/url?sa=t&rct=j&q=&esrc=s&source=web&url=https%3A%2F%2Fstackoverflow.com%2F"
but it's not a direct URL, and going to that URL gives an error. The data-href value is a direct URL, but I can't figure out how to get that value - doing the same thing except with ...first['data-href'] returns nil.
Anyone know how I can just find the first .r element on the page and click the link inside it?
Here's the start to my action:
require 'open-uri'
require 'nokogiri'
require 'mechanize'
document = open("https://www.google.com/search?q=stackoverflow")
parsed_content = Nokogiri::HTML(document.read)
Here's the .r element on the Google search results page:
<h3 class="r">
  <a href="/url?sa=t&rct=j&q=&esrc=s&source=web&url=https%3A%2F%2Fstackoverflow.com%2F">Stack Overflow</a>
</h3>
First, make sure the code in your question is what you actually ran - it looks like it is not, because the URL isn't surrounded in quotes and the CSS selector should be .r a, not r. You use .r a because you want the link inside elements with the r class.
Anyway, you can use the approach detailed here like so:
require 'open-uri'
require 'nokogiri'
require 'uri'
base_url = "https://www.google.com/search?q=stackoverflow"
document = open(base_url)
parsed_content = Nokogiri::HTML(document.read)
href = parsed_content.css('.r').first.children.first['href']
new_url = URI.join base_url, href   # resolve the relative href against the base URL
new_document = open(new_url)
I tested this and following new_url does redirect to StackOverflow as expected.
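If you want to stay entirely within Mechanize, a sketch along the same lines is below. The h3.r a selector comes from the markup shown above, and Mechanize resolves the relative href against the page it just fetched:
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://www.google.com/search?q=stackoverflow')

# Take the first search-result link and follow its (relative) href
first_result = page.search('h3.r a').first
new_page = agent.get(first_result['href'])
puts new_page.uri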

How to mark a certain part of the text in Watir?

Hello, is there a way to mark (select) only a certain part of the text on a page?
I cannot find the right solution anywhere.
I tried double_click, flash, and select_text, but they didn't work for me.
This works, but it marks everything: browser.send_keys [:control, 'a']
I added a picture as an example of what I want to do (the red rectangle shows the selection).
Thank you for your answers.
You can use the Element#select_text method. Note that prior to Watir 6.8, you will need to manually include the extension that adds this method.
Here is a working example using the Wikipedia page:
require 'watir'
require 'watir/extensions/select_text' # only include this if using Watir 6.7 or prior
browser = Watir::Browser.new
browser.goto('https://en.wikipedia.org/wiki/XPath')
browser.body.select_text('XPath may be used')
sleep(5) # so that you can see the selection
Note that this will highlight the first match. You may want to restrict the search to a specific element rather than the entire body, as sketched below.
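For instance, a one-line sketch scoping the call to the article body (assuming Wikipedia's usual mw-content-text container):
browser.div(id: 'mw-content-text').select_text('XPath may be used')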
Here is another example using ckeditor.com:
require 'watir'
require 'watir/extensions/select_text' # only include this if using Watir 6.7 or prior
browser = Watir::Browser.new
browser.goto('ckeditor.com/')
frame = browser.iframe(class: 'cke_wysiwyg_frame')
frame.p.select_text('Bake the brownies')
browser.link(href: /Bold/).click
sleep(10)

How do I select a table row with Nokogiri?

I'm having trouble scraping the rows from "List of Nobel laureates" in Nokogiri.
I believe my CSS selector is correct, but it's returning empty.
The original tutorial is "Writing a Web Crawler".
require 'rubygems'
require 'nokogiri'
require 'open-uri'
BASE_WIKIPEDIA_URL = 'http://en.wikipedia.org/'
LIST_URL = "#{BASE_WIKIPEDIA_URL}/wiki/List_of_Nobel_laureates"
page = Nokogiri::HTML(open(LIST_URL))
rows = page.css('div#content.mw-body div#bodyContent div#mw-content-text.mw-content-ltr table.wikitable.sortable.jquery-tablesorter tr')
puts "length : #{rows.size}"
There are two problems:
You have a double slash in the URL you are building, so you're not actually looking at the page you think you're looking at. This is the URL you are using: http://en.wikipedia.org//wiki/List_of_Nobel_laureates; if you follow that link, you'll see that it redirects to the Wikipedia homepage.
Your CSS selector is far too specific and includes some information that won't be present in the raw page source. You should try a simpler selector:
rows = page.css('table.wikitable tr')
Specifically you are including the jquery-tablesorter class in your selector. This class is added by JavaScript, but the tools you're using don't execute the page's JavaScript, so the class won't be present and you can't use it to find table rows.
If you use "view source" instead of your browser's DOM inspector, you will see the raw source code without any JavaScript applied.
I can see that you're expecting to find a table with the class jquery-tablesorter. That's because you're inspecting the table in your browser, where it has that class. The problem is that jQuery adds that class after the page loads; since open-uri doesn't process JavaScript, that class never gets added to the table that Nokogiri sees.
Long story short, you probably want to go with just:
page.css('table.wikitable tr')
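Putting the two fixes together, a corrected sketch of the original script would be:
require 'nokogiri'
require 'open-uri'

BASE_WIKIPEDIA_URL = 'http://en.wikipedia.org'   # no trailing slash
LIST_URL = "#{BASE_WIKIPEDIA_URL}/wiki/List_of_Nobel_laureates"

page = Nokogiri::HTML(URI.open(LIST_URL))        # plain open also works on older Rubies
rows = page.css('table.wikitable tr')            # drop the JavaScript-added classes
puts "length : #{rows.size}"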

HTML is read before fully loaded using open-uri and nokogiri

I'm using open-uri and nokogiri with ruby to do some simple webcrawling.
There's one problem: sometimes the HTML is read before the page has fully loaded. In such cases, I cannot fetch any content other than the loading icon and the nav bar.
What is the best way to tell open-uri or Nokogiri to wait until the page is fully loaded?
Currently my script looks like:
require 'nokogiri'
require 'open-uri'
url = "https://www.the-page-i-wanna-crawl.com"
doc = Nokogiri::HTML(open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text
What you describe is not possible. The result of open is only passed to Nokogiri::HTML after the open method has returned the full response body.
I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case, you can use Watir to fetch the page with a real browser:
require 'nokogiri'
require 'watir'
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'
doc = Nokogiri::HTML.parse(browser.html)
This might open a browser window though.
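If the data you need appears only after some JavaScript runs, you can also wait for a specific element before handing the HTML to Nokogiri. A hedged sketch, assuming the content you want ends up in an h2 (adjust the element to match your page):
require 'nokogiri'
require 'watir'

browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'
browser.h2.wait_until(&:present?)   # wait for the AJAX-loaded content to appear (Watir 6+)
doc = Nokogiri::HTML.parse(browser.html)
puts doc.at_css('h2').text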
