I'm using open-uri and Nokogiri with Ruby to do some simple web crawling.
The problem is that the HTML is sometimes read before the page has fully loaded. In such cases, I cannot fetch any content other than the loading icon and the nav bar.
What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?
Currently my script looks like:
require 'nokogiri'
require 'open-uri'
url = "https://www.the-page-i-wanna-crawl.com"
doc = Nokogiri::HTML(URI.open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text
What you describe is not possible. The result of open is only passed to HTML after the open method has returned the full response.
I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case, you can use Watir to fetch the page through a real browser:
require 'nokogiri'
require 'watir'
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'
doc = Nokogiri::HTML.parse(browser.html)
This might open a browser window though.
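Even with a browser, AJAX content may arrive a moment after the page itself, so the HTML can still be incomplete right after goto. Here is a minimal sketch of polling until the expected content shows up. wait_for_html is a hypothetical helper, not part of Watir or Nokogiri, and the fake source below only stands in for a real browser:

```ruby
# Hypothetical helper (not part of Watir or Nokogiri): polls a callable that
# returns the current HTML until the block accepts it or the timeout expires.
def wait_for_html(source, timeout: 10, interval: 0.25)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    html = source.call
    return html if yield(html)
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "content did not appear within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for a browser whose page finishes loading on the third read.
snapshots = ['<p>Loading...</p>', '<p>Loading...</p>', '<h2>Loaded title</h2>']
fake_source = -> { snapshots.size > 1 ? snapshots.shift : snapshots.first }

html = wait_for_html(fake_source, timeout: 2, interval: 0.01) { |h| h.include?('<h2>') }
puts html
```

With Watir you would pass -> { browser.html } as the source and hand the returned string to Nokogiri::HTML.parse.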
Related
I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
title: title,
overview: overview,
poster_url: poster,
}
It has nothing to do with your ruby code.
If you run in your terminal something like
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
You can see that the output HTML does not have the images you are looking for. You can see them in your browser because, after that initial load, some JavaScript runs and loads more resources.
The AJAX call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with the network inspector of your browser and you will be able to identify the different parts of the website and how each one loads.
Nokogiri does not execute JavaScript. However, the link has to be somewhere in the page, or at least there has to be a link to some API that returns it.
The first place I would look is the data attributes of the image element or its parent. In this case, however, it was hidden in an inline script along with some other interesting data about the movie.
First, download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Then search for something you know should be in the file. I searched for ce7ed2a83f, part of the image URL, and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n",'').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image']
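The regex step is there because the script element may contain more than the bare JSON object. Here is a small sketch of just that step using only the standard library. The script text, the comment wrappers around it, and the example.com URL are made-up stand-ins, not what the site actually returns:

```ruby
require 'json'

# Hypothetical stand-in for the text of a <script type="application/ld+json">
# element; the comment wrappers around the JSON are an assumption.
script_text = <<~SCRIPT
  /* <![CDATA[ */
  {"@type": "Movie", "name": "Example Film",
   "image": "https://example.com/poster.jpg"}
  /* ]]> */
SCRIPT

# Take everything from the first "{" to the last "}" (the /m flag lets "."
# match newlines), then parse that span as JSON.
json_str = script_text[/\{.*\}/m]
data = JSON.parse(json_str)

puts data['image']
```

With Nokogiri, the input would be the text of the matched script node rather than a hard-coded string.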
I am new to Ruby, and we are using a Ruby Selenium framework to automate PDF verification testing.
I want to verify the content of PDF, like text and also get the position of the text. Along with that I also need to get the text at a given position.
Something like this maybe
require 'pdf-reader'
require 'open-uri'
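reader = PDF::Reader.new(URI.open("SAMPLE_URL")) # my resume pdf
page = reader.pages.first
lines = page.text.split("\n") # Page#text returns the page's text as a String
# Collect the indexes of the lines that contain the search text
text_match_line_numbers = (0...lines.length).select do |i|
  lines[i].include? "text"
end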
See the pdf-reader documentation for more advanced options for navigating a PDF page.
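For the "position of the text" and "text at a given position" parts, here is a rough sketch that works on the extracted text only. Both helpers are hypothetical, and "position" here means line and column within the string that pdf-reader's page.text returns, not coordinates on the PDF page:

```ruby
# Hypothetical helper: returns [line_number, column] pairs (both zero-based)
# for every line of +text+ that contains +needle+.
def find_text_positions(text, needle)
  text.split("\n").each_with_index.filter_map do |line, line_number|
    column = line.index(needle)
    [line_number, column] if column
  end
end

# Hypothetical helper: returns +length+ characters starting at a position,
# or nil if the line does not exist.
def text_at(text, line_number, column, length)
  line = text.split("\n")[line_number]
  line && line[column, length]
end

# Stand-in for the string returned by page.text.
sample = "Jane Doe\nSoftware Engineer\nSkills: Ruby, Selenium"

p find_text_positions(sample, 'Ruby') # => [[2, 8]]
p text_at(sample, 1, 9, 8)            # => "Engineer"
```

This needs Ruby 2.7 or later for filter_map.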
I'm having trouble scraping the rows from "List of Nobel laureates" in Nokogiri.
I believe my CSS selector is correct, but it's returning empty.
The original tutorial is "Writing a Web Crawler".
require 'rubygems'
require 'nokogiri'
require 'open-uri'
BASE_WIKIPEDIA_URL = 'http://en.wikipedia.org/'
LIST_URL = "#{BASE_WIKIPEDIA_URL}/wiki/List_of_Nobel_laureates"
page = Nokogiri::HTML(open(LIST_URL))
rows = page.css('div#content.mw-body div#bodyContent div#mw-content-text.mw-content-ltr table.wikitable.sortable.jquery-tablesorter tr')
puts "length : #{rows.size}"
There are two problems:
You have a double slash in the URL you are building, so you're not actually looking at the page you think you're looking at. This is the URL you are using: http://en.wikipedia.org//wiki/List_of_Nobel_laureates. If you follow the link, you'll see that it redirects to the Wikipedia homepage.
Your CSS selector is far too specific and includes some information that won't be present in the raw page source. You should try a simpler selector:
rows = page.css('table.wikitable tr')
Specifically you are including the jquery-tablesorter class in your selector. This class is added by JavaScript, but the tools you're using don't execute the page's JavaScript, so the class won't be present and you can't use it to find table rows.
If you use "view source" instead of your browser's DOM inspector tool, you will see the raw source code without any JavaScript applied.
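As for the double slash in the first problem, one way to avoid it is to let URI.join combine the base and the path instead of interpolating strings. A small sketch using the constant name from the question:

```ruby
require 'uri'

BASE_WIKIPEDIA_URL = 'http://en.wikipedia.org/'

# String interpolation keeps both slashes: one from the base, one from the path.
interpolated = "#{BASE_WIKIPEDIA_URL}/wiki/List_of_Nobel_laureates"

# URI.join resolves the path against the base, so exactly one slash remains.
joined = URI.join(BASE_WIKIPEDIA_URL, 'wiki/List_of_Nobel_laureates').to_s

puts interpolated # http://en.wikipedia.org//wiki/List_of_Nobel_laureates
puts joined       # http://en.wikipedia.org/wiki/List_of_Nobel_laureates
```

Simply dropping the trailing slash from the base string works too; URI.join just makes the result independent of how the base was written.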
I can see that you're expecting a table with the class jquery-tablesorter. That's because you're inspecting the table in your browser, where it has that class. The problem is that jQuery adds that class after the page loads. Since open-uri doesn't process JavaScript, that class never gets added to the table that Nokogiri sees.
Long story short, you probably want to go with just:
page.css('table.wikitable tr')
I am trying to write code that would take a user-input URL and print that page's HTML content to the screen. The goal is to eventually create a script that will organize all of the text on the page.
How would I edit the code below to accomplish retrieving the HTML?
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com').search("p.posted")
print ""
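Sounds like you want the .body method. Call it on the page object returned by agent.get, before .search narrows it down to a node set:
page = agent.get('http://google.com')
puts page.body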
There are three forms on the page. All forms default to "today" for their date ranges. Each form is iteratively submitted with a date from a range (1/1/2013 - 1/3/2013, for example) and the resulting table is scraped.
The script then submits the date to the next form in line, and again the table is scraped. However, the scraping is occurring before the dates are submitted.
I tried adding sleep 2 in between scrapes to no avail.
The script is here: https://gist.github.com/hnanon/de4801e460a31d93bbdc
The script appears to assume that Nokogiri and Watir will always be in sync. This is not correct.
When you do:
page = Nokogiri::HTML.parse(browser.html)
Nokogiri gets the browser HTML at that one specific point in time. If Watir makes a change to the browser (i.e. changes the HTML), Nokogiri will not know about it.
Each time you want to parse the html with Nokogiri, you need to create a new Nokogiri object using the browser's latest html.
An example to illustrate:
require 'watir-webdriver'
require 'nokogiri'
b = Watir::Browser.new
b.goto 'www.google.ca'
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will be the Google page
b.goto 'www.yahoo.ca'
p page
#=> This will still be the Google page
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will now be the Yahoo page