How do I retrieve HTML for content? - ruby

I am trying to write code that would take a user-input URL and print that page's HTML content to the screen. The goal is to eventually create a script that will organize all of the text on the page.
How would I edit the code below to accomplish retrieving the HTML?
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com').search("p.posted")
print ""

Sounds like you want the .body method. Note that in your snippet page holds the result of .search, which is a Nokogiri::XML::NodeSet, not a page; drop the .search("p.posted") call so that agent.get returns a Mechanize::Page, then:
puts page.body

Related

Trying to scrape an image using Nokogiri but it returns a link that I was not expecting

I'm doing a scraping exercise and trying to scrape the poster from a website using Nokogiri.
This is the link that I want to get:
https://a.ltrbxd.com/resized/film-poster/5/8/6/7/2/3/586723-glass-onion-a-knives-out-mystery-0-460-0-690-crop.jpg?v=ce7ed2a83f
But instead I got this:
https://s.ltrbxd.com/static/img/empty-poster-500.825678f0.png
Why?
This is what I tried:
url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
title = html.search('.headline-1').text.strip
overview = html.search('.truncate p').text.strip
poster = html.search('.film-poster img').attribute('src').value
{
title: title,
overview: overview,
poster_url: poster,
}
It has nothing to do with your Ruby code.
If you run something like this in your terminal:
curl https://letterboxd.com/film/glass-onion-a-knives-out-mystery/
you can see that the output HTML does not have the images you are looking for. You see them in your browser because, after that initial load, some JavaScript runs and loads more resources.
The AJAX call that loads the image you are looking for is https://letterboxd.com/ajax/poster/film/glass-onion-a-knives-out-mystery/std/500x750/?k=0c10a16c
Play with your browser's network inspector and you will be able to identify the different parts of the website and how each one loads.
Nokogiri does not execute JavaScript; however, the link has to be there, or at least there has to be a link to some API that returns it.
The first place I would look is the data attributes of the image element or its parent; in this case, however, it was hidden in an inline script along with some other interesting data about the movie.
First download the web page using curl or wget and open the file in a text editor to see what Nokogiri sees. Search for something you know about the file; I searched for the ce7ed2a83f part of the image URL and found the JSON.
Then the data can be extracted like this:
require 'nokogiri'
require 'open-uri'
require 'json'

url = "https://letterboxd.com/film/glass-onion-a-knives-out-mystery/"
serialized_html = URI.open(url).read
html = Nokogiri::HTML.parse(serialized_html)
# Grab the inline JSON-LD script tag, strip newlines, and pull out the {...} payload.
data_str = html.search('script[type="application/ld+json"]').first.to_s.gsub("\n", '').match(/{.*}/).to_s
data = JSON.parse(data_str)
data['image']
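The extraction step itself needs only the json stdlib. Here is the same strip-and-match pattern run against an inline snippet (the script content below is illustrative, not the real Letterboxd markup):

```ruby
require 'json'

# Illustrative stand-in for the inline <script type="application/ld+json"> tag.
script_tag = <<~HTML
  <script type="application/ld+json">
  { "name": "Glass Onion",
    "image": "https://a.ltrbxd.com/resized/film-poster/586723.jpg?v=ce7ed2a83f" }
  </script>
HTML

# Same trick as above: strip newlines, grab the outermost {...}, parse it.
data_str = script_tag.gsub("\n", '').match(/{.*}/).to_s
data = JSON.parse(data_str)
puts data['image']
```

The greedy /{.*}/ works here because the JSON object is the only braced span inside the tag; a page with several inline scripts would need a more careful selector first.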

How can I click on a specific link with Nokogori or Mechanize?

I know how to find an element using Nokogiri. I know how to click a link using Mechanize. But I can't figure out how to find a specific link and click it. This seems like it should be really easy, but for some reason I can't find a solution.
Let's say I'm just trying to click on the first result on a Google search. I can't just click the first link with Mechanize, because the Google page has a bunch of other links, like Settings. The search result links themselves don't seem to have class names, but they're enveloped in <h3 class="r"></h3>.
I could just use Nokogiri to follow the href value of the link like so:
document = open("https://www.google.com/search?q=stackoverflow")
parsed_content = Nokogiri::HTML(document.read)
href = parsed_content.css('.r').children.first['href']
new_document = open(href)
# href is equal to "/url?sa=t&rct=j&q=&esrc=s&source=web&url=https%3A%2F%2Fstackoverflow.com%2F"
but it's not a direct url, and going to that url gives an error. The data-href value is a direct url, but I can't figure out how to get that value - doing the same thing except with ...first['data-href'] returns nil.
Anyone know how I can just find the first .r element on the page and click the link inside it?
Here's the start to my action:
require 'open-uri'
require 'nokogiri'
require 'mechanize'
document = open("https://www.google.com/search?q=stackoverflow")
parsed_content = Nokogiri::HTML(document.read)
Here's the .r element on the Google search results page:
<h3 class="r">
Stack Overflow
</h3>
You should make sure the code in your question is what you actually ran - it looks like it is not, because the URL is not surrounded by quotes and the CSS selector should be .r a, not r. You use .r a because you want the link inside elements with the r class.
Anyway, you can use the approach detailed here like so:
require 'open-uri'
require 'nokogiri'
require 'uri'
base_url = "https://www.google.com/search?q=stackoverflow"
document = open(base_url)
parsed_content = Nokogiri::HTML(document.read)
href = parsed_content.css('.r').first.children.first['href']
new_url = URI.join base_url, href
new_document = open(new_url)
I tested this and following new_url does redirect to StackOverflow as expected.
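The part doing the real work is URI.join, which resolves a root-relative href against the page it came from (the href below is a shortened stand-in for Google's redirect link):

```ruby
require 'uri'

base_url = "https://www.google.com/search?q=stackoverflow"
href     = "/url?q=https%3A%2F%2Fstackoverflow.com%2F" # root-relative, as scraped

# Resolve the relative href against the page's own URL.
new_url = URI.join(base_url, href)
puts new_url
# => https://www.google.com/url?q=https%3A%2F%2Fstackoverflow.com%2F
```

Because the href starts with /, URI.join keeps the scheme and host from base_url and replaces the path and query, which is exactly the browser's behavior for such links.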

How to get the position of text using 'PDF-Reader' gem in Ruby

I am new to Ruby and we are using Ruby Selenium framework for automating the PDF verification testing.
I want to verify the content of PDF, like text and also get the position of the text. Along with that I also need to get the text at a given position.
Something like this, maybe:
require 'pdf-reader'
require 'open-uri'
reader = PDF::Reader.new(URI.open("SAMPLE_URL")) # my resume pdf
page = reader.pages.first
lines = page.text.split("\n")
text_match_line_numbers = (0...lines.length).select do |i|
  lines[i].include? "text"
end
Look at their docs here, there are more advanced options for navigating the PDF page.
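Setting the PDF aside, the line-matching pattern above is plain Ruby and can be sanity-checked on any array of lines (the sample lines are made up):

```ruby
lines = [
  "Jane Doe",                # line 0
  "Ruby developer",          # line 1
  "Experience: Ruby, Rails"  # line 2
]

# Indices of every line mentioning "Ruby".
matching = (0...lines.length).select { |i| lines[i].include?("Ruby") }
puts matching.inspect
# => [1, 2]
```

Note the parenthesized range (0...lines.length): writing it with square brackets would build a one-element array containing a Range, and select would then iterate over that single Range instead of the indices.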

Watir-webdriver is progressing through script before Nokogiri finishes scraping

There are three forms on the page. All forms default to "today" for their date ranges. Each form is iteratively submitted with a date from a range (1/1/2013 - 1/3/2013, for example) and the resulting table is scraped.
The script then submits the date to the next form in line and, again, the table is scraped. However, the scraping occurs before the dates are submitted.
I tried adding sleep 2 in between scrapes to no avail.
The script is here: https://gist.github.com/hnanon/de4801e460a31d93bbdc
The script appears to assume that Nokogiri and Watir will always be in sync. This is not correct.
When you do:
page = Nokogiri::HTML.parse(browser.html)
Nokogiri gets the browser HTML at that one specific point in time. If Watir makes a change to the browser (i.e. changes the HTML), Nokogiri will not know about it.
Each time you want to parse the html with Nokogiri, you need to create a new Nokogiri object using the browser's latest html.
An example to illustrate:
require 'watir-webdriver'
require 'nokogiri'
b = Watir::Browser.new
b.goto 'www.google.ca'
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will be the Google page
b.goto 'www.yahoo.ca'
p page
#=> This will still be the Google page
page = Nokogiri::HTML.parse(b.html)
p page
#=> This will now be the Yahoo page

HTML is read before fully loaded using open-uri and nokogiri

I'm using open-uri and nokogiri with ruby to do some simple webcrawling.
There's one problem: sometimes the HTML is read before it is fully loaded. In such cases, I cannot fetch any content other than the loading icon and the nav bar.
What is the best way to tell open-uri or nokogiri to wait until the page is fully loaded?
Currently my script looks like:
require 'nokogiri'
require 'open-uri'
url = "https://www.the-page-i-wanna-crawl.com"
doc = Nokogiri::HTML(open(url, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE))
puts doc.at_css("h2").text
What you describe is not possible. The result of open is only passed to Nokogiri::HTML after the open method has returned the full response.
I suspect that the page itself uses AJAX to load its content, as has been suggested in the comments. In that case you may use Watir to fetch the page with a real browser:
require 'nokogiri'
require 'watir'
browser = Watir::Browser.new
browser.goto 'https://www.the-page-i-wanna-crawl.com'
doc = Nokogiri::HTML.parse(browser.html)
This might open a browser window though.
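In practice you can also make Watir wait for the content itself before handing the HTML to Nokogiri, for example with something like browser.element(css: 'h2').wait_until(&:present?). The underlying idea is just a poll-until-timeout loop; here is a plain-Ruby sketch of that idea, with illustrative names rather than Watir's API:

```ruby
# Poll a block until it returns truthy or the timeout expires.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Time.now + timeout
  until yield
    raise "timed out" if Time.now > deadline
    sleep interval
  end
  true
end

loaded = false
Thread.new { sleep 0.3; loaded = true } # stands in for the page's AJAX finishing
result = wait_until { loaded }
puts result
# => true, once the background "page load" flips the flag
```

Watir's built-in waits do this for you against the live DOM, which is why they are the right tool here rather than a fixed sleep.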
