Here is some sample code:
require 'nokogiri'
require 'open-uri'
begin
  doc = Nokogiri::HTML(open(url))
rescue
  puts "An error occurred..."
end
Some parts of the page are loaded asynchronously, so I'm missing some values that are loaded later. Is there any way to open the URL, wait 10 seconds, and then assign the result to the variable doc? Any solutions or ideas using bash/lynx/wget are welcome too :)
Unfortunately, waiting 10 seconds isn't going to work, because neither open-uri nor Nokogiri executes the JavaScript that loads content asynchronously. You'll need a browser driver such as Watir or Watir-webdriver. If JRuby is an option, you can use Celerity, a browser emulator that supports some JavaScript (using the Watir API) and will likely perform the task you need.
Related
To set up a lab, I am running some tests with Sinatra:
require 'sinatra'
require 'open-uri'
get '/' do
  format 'RESPONSE: %s', open(params[:url]).read
end
It works with an HTTP(S) URL, for example http://localhost:4567/?url=https://www.google.com, but I'm surprised that it doesn't work with other schemes (wrappers) such as file://, for example http://localhost:4567/?url=file:///etc/passwd.
On my system I can of course do curl file:///etc/passwd and it works.
Any ideas?
Thanks
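A likely explanation, offered as a sketch rather than a definitive answer: open-uri only adds a public open method to the URI classes it supports (HTTP and HTTPS among them), and a file:// URI is not one of those, so the string falls through to the ordinary file-opening path and is treated as a literal file name. The helper name openable_by_open_uri? below is made up for illustration; it only checks whether the parsed URI object responds to open and makes no network or file access.

```ruby
require 'open-uri'

# Hypothetical helper: true when open-uri has added a public `open`
# method to the class of the parsed URI.
def openable_by_open_uri?(url)
  URI.parse(url).respond_to?(:open)
end

# URLs taken from the question above.
%w[https://www.google.com file:///etc/passwd].each do |url|
  puts "#{url} handled by open-uri: #{openable_by_open_uri?(url)}"
end
```

Unlike curl, open-uri does not resolve file:// URLs itself; a local file would have to be read explicitly, for instance with File.read on the path component of the parsed URI.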
I am trying to build a simple crawler that can log in to Pinterest and pin a few things to my board.
The first step is to log in successfully. I read through the documentation, and it seems like this should work, but it doesn't.
When I run the code I expect it to print out a title like "Mary... is mary... on Pinterest"
But instead the title of the page is "Pinterest-The Visual Discovery Tool"
I think there's something wrong with my script.
require 'rubygems'
require 'mechanize'
require 'pry'
a = Mechanize.new
a.get('https://www.pinterest.com/login/') do |page|
  form = page.forms.first
  form.fields[0].value = "m...#gmail.com"
  form.fields[1].value = "some_password"
  new_page = form.submit
  puts new_page.title
end
Keep in mind that Mechanize cannot execute JavaScript, so a page that depends on JavaScript may not load correctly. Although I only skimmed the page source, it looks heavily dependent on JavaScript and therefore can't be crawled effectively with Mechanize.
Another option might be to drive a real (optionally headless) browser with Watir or Selenium.
I am trying to use Watir to get the page source of Facebook after authenticating. It fails with this error:
/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/net/protocol.rb:158:in `rescue in rbuf_fill': Net::ReadTimeout (Net::ReadTimeout)
I believe that because the homepage makes so many AJAX requests, WebDriver considers the page not fully loaded. So after logging in, I did this:
p "starts"
Watir::Wait.until {
  browser.div(:'class' => '_586i').exists?
}
p "finishes"
But after printing "starts", it raises a timeout error and never gets the page source.
I've been getting this error quite a lot on some websites after calling, e.g., browser.button.click when the click redirects to another page that is heavily loaded with AJAX. I found that the following, with an adjusted sleep (or, much better, .wait_until_present), helps:
browser.execute_script("document.getElementsByTagName('button')[0].click()")
sleep 10
You can make the script wait until all AJAX calls have completed (on pages that use jQuery) with:
sleep(1) until browser.execute_script("return jQuery.active") == 0
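The one-liner above loops forever if the counter never reaches zero. A minimal sketch of the same polling idea with an upper time limit follows; wait_until is a hypothetical helper written for this example (not part of Watir), and the counter in the demonstration merely stands in for jQuery.active so the snippet runs without a browser.

```ruby
require 'timeout'

# Hypothetical helper: polls the block until it returns a truthy value,
# raising Timeout::Error once `timeout` seconds have elapsed.
def wait_until(timeout: 30, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise Timeout::Error, "condition not met within #{timeout}s"
    end
    sleep interval
  end
end

# Assumed usage with a Watir browser on a page that loads jQuery:
#   wait_until(timeout: 20) { browser.execute_script('return jQuery.active') == 0 }

# Stand-in demonstration: `pending` plays the role of jQuery.active.
pending = 3
wait_until(timeout: 5, interval: 0.01) { (pending -= 1) == 0 }
puts "pending requests: #{pending}"
```

Bounding the wait means a page that keeps a request open indefinitely produces a clear timeout instead of a hung script.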
I want to capture a screenshot of the browser's URL bar.
browser.screenshot.save('tdbank.png')
This saves the page content rendered inside the browser, but I want to capture the address bar at the top of the browser window. Any suggestions?
Sometimes the URL uses http and sometimes https. I want to capture this in a screenshot and archive it. I know I could get it through
url = browser.url
and then do some comparison, but I need this for legal purposes, so it has to be done by taking a screenshot.
Thanks in advance.
If you're on Windows, you could use the win32screenshot gem. For example:
require 'watir-webdriver'
require 'win32/screenshot'
b = Watir::Browser.new # using firefox as default browser
b.goto('http://www.example.org')
Win32::Screenshot::Take.of(:window, :title => /Firefox/).write("image.bmp")
b.close
I am trying to parse the URL returned from the Foursquare API (the callback URL). The problem is that the request comes in this format:
0.0.0.0:4567/foursquare#access_token=KCZGA4JIR4N3QXXAASZTZRYWHU2TYJITM53LARSKHRVFPHQ
As you can see, that hash fragment is wreaking havoc in my code because it is nowhere to be found in request.url, or in the whole request object for that matter.
Has anyone solved this? I am not trying to authenticate, I already do that from inside the iOS app.
require 'sinatra'
require 'json'
require 'dm-core'
require 'dm-validations'
require 'dm-timestamps'
require 'dm-migrations'
require 'dm-ar-finders'
# where foursquare sent us after authorization
get "/foursquare" do
  puts "Receiving ..." + request.url
end
Probably not what you want to hear, but a quick fix would be to let your Sinatra app (I'm assuming Sinatra, given your port number) do the authentication instead of the iOS app. That way you can take advantage of the omniauth-foursquare gem, https://github.com/arunagw/omniauth-foursquare/blob/master/lib/omniauth/strategies/foursquare.rb, which will do most of the parsing for you.
Foursquare's API page, https://developer.foursquare.com/resources/client, also recommends web-based authentication.
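For context, the part of a URL after the # (the fragment) stays on the client and is not sent to the server in the HTTP request, which is why request.url never contains it. If the full callback URL is available as a string somewhere (for example, handed over by the client), a minimal sketch of extracting the token with Ruby's standard library could look like this. The helper name fragment_params, the http:// prefix, and the EXAMPLE_TOKEN value are assumptions made for illustration.

```ruby
require 'uri'

# Hypothetical helper: parses the fragment of a URL as form-encoded
# key/value pairs and returns them as a Hash (empty when no fragment).
def fragment_params(url)
  fragment = URI.parse(url).fragment
  return {} if fragment.nil? || fragment.empty?
  URI.decode_www_form(fragment).to_h
end

# Placeholder callback URL in the same shape as the one in the question.
callback = 'http://0.0.0.0:4567/foursquare#access_token=EXAMPLE_TOKEN'
puts fragment_params(callback)['access_token']
```

This only helps when something other than the incoming request supplies the URL; the Sinatra route itself still never sees the fragment.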