Heroku and Web scraping - ruby

I have a Nokogiri web scraper that writes to a database, and I'm trying to deploy it to Heroku. I have a Sinatra application front end that I want to read from that database. I'm new to Heroku and web development, and don't know the best way to handle something like this.
Do I have to place the web scraper script that uploads to the database under a Sinatra route (like mywebsite.com/scraper) and just make it so obscure that no one visits it? In the end, I'd like the Sinatra part to be a REST API that reads from the database.
Thanks for all input

There are two approaches you can take.
The first is to use one-off dynos: run the scraper from the console with heroku run YOURCMD. Just make sure the scraper writes to the database rather than to disk, since a dyno's filesystem is ephemeral.
More information:
https://devcenter.heroku.com/articles/one-off-dynos
The second is to separate the scraper from the web process: the web process handles normal UI interaction, and a scraper process runs alongside it, which the web process can spawn or talk to. If you take this route, it's up to you how to protect it from the rest of the world (authentication, URL obfuscation, etc.).
More information:
https://devcenter.heroku.com/articles/background-jobs-queueing

I did it by creating a rake task and using one-off dynos, as mentioned by XLII.
Here is my rake task file:
require 'bundler/setup'
Bundler.require
desc "Scrape Site"
task :scrape, [:companyname] => :environment do |t, args|
  puts "Company Name is: #{args[:companyname]}"
  agent = Mechanize.new
  agent.user_agent_alias = 'Mac Safari'
  puts "Agent (Mac Safari) created"
  # MORE SCRAPING CODE
end
You can run it by calling:
heroku run rake scrape[google]
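One detail worth noting about the task above: `:environment` is a task that Rails defines, so a plain Sinatra project has no such task unless the Rakefile declares one. The following is a minimal sketch of a self-contained Rakefile, not the answerer's actual file. It assumes a hypothetical `app.rb` that sets up the database connection (the `require_relative` line is commented out so the snippet runs on its own), and `SCRAPED` is only a stand-in for the database.

```ruby
require 'rake' # already loaded when this code lives in a Rakefile

# Stand-in for the database table the scraper would write to (hypothetical).
SCRAPED = []

# Rails defines :environment automatically; in a Sinatra project it has to be
# declared by hand. Loading the app here gives later tasks access to its code.
task :environment do
  # require_relative 'app' # hypothetical application file
end

desc 'Scrape one company'
task :scrape, [:companyname] => :environment do |_task, args|
  args.with_defaults(companyname: 'example')
  puts "Company name is: #{args[:companyname]}"
  SCRAPED << args[:companyname] # real code would insert a database row here
end
```

Some shells (zsh, for example) treat the square brackets as a glob pattern, so the argument may need quoting: heroku run rake 'scrape[google]'.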

Related

Launch sinatra from a test/spec or another ruby script

I'm experimenting, and I'm trying to launch dummy Sinatra application from RSpec and kill it when the spec is finished. Something like:
# spec/some_spec.rb
before(:all) do
  # launch sinatra dummy app
end
after(:all) do
  # kill sinatra dummy app
end
it 'should return list of whatever' do
  expect(JSON.parse(make_request('0.0.0.0:4567/test.json')))
    .to include('whatever')
end
I could use system("ruby test/dummy/dummy_app.rb"), but how can I kill only that process? Does anyone know how I can launch Sinatra inside a test (or from another Ruby script)? I know about WebMock, but I want to see if I can manage to make my test work this way.
Look under RSpec on "Testing Sinatra with Rack::Test". I'd suggest you use that code as boilerplate to get started.
Just add this to your describe block:
def app
Sinatra::Application
end
I would suggest you read up on RSpec.
Since you want to test an external system, by the looks of your comment, instead of system "curl whatever.com", you can use Net::HTTP to make requests and then test against the response.
Have a look at "Testing an external API using RSpec's request specs".
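To make the Net::HTTP suggestion concrete, here is a small sketch that requests a JSON document and parses it, so a spec can assert on the result. The `serve_once` and `fetch_json` helpers are hypothetical names, and the tiny TCP server only stands in for the external service so that the example runs without Sinatra.

```ruby
require 'net/http'
require 'json'
require 'socket'

# Stand-in for the external service: answers a single request with JSON.
def serve_once(server, body)
  Thread.new do
    client = server.accept
    # Consume the request line and headers up to the blank line.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    client.write("HTTP/1.1 200 OK\r\n" \
                 "Content-Type: application/json\r\n" \
                 "Content-Length: #{body.bytesize}\r\n" \
                 "Connection: close\r\n\r\n#{body}")
    client.close
  end
end

# Fetch a URL and parse the JSON body; raises unless the status is 2xx.
def fetch_json(url)
  response = Net::HTTP.get_response(URI(url))
  response.value # raises an HTTP error for non-2xx responses
  JSON.parse(response.body)
end

server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
thread = serve_once(server, JSON.generate(['whatever']))
result = fetch_json("http://127.0.0.1:#{server.addr[1]}/test.json")
thread.join
server.close
puts result.inspect
```

In a spec, the last lines would become something like expect(fetch_json(url)).to include('whatever').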
Since I'm writing request specs to make sure the features don't break, I decided to write separate Cucumber features instead. The nice thing is that I can use Capybara and, thanks to Selenium WebDriver, launch a server before I run my tests.
So I created a dummy Sinatra application that represents the external service to which the code under test makes requests (including a nasty system('curl whatever.com')).
All I have to do is stub out the methods passed to curl to use Capybara.current_session.server.host and Capybara.current_session.server.port.
Once I'm done refactoring, all I have to do is remove the Capybara server variables and Selenium WebDriver from the Cucumber/Capybara configuration.
After that small change the tests will still work and still be valid.
Update
In the end I wrote it all as RSpec request tests, as doing it in Cucumber was a bit time-consuming and I had already spent too much time on this.
I mark these request tests with an RSpec tag, and before I launch them I manually launch a simple Sinatra/Grape dummy API application to which the requests are made. (Then I run the RSpec tests with this tag.)
So basically I end up with specs for the functionality that uses net/http, which use WebMock and don't need a server, and request tests for which I need to run the server before I run the specs. So the original question remains: how to launch a server before the tests start.
After I cover all the functionality I'm going to rewrite the curl call to use net/http; however, I'm going to keep those request specs, as I discovered they are a nice idea when it comes to crazy API scenarios (like testing HTTPS + digest authentication).
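For the question that remains open above (starting a server before the specs and stopping only that process), one possible approach is to spawn the dummy app as a child process and keep its PID. This is a sketch under stated assumptions, not something taken from the thread: the helper names are hypothetical, the path test/dummy/dummy_app.rb and port 4567 come from the question, and sending TERM assumes a POSIX-like system.

```ruby
require 'rbconfig'
require 'socket'

# Start a command without blocking and return the child's PID.
# `command` is an argv array, e.g. [RbConfig.ruby, 'test/dummy/dummy_app.rb'].
def start_process(command)
  Process.spawn(*command, out: File::NULL, err: File::NULL)
end

# Block until something accepts TCP connections on host:port.
def wait_for_port(host, port, timeout: 10)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  begin
    TCPSocket.new(host, port).close
  rescue Errno::ECONNREFUSED
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "nothing listening on #{host}:#{port}"
    end
    sleep 0.1
    retry
  end
end

# Terminate exactly that child and reap it so no zombie is left behind.
def stop_process(pid)
  Process.kill('TERM', pid)
  Process.wait(pid)
end

# In a spec this could look roughly like:
#
#   before(:all) do
#     @pid = start_process([RbConfig.ruby, 'test/dummy/dummy_app.rb'])
#     wait_for_port('127.0.0.1', 4567)
#   end
#
#   after(:all) { stop_process(@pid) }
```

Unlike system, Process.spawn returns immediately with the PID, which is what makes it possible to kill that one process later.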

Accessing Sinatra application scope from Rake task

I have a global variable in a Sinatra application that I want to update with a scheduled task from my Rakefile. Note that the application is hosted on Heroku. I have set up helpers in order to access this variable.
get '/' do
  @@var
end
helpers do
  def get_var
    return @@var
  end
  def set_var(value)
    @@var = value
  end
end
Here is the task in my Rakefile:
task :do_something do
  Sinatra::Application.set_var(get_data)
end
def get_data
  # Retrieve value from external source
  ...
  return val
end
The problem I run into is that the task executes properly, but the variable in the Sinatra application is never updated. I assume this is because calling Sinatra::Application from within the Rakefile actually creates a separate instance of the application from the primary instance that I am trying to update.
I want to know if there is a way to access the scope of the running Sinatra web app from within a Rakefile task.
*Note: I could just write the value retrieved in the scheduled task to a database and then access it from the Sinatra app, but that would be overkill because this variable is updated so infrequently but retrieved so regularly that I would prefer to keep it in memory for easier access. I have looked into Memcache and Redis in order to avoid turning to a database, but again I feel that this would be excessive for a single value. Feel free to disagree with this.
EDIT: Regarding Alexey Sukhoviy's comment: Heroku does not allow writing to files outside of the tmp directories, and these are not kept alive long enough to satisfy the needs of the application.
I ended up storing the variable using Memcache. This plays well with Heroku since they provide a free add-on for it. The Dalli gem provides a simple interface with Ruby for Memcache. In my Sinatra app file, I set the following options:
require 'dalli'
set :cache, Dalli::Client.new
I am then able to set the stored variable from the Rakefile:
task :do_something do
  Sinatra::Application.settings.cache.set('var', get_data)
end
I can then access this variable again in my Sinatra controller:
settings.cache.get('var')
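The behaviour described in the question, where the task runs but the web app's variable never changes, follows from the task running in a different process from the web server (on Heroku, scheduled tasks run in one-off dynos). Each process has its own memory, which is why an external store such as Memcache is needed. The sketch below illustrates this with a hypothetical `Holder` class and fork, so it assumes a POSIX system.

```ruby
# Holds a value in a class variable, as the app in the question does.
class Holder
  @@var = 'initial'

  def self.var
    @@var
  end

  def self.var=(value)
    @@var = value
  end
end

# Change the value in a child process and report what each process sees.
# Returns [value_in_child, value_in_parent].
def values_after_child_update
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    Holder.var = 'updated' # only changes the child's copy of the memory
    writer.puts(Holder.var)
    writer.close
    exit!(0) # skip at_exit hooks inherited from the parent
  end
  writer.close
  child_value = reader.gets.chomp
  reader.close
  Process.wait(pid)
  [child_value, Holder.var]
end

child_value, parent_value = values_after_child_update
puts "child sees #{child_value}, parent still sees #{parent_value}"
```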

Ruby/Cucumber/Capybara Testing Multipart File Uploads

I'm using Cucumber/Capybara to test a web application. I'm pretty much a complete beginner in Ruby, and it's a real testimony to the developers of Cucumber/Capybara just how far I have been able to test my application with only the minuscule amount of Ruby knowledge that I have.
However, as you've probably guessed, I've reached the point where I need some expert help. I need to test a multipart file upload. The problem is that the web application that I'm testing has a URL command interface, but no associated pages. So I can't just load the page, fill in a parameter and push a button. I have to format the POST command programmatically.
Up until now, I have been interacting with the application exclusively using 'visit', i.e. I have step definitions such as:
Given /^I delete an alert with alertID "([^"]*)" from the site$/ do |alertID|
  visit WEB_SITE_ROOT + "/RemoteService?command=deleteAlert&siteName=#{$Site}&alertID=#{alertID}"
end
But now I need to do some posts. I found some code that seems to do what I need:
Given /^I upload the "([^"]*)" file "([^"]*)" for the alert$/ do |fileType, fileName|
  file = File.new(fileName, "rb")
  reply = RestClient.post(
    "#{WEB_SITE_ROOT}/FileUploader?command=upload&siteName=#{$Site}&alertID=#{$OriginalAlertID}",
    :pict => file,
    :function => fileType,
    :content_type => 'multipart/jpg'
  )
end
But this does not run in the same Cucumber/Capybara session, and so is not authorised (one of the previous steps was a login). Also, the reply from the web application is not picked up by Cucumber/Capybara, and so my tests for success/failure do not work.
Can someone please point me in the right direction?
By default Capybara uses the Rack::Test adapter, which bypasses the HTTP server and interacts with your Rack app directly. The POST request in your step doesn't go through Capybara, so it doesn't share the logged-in session, which is why it fails.
To upload files when using Rack::Test you'll need to use the Rails #fixture_file_upload method, which by default should be available in your Cucumber steps.
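If the application under test is a remote server rather than a Rack app in the same process, another option is to send the multipart POST yourself and attach the session cookie from the login step. The sketch below swaps in Ruby's standard Net::HTTP for RestClient; it is not what the answer above describes. The `session_cookie` value is hypothetical (how to read it depends on the Capybara driver and is not shown), the helper names are made up, and the small TCP server only stands in for the web application so the example can run locally.

```ruby
require 'net/http'
require 'socket'
require 'stringio'

# Send a multipart/form-data POST that carries an existing session cookie.
# `session_cookie` is a hypothetical "name=value" string from the login step.
def upload_with_session(url, io, filename:, file_type:, session_cookie:)
  uri = URI(url)
  request = Net::HTTP::Post.new(uri)
  request['Cookie'] = session_cookie
  request.set_form(
    [
      ['function', file_type],
      ['pict', io, { filename: filename, content_type: 'image/jpeg' }]
    ],
    'multipart/form-data'
  )
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Local stand-in for the web application: records one request's headers and
# body, answers 200, and returns [headers, body] as the thread's value.
def capture_one_request(server)
  Thread.new do
    client = server.accept
    headers = {}
    while (line = client.gets) && line != "\r\n"
      name, value = line.chomp.split(': ', 2)
      headers[name.downcase] = value if value
    end
    body = client.read(headers.fetch('content-length', '0').to_i)
    client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nOK")
    client.close
    [headers, body]
  end
end

server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
capture = capture_one_request(server)
reply = upload_with_session(
  "http://127.0.0.1:#{server.addr[1]}/FileUploader?command=upload",
  StringIO.new('fake image bytes'),
  filename: 'photo.jpg', file_type: 'photo', session_cookie: 'session=abc123'
)
headers, _body = capture.value
server.close
puts "status #{reply.code}, cookie received: #{headers['cookie']}"
```

Because the step receives the response object, assertions on success or failure can be made against reply.code and reply.body directly.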

API polling bot on heroku

I want to create a bot which makes an API request every minute to some API URL. It then needs to ping a particular user if the data entry against their name in the API feed has changed. I want to go for a free solution on Heroku. Can this be achieved?
Yes. Heroku supports Thin as a web server, which is EventMachine-enabled, so an easy way to do this is to write a quick Sinatra app and use EM.add_periodic_timer for your API calls. When you deploy this Sinatra app to Heroku, it'll use Thin by default, so there's no extra configuration needed. You can test locally via thin start -p 4567, assuming your config.ru is correct. Here's a pretty standard one, assuming your app is in app.rb:
require 'bundler/setup'
Bundler.require :default
require File.expand_path('app', File.dirname(__FILE__))
run Sinatra::Application
I currently check the status of some sites for free on Heroku. The secret? rufus-scheduler.
Install the gem:
gem install rufus-scheduler
Make sure you include the gem in your Gemfile, or however you manage your gems.
Then create a file called task_scheduler.rb and put it in your initializers directory:
require 'net/http'
require 'rufus/scheduler'
# start_new is the rufus-scheduler 2.x API; 3.x uses Rufus::Scheduler.new
scheduler = Rufus::Scheduler.start_new
scheduler.every '1m' do
  url = "http://codeglot.com"
  response = Net::HTTP.get_response(URI.parse(url))
  # do stuff with response.body
end
If you have any trouble you can see this blog post:
http://intridea.com/2009/2/13/dead-simple-task-scheduling-in-rails?blog=company
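Both answers cover the scheduling side; the change-detection step the question asks about could be sketched as follows. This is an illustration under assumptions, not part of either answer: the feed is assumed to parse into a Hash of user name to entry, and `fetch` and `notify` are hypothetical callables standing in for the API request and the ping.

```ruby
# Compare two snapshots of the feed, each a Hash of user name => entry.
# Returns the names whose entry is new or different in `current`.
def changed_names(previous, current)
  current.reject { |name, entry| previous.key?(name) && previous[name] == entry }.keys
end

# One polling step: fetch the feed, notify every user whose entry changed,
# and return the new snapshot to be passed in on the next call.
def poll_once(previous, fetch:, notify:)
  current = fetch.call
  changed_names(previous, current).each { |name| notify.call(name, current[name]) }
  current
end

# Demo with canned data instead of a real API.
feed = [{ 'alice' => 1, 'bob' => 2 }, { 'alice' => 1, 'bob' => 3 }]
fetch = -> { feed.shift }
notify = ->(name, entry) { puts "ping #{name}: entry is now #{entry}" }

snapshot = fetch.call # seed the first snapshot without notifying anyone
snapshot = poll_once(snapshot, fetch: fetch, notify: notify)
```

Inside the scheduler block, poll_once would be called once per tick with the snapshot kept in a variable outside the block. Since that snapshot lives in memory, it is lost whenever the dyno restarts.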

rails 3 access to model and mailer from external script

As in the title: how can I access a model and a mailer from a Ruby script? I want to do some things that are not web-related and can't be done via Rails, but are bound to the database used by Rails. For example, I want to check when somebody's premium ends and send them an email 7 days before.
Best regards
Set up a rake task which you can call and check the output of.
In lib/tasks/appname.rake:
task :do_some_stuff => :environment do
  # any code you want in here.
  puts "some output"
end
You can then call it from the command line with
rake do_some_stuff
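Because the task depends on :environment, the code inside it can use the application's models and mailers. The selection logic for the reminder in the question could look like the sketch below. It uses a plain Struct so that it runs without Rails; `Subscriber`, `premium_ends_on` and `subscribers_to_remind` are hypothetical names, and in a real app the list would come from a model query and each result would be handed to a mailer.

```ruby
require 'date'

# Hypothetical stand-in for the Rails model.
Subscriber = Struct.new(:email, :premium_ends_on)

# Select subscribers whose premium ends exactly `days_before` days from today.
def subscribers_to_remind(subscribers, today: Date.today, days_before: 7)
  reminder_date = today + days_before
  subscribers.select { |subscriber| subscriber.premium_ends_on == reminder_date }
end

subscribers = [
  Subscriber.new('a@example.com', Date.today + 7),
  Subscriber.new('b@example.com', Date.today + 30)
]
subscribers_to_remind(subscribers).each do |subscriber|
  puts "would email #{subscriber.email}" # a mailer call would go here
end
```

Matching on the exact date assumes the task runs once a day; if a run is skipped, the reminders for that day are missed.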

Resources