I have Padrino caching working in my app, e.g.
get :blog, cache: true do
  # do a blog listing
end
But when the listings are paginated with will_paginate, the cache can't tell the difference between /blog and /blog?page=2, and always renders the cached copy of /blog. Is there any way to make it cache per URL rather than per route?
Some spelunking in the Padrino issues provides this answer, which seems to work:
get :blog, cache: Padrino.config.cache do
  cache_key { request.path_info + '?' + params.slice('page').to_param }
  # do blog listing
end
The structure of the Padrino documentation seems to have changed since then, so the change made by the PR at the end of that issue no longer appears in the current documentation.
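To make the idea behind that cache_key block concrete, here is a minimal plain-Ruby sketch of building a key per URL rather than per route. The helper name page_cache_key and the list of "significant" parameters are assumptions made for illustration; they are not part of Padrino.

```ruby
require 'uri'

# Hypothetical helper: build a cache key from the request path plus only the
# query parameters that change the rendered output (here just "page").
def page_cache_key(path, params, significant = %w[page])
  relevant = params.select { |name, _| significant.include?(name.to_s) }
  return path if relevant.empty?
  # Sort so the key does not depend on parameter order.
  "#{path}?#{URI.encode_www_form(relevant.sort_by { |name, _| name.to_s })}"
end

puts page_cache_key('/blog', {})
puts page_cache_key('/blog', { 'page' => '2' })
puts page_cache_key('/blog', { 'page' => '2', 'utm_source' => 'feed' })
```

Ignoring unrelated parameters such as tracking tags keeps the cache from filling up with duplicate copies of the same page.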
I'm probably doing something silly or missing a step, but I can't seem to make digest caching work the way I believe it should.
My understanding is that, in Rails 4, doing this:
- cache ['v1', @article] do
  = render :partial => "show_article", :locals => { :article => @article }
Should build a cache digest that includes an MD5 of the view. And I see something like that in my logs:
Write fragment views/v1/articles/198-20130904195924000000000/2c68729b145522780d64dee67957c0e3
But, if I later change show_article.haml:
%h2 This should change the view's MD5.
Then reload the same page, I get:
Read fragment views/v1/articles/198-20130904195924000000000/2c68729b145522780d64dee67957c0e3
instead of a fresh render. Isn't the whole idea of digest caching that I DON'T have to update the "v1" string every time I edit a view file?
Or am I misunderstanding this?
This is made all the more difficult because in Rails 3 I could do this when using the cache_digests gem:
rake cache_digests:nested_dependencies TEMPLATE=articles/show
But that rake task doesn't exist in Rails 4, even though the cache_digests gem is now part of it.
I updated Rails from 4.0.0 to 4.0.2 and the cache digest appears to be working correctly!
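For anyone unsure what behaviour to expect, the sketch below illustrates the principle only. It is not Rails' implementation (which also takes template dependencies into account); the fragment_key helper and the "original" template text are assumptions. It just mimics the shape of the key in the log lines above: prefix, record key, then a digest of the template source.

```ruby
require 'digest/md5'

# Illustration only: this is NOT Rails' implementation. It mimics the shape of
# the fragment key from the log above: prefix / record key / template digest.
def fragment_key(prefix, record_key, template_source)
  [prefix, record_key, Digest::MD5.hexdigest(template_source)].join('/')
end

record_key = 'articles/198-20130904195924000000000'
old_key = fragment_key('views/v1', record_key, "%h2 Original heading\n")
new_key = fragment_key('views/v1', record_key, "%h2 This should change the view's MD5.\n")

puts old_key
puts new_key
puts old_key == new_key # false: editing the template yields a new key
```

Because the last segment depends on the template text, editing the partial should produce a different key and therefore a cache miss, with no need to bump the "v1" string.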
I'm building a simple app on the side using an API I made with Sinatra that returns some JSON. It's quite a bit of JSON; my app's API relies on a few hundred requests to other APIs.
I can probably cache the results for 5 days or so, no problem with the data at all. I'm just not 100% sure how to implement the caching. How would I go about doing that with Sinatra?
Personally, I prefer Redis over memcached for this type of thing. I have an app that uses Redis pretty extensively, in a way similar to what you described. If I make a call that is not cached, page load time is upwards of 5 seconds; with Redis, it drops to around 0.3 seconds. You can also set an expiry time, which is easy to change. I would do something like this to retrieve the data from the cache:
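require 'redis'

redis = Redis.new # one shared connection instead of one per request

get '/my_data/:id' do
  cached = redis.get(params[:id])
  halt 404 unless cached # nothing cached for this id yet
  content_type :json
  cached # return the JSON string itself; send_file expects a file path
end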
Then, when you want to save the data to the cache, perhaps something like this:
require 'redis'
redis = Redis.new
# <make API calls here and build your JSON>
redis.set(id, json)
redis.expire(id, 3600 * 24 * 5) # expire after 5 days
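The read and write halves can be combined into a single "fetch or build" helper. The following is a minimal sketch under stated assumptions: MemoryStore is a stand-in written for this example so the code runs without a Redis server, fetch_json and the key my_data:42 are hypothetical names, and the only client calls relied on are get and setex, which the redis gem also provides.

```ruby
# Stand-in for a Redis client so the sketch runs without a server. It offers
# only `get` and `setex`, which the redis gem's client also provides.
class MemoryStore
  def initialize
    @data = {}
  end

  def get(key)
    value, expires_at = @data[key]
    return nil if value.nil? || Time.now >= expires_at
    value
  end

  def setex(key, ttl_seconds, value)
    @data[key] = [value, Time.now + ttl_seconds]
  end
end

FIVE_DAYS = 3600 * 24 * 5

# Hypothetical helper: return the cached JSON, or build it and store it.
def fetch_json(store, key, ttl = FIVE_DAYS)
  cached = store.get(key)
  return cached if cached
  fresh = yield
  store.setex(key, ttl, fresh)
  fresh
end

store = MemoryStore.new
calls = 0
2.times { fetch_json(store, 'my_data:42') { calls += 1; '{"id":42}' } }
puts calls # the expensive block ran only once
```

In a real app you would pass Redis.new instead of MemoryStore.new, and the block would hold the few hundred upstream API calls.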
get '/my_data/:id' do
  # security check for file-based caching
  raise "invalid id" if params[:id] =~ /[^a-z0-9]/i
  cache_file = File.join("cache", params[:id])
  if !File.exist?(cache_file) || (File.mtime(cache_file) < (Time.now - 3600*24*5))
    data = do_my_few_hundred_internal_requests(params[:id])
    File.open(cache_file, "w") { |f| f << data }
  end
  send_file cache_file, :type => 'application/json'
end
Don't forget to mkdir cache.
Alternatively, you could use memcache-client, but it requires memcached to be installed system-wide.
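The file-based route above can also be pulled out into a helper that is independent of Sinatra. This is a sketch, not a drop-in: cached_file is a hypothetical name, it creates the cache directory itself rather than relying on a manual mkdir, and it writes to a temporary file first so a reader never sees a half-written cache file.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: return the path of a cache file that is at most
# max_age seconds old, regenerating it with the block when missing or stale.
def cached_file(dir, id, max_age)
  raise ArgumentError, 'invalid id' unless id.match?(/\A[a-z0-9]+\z/i)
  FileUtils.mkdir_p(dir) # replaces the manual "mkdir cache" step
  path = File.join(dir, id)
  if !File.exist?(path) || File.mtime(path) < Time.now - max_age
    tmp = "#{path}.#{Process.pid}.tmp"
    File.write(tmp, yield)
    File.rename(tmp, path) # swap in the finished file in one step
  end
  path
end

Dir.mktmpdir do |root|
  dir = File.join(root, 'cache')
  builds = 0
  2.times { cached_file(dir, 'abc123', 3600 * 24 * 5) { builds += 1; '{"ok":true}' } }
  puts builds
  puts File.read(File.join(dir, 'abc123'))
end
```

Inside the route you would then call send_file on the returned path, as in the original answer.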
I have a Mechanize-based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plug into Mechanize and transparently cache fetched pages, images and so on.
Anyone know of a library that will do this? Or another way of achieving the same outcome (script runs much quicker second time round)?
A good way of doing this type of thing is to use the (AWESOME) VCR gem.
Here's an example of how you would do it:
require 'vcr'
require 'mechanize'
# Setup VCR's configs. The cassette library directory is where
# all of your "recordings" are saved as YAML files.
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end
# Make a request...
# The first time you do this it will actually make the call out.
# Subsequent calls will read the cassette file instead of hitting the network.
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end
As you can see... VCR records the communication as a YAML file on the first run:
mario$ find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml
If you want to have VCR create new versions of the cassettes, just delete the corresponding file.
I'm not sure that caching the pages is going to help that much. What will help more is to have a record of previously visited URLs so you don't revisit them repeatedly. The page caching is moot because you should have already grabbed the important information when you saw the page the first time so all you need to do is check to see if you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.
I used to write analytical spiders using Perl's Mechanize. Ruby's Mechanize is based on it. Storing the previously visited URLs in SOME sort of cache was useful, like a hash, but, because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.
I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.
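As a rough illustration of "get it on the drive", here is a dependency-free sketch that records visited URLs in an append-only text file. The class name VisitedLog and the file name are assumptions; a real spider would more likely use SQLite or Postgres as suggested above, but the principle of writing each URL out as soon as it is seen is the same.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical sketch: a visited-URL record that survives restarts by
# appending each URL to a plain text file.
class VisitedLog
  def initialize(path)
    @path = path
    @seen = File.exist?(path) ? Set.new(File.readlines(path, chomp: true)) : Set.new
  end

  def visited?(url)
    @seen.include?(url)
  end

  # Returns true the first time a URL is recorded, false afterwards.
  def mark(url)
    return false unless @seen.add?(url)
    File.open(@path, 'a') { |f| f.puts(url) } # written out immediately
    true
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'visited.txt')
  VisitedLog.new(path).mark('http://example.com/a')
  # A new instance simulates the spider restarting after a crash.
  puts VisitedLog.new(path).visited?('http://example.com/a')
end
```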
Something else I'd recommend is using a YAML file to configure your app. Put every parameter that is likely to change during the app's run in there. Then write the app so it periodically checks that file's modification time and reloads it if there's been a change. That way, you can adjust its run-time behavior on the fly. Several years ago I had to write a spider to analyze a Fortune 50 corporation's multiple websites. The app ran for three weeks, spidering many different sites tied to that corporation, and because I could tweak the regex that controlled which pages the app processed, I could fine-tune it without shutting the app down.
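A minimal sketch of that reload-on-change idea follows. The class name ReloadableConfig, the file name spider.yml, and the url_pattern key are all made up for the example; the only library calls used are YAML.load_file and File.mtime.

```ruby
require 'yaml'
require 'tmpdir'

# Hypothetical sketch: re-read a YAML file whenever its mtime changes.
class ReloadableConfig
  def initialize(path)
    @path = path
    @mtime = nil
    @data = {}
  end

  # Call this periodically, e.g. once per page processed.
  def [](key)
    reload_if_changed
    @data[key]
  end

  private

  def reload_if_changed
    mtime = File.mtime(@path)
    return if mtime == @mtime
    @data = YAML.load_file(@path) || {}
    @mtime = mtime
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'spider.yml')
  File.write(path, "url_pattern: 'products'\n")
  config = ReloadableConfig.new(path)
  puts config['url_pattern']
  File.write(path, "url_pattern: 'products|press'\n")
  File.utime(Time.now, Time.now + 1, path) # force a newer mtime for the demo
  puts config['url_pattern']
end
```

The spider would then build its filter with something like Regexp.new(config['url_pattern']) each time it decides whether to process a page.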
If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.
# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]
# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)
I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages. e.g.:
# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
# for in-memory storage, you could use a Hash.
# or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)
  def get_cached(uri)
    cache_key = "_cache/#{uri}"
    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page
    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page
    end
  end
end
Which you could use like this:
require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'
storage = {}
foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")
pp storage
Which prints the following information:
D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
[#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
{"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"87",
"connection"=>"close",
"content-type"=>"text/plain"},
"Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
"200"],
"_cache/http://ifconfig.me/encoding"=>
[#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
{"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"42",
"connection"=>"close",
"content-type"=>"text/plain"},
"gzip,deflate,identity\n",
"200"]}
How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?
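One way to get each page into its own file is a small storage object that responds to [] and []=, which is all the get_cached example earlier in this thread asks of its storage. The sketch below is an assumption-laden illustration: FileStorage is a hypothetical class, it serializes values with Marshal, and I have not verified that every object Mechanize returns (for example its response headers) round-trips cleanly through Marshal.

```ruby
require 'digest/sha1'
require 'fileutils'
require 'tmpdir'

# Hypothetical storage backend: one Marshal-serialized file per key.
# Only load files your own script wrote; Marshal.load is unsafe on untrusted data.
class FileStorage
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def [](key)
    path = path_for(key)
    File.exist?(path) ? Marshal.load(File.binread(path)) : nil
  end

  def []=(key, value)
    File.binwrite(path_for(key), Marshal.dump(value))
  end

  private

  # Hash the key so URLs with slashes become safe file names.
  def path_for(key)
    File.join(@dir, Digest::SHA1.hexdigest(key))
  end
end

Dir.mktmpdir do |dir|
  storage = FileStorage.new(File.join(dir, 'pages'))
  storage['_cache/http://example.com/'] = ['http://example.com/', {}, "hello\n", '200']
  p storage['_cache/http://example.com/']
end
```

With the pages on disk, the "run" step fills the directory once and the "tweak" step can be repeated against it without touching the network.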
Right now, I do the following so that I can use options.base_url in index.haml:
get '/' do
  set :base_url, "#{request.env['rack.url_scheme']}://#{request.env['HTTP_HOST']}"
  # ...
  haml :index
end
But I am sure there is a far better, DRY, way of doing this. Yet I cannot see, nor find it. (I am new to Sinatra :))
Somehow, outside of get, I don't have request.env available, or so it seems. So putting it in an include did not work.
How do you get your base url?
You can get it using request.base_url too =D (take a look at rack/request.rb)
A couple of things:
set is a class-level method, which means you are modifying the whole app's state with each request.
That is a problem because the base URL could differ between requests, e.g. http://foo.com and https://foo.com, or if you have multiple domains pointed at the same app server using DNS.
A better tactic might be to define a helper:
helpers do
  def base_url
    @base_url ||= "#{request.env['rack.url_scheme']}://#{request.env['HTTP_HOST']}"
  end
end
If you need the base URL outside of responding to requests (not in a get/post/put/delete block or a view), it would be better to set it manually somewhere.
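To see why the class-level setting is the risky part, here is a plain-Ruby stand-in with no Sinatra involved. FakeApp and its env hashes are invented for the example; the assumption being modelled is that each request is handled by its own app instance, while anything stored at class level is shared by all of them.

```ruby
# Plain-Ruby stand-in for an app class; no Sinatra involved.
class FakeApp
  class << self
    attr_accessor :base_url # class-level, like a value stored with `set`
  end

  def initialize(env)
    @env = env # each request gets its own instance and its own env
  end

  # Per-request helper, memoized on the instance only.
  def base_url
    @base_url ||= "#{@env['rack.url_scheme']}://#{@env['HTTP_HOST']}"
  end
end

http  = FakeApp.new({ 'rack.url_scheme' => 'http',  'HTTP_HOST' => 'foo.com' })
https = FakeApp.new({ 'rack.url_scheme' => 'https', 'HTTP_HOST' => 'foo.com' })

FakeApp.base_url = http.base_url # what `set` inside a route effectively does
puts FakeApp.base_url            # shared by every request until overwritten
puts https.base_url              # the helper still answers per request
```

The class-level value reflects whichever request wrote it last, whereas the instance-level helper always matches the request being served.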