Website does not allow continuous refresh? - caching

There's a website I would like to access every few seconds to get updated information. However, I realized that it "catches" that I want to do that and "blocks" my reloads for about 5 minutes: if I refresh it in my browser, I don't get updated information.
If I access the website from a different computer, or my phone, or via the terminal (with Python's urllib.request), it does show updated information (but again, only once; after each try I have to wait 5 minutes per device). This is what makes me think they have a system that prevents continuous refreshes from the same address (I guess to improve performance)?
My first question is... is this a thing? If so, what is it called?
My second question then is... is there a way to trick it into letting me look at updated content every few seconds?
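For reference, a minimal sketch of the kind of polling described above, using urllib.request. The URL, the 5-second interval, and the no-cache request headers are illustrative assumptions; whether those headers make any difference depends entirely on how the site's cache or rate limiter is configured.

import time
import urllib.request

URL = "https://example.com/updated-info"  # placeholder URL

def fetch_once(url: str) -> bytes:
    # Ask intermediaries not to serve a cached copy; a server-side rate
    # limiter can still ignore this and return stale content anyway.
    req = urllib.request.Request(
        url, headers={"Cache-Control": "no-cache", "Pragma": "no-cache"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

if __name__ == "__main__":
    while True:
        body = fetch_once(URL)
        print(f"fetched {len(body)} bytes")
        time.sleep(5)  # poll every few seconds, as in the question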

Related

Scrapy persistent cache

We need to be able to re-crawl historical data. Imagine today is the 23rd of June. We crawl a website today, but after a few days we realize we have to re-crawl it, "seeing" it exactly as it was on the 23rd. That means including all possible redirects, GET and POST requests, etc. ALL the pages the spider sees should be exactly as they were on the 23rd, no matter what.
Use-case: if there is a change in the website, and our spider is unable to crawl something, we want to be able to get back "in the past" and re-run the spider after we fix it.
Generally, this should be quite easy - subclass the standard Scrapy cache, force it to use dates for subfolders, and have something like this:
cache/spider_name/2015-06-23/HERE ARE THE CACHED DIRS
But when I was experimenting with this, I realized the spider sometimes crawls the live website. That is, it doesn't take some pages from the cache (even though the appropriate files exist on disk) but instead fetches them from the live website. This happened with pages with captchas in particular, but possibly with some others as well.
How can we force Scrapy to always take the page from the cache, not hitting the live website at all? Ideally, it should even work with no internet connection.
Update: we've used the Dummy policy and HTTPCACHE_EXPIRATION_SECS = 0
Thank you!
To do exactly what you want, you should add this to your settings:
HTTPCACHE_IGNORE_MISSING = True
With this enabled, requests not found in the cache will be ignored instead of downloaded.
When you set:
HTTPCACHE_EXPIRATION_SECS = 0
it only ensures that cached requests never expire; if a page isn't in your cache, it will still be downloaded.
You can check the documentation.
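Putting the pieces together, a sketch of the relevant settings.py entries, assuming a Scrapy version where the cache classes live under scrapy.extensions.httpcache. The dated HTTPCACHE_DIR line is only one way to approximate the per-date layout from the question, and it puts the date above the spider name rather than below it.

from datetime import date

# settings.py -- serve every request from the on-disk cache and never hit
# the live website for pages that are already cached
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'       # cache every response
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_EXPIRATION_SECS = 0      # cached responses never expire
HTTPCACHE_IGNORE_MISSING = True    # ignore (don't download) pages missing from the cache

# One crawl, one dated cache tree, e.g. httpcache/2015-06-23/spider_name/...
HTTPCACHE_DIR = 'httpcache/' + date.today().isoformat()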

2 instances of 1 image on a page

I am not really sure how to google this, so I thought I could ask here.
I have the same image posted on a page twice. Will that slow down the load time, or will it remain the same since I am using the same resource?
The browser should be able to cache the image the first time it is requested from the source server; all of the popular browsers implement this. It should not have to load the image twice: it loads it once, on the initial request to the server, and then uses the cached version the second time.
This also assumes the end user has browser caching enabled. If it is turned off (even if the browser supports it), the browser will make that extra request for the image, since there is no cache to pull from.

Implement real-time updating notification feature

I'd like to implement a visual indicator in various sections of my app for items whose status is pending, similar to Facebook's / Google Plus's unread notification indicator. I have written an API to fetch the count to be displayed, but I am stuck on updating it every time an item gets added or deleted. I can think of two approaches, neither of which I am satisfied with: the first is making an API call for the count whenever a POST or DELETE operation is performed; the second is refreshing the page after some time span.
I think there should be a much better way of doing this from the server side. Any suggestions, or any gem to do so?
Even in Gmail the count is refreshed on a client request. The server calculates the number of new items, and the client initiates a request (probably with AJAX). This requires an almost negligible amount of data and processing time, so you can probably get away with it. Various cache gems can even store the refreshed page fragment if no data has changed since the last request, which also takes care of only recalculating when something has changed.
UPDATE:
You can solve the problem in basically two ways: a server-side push, or a client-side query. Push is problematic for various reasons and rarely used in a web environment, at least as far as I know. Most pages (if not all) use a timed query to refresh this kind of information. You can check this with the right tool, like Firebug for Firefox: you can see the individual requests initiated towards the server.
When you fire a request through AJAX, the server replies. Normally it generates a page fragment to replace the old content with the new, but a cache mechanism can intervene, and if nothing has changed, you may get the previously stored cached fragment. See some tutorials here, for various gems; one of them may fit your needs.
If you would prefer a complete solution, check Faye (tutorial here). I haven't used it, but it may be worth a try; it seems simple enough.
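The timed-query idea is language-agnostic; below is a rough sketch of the client side of it in Python rather than in the Rails stack used in the question. The /notifications/count endpoint, the 30-second interval, and the JSON shape are all illustrative assumptions.

import json
import time
import urllib.request

COUNT_URL = "https://example.com/notifications/count"  # hypothetical endpoint

def fetch_count(url: str) -> int:
    # The endpoint is assumed to return something like {"count": 7}.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["count"]

last_count = None
while True:
    count = fetch_count(COUNT_URL)
    if count != last_count:
        # Only react (e.g. update the badge) when the value actually changed.
        print(f"pending items: {count}")
        last_count = count
    time.sleep(30)  # timed query, as described above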

Kohana execution time is fast, but overall response time is slow, why?

I use Kohana 3's Profiler class and its profiler/stats template to time my website. On a very clean page (no AJAX, no jQuery, etc.; it only loads a template and shows a text message, with no database access), it shows a request time of 0.070682 s (the "Requests" item in the "profiler/stats" template). I then used two microtime() calls to time the duration from the first line of index.php to the last line of index.php, and it also shows a very fast result (0.12622809410095 s). Very nice.
But if I time the request from the browser's point of view, it's totally different. Using Firefox with the Tamper Data add-on, it shows the duration of the request as 3.345 s! And I noticed that from the moment I click the link to enter the website (Firefox starts the animated loading icon) to when the browser finishes its work (the icon animation stops), it really takes 3-4 seconds!
On another website of mine, built with WikkaWiki, the time measured by Tamper Data is only 2190-2432 ms, including several accesses to a MySQL database.
I tried a clean installation of Kohana, and the default plain hello-world page also takes 3025 ms to load.
All the websites I mentioned here are tested on the same localhost PC with the same settings; they are just hosted in different directories on the same machine. Only the Database module is enabled in bootstrap.php for the Kohana website.
I'm wondering why the Kohana website's overall response is so slow while the PHP execution time is just 0.126 seconds. Is there anything I should look into?
== Edit for additional information ==
The test result for a standard phpinfo() page is 1100-1200 ms (Tamper Data).
The Profiler shows you the execution time from Kohana initialization to the Profiler render call, so it's not the full Kohana time. Some kinds of actions (Kohana::shutdown_handler(), Session::_destroy(), etc.) may take a long time.
Since your post confirms Kohana is finishing in a tenth of a second or less, it's probably something else:
Have you tested something other than Kohana? It sounds like the server is at fault, but you can't be sure unless you compare the response times with something else. Try a plain HTML page and a pure PHP page.
The Firefox profiler could be taking external media into account. So if you have a slow connection and you load Google Analytics, that could be another problem.
Maybe there is something related to this issue: Firefox and Chrome slow on localhost; known fix doesn't work on Windows 7
Although that issue happens on Windows 7, maybe it can help...
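One way to make the comparison suggested above is to time the full round trip from a simple client, independently of the in-page profiler. A minimal sketch in Python, with the URL and the number of requests as placeholders to adjust for your setup.

import time
import urllib.request

URL = "http://localhost/kohana-test/"  # placeholder: point this at the page under test

def time_request(url: str) -> float:
    # Measure the full round trip (connection + server work + transfer),
    # which is closer to what the browser reports than the in-page profiler.
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    return time.monotonic() - start

timings = [time_request(URL) for _ in range(5)]
print("round-trip seconds:", ["%.3f" % t for t in timings])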

How to know the number of users with images turned off in their browser?

I'm working on a quite popular website, which looks good if the user has the "Load images" option turned on in the browser settings.
When you open the website with images turned off, it becomes unusable: many components won't work, because users won't see "important" buttons (we don't use standard OS buttons).
So we can't understand or measure the negative business impact of this mistake (absent alt/title attributes).
We can't set a priority for this task, because we don't know how many such users come to our website.
Please give me some advice on how this problem can be solved.
Look in the logs for how many hits you get on a page without the subsequent requests from the browser for the other images.
Of course the browser might have images cached, so look for the first time you get a hit.
You can even use the IP address for this, since it's OK if you throw out good data (that is, disregard some hits that are actually new). The question is just: of the hits you know are first-time, how many don't request the images?
If this is a public page (i.e. not a web application that you've logged in to), also disregard search engine bots to the greatest extent possible; they usually won't retrieve images.
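A rough sketch of the log analysis described above, in Python. The log path, the combined log format, and the path prefixes are assumptions about your setup, and the bot filtering here is only a crude user-agent check.

import re
from collections import defaultdict

LOG_PATH = "access.log"                   # assumed combined-format access log
PAGE_PATH = "/"                           # the HTML page being measured
IMAGE_PREFIX = "/images/"                 # where the page's images live
BOT_HINTS = ("bot", "spider", "crawler")  # crude user-agent filter

# Minimal combined-log pattern: client IP, request path, and user agent.
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
)

page_hits = set()               # IPs whose first sighting was the page itself
image_hits = defaultdict(bool)  # IP -> requested at least one image
seen = set()                    # IPs seen before (not first-time any more)

with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, path, agent = m.groups()
        if any(hint in agent.lower() for hint in BOT_HINTS):
            continue  # skip obvious search-engine bots
        if path.startswith(IMAGE_PREFIX):
            image_hits[ip] = True
        elif path == PAGE_PATH and ip not in seen:
            page_hits.add(ip)  # first-time page hit for this IP
        seen.add(ip)

without_images = sum(1 for ip in page_hits if not image_hits[ip])
print(f"first-time page hits: {len(page_hits)}, without image requests: {without_images}")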

Resources