If you type cache:www.92spoons.com into the Google search engine, for example, it shows you a snapshot of the page from the time Google last crawled the site. I was just wondering, how often does Google refresh its cached data? It looks like, as of now, it was refreshed about 3 days ago. Also, does the cached data for all sites update at the same time?
This depends on how often the website changes. For example, Wikipedia may be updated several times a day, but 92spoons.com may only be updated every few days. (source)
It can also vary with the site's popularity. You can visit this website, which should allow you to refresh the cache. (source)
I am trying to display dashboards to every user who accesses my website, with an analysis of their previous data. Can I do it without the help of Google Analytics?
Well, if you do not want to use Google Analytics, then yes, you can do it.
A few steps you need to take (a rough sketch follows below):
Create a database table in which you save the page URL and the visitor's IP address.
Also try to record how many seconds or minutes the visitor stays on each page.
You can then generate reports from this data and show them on the dashboard.
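Here is one possible sketch of those steps, assuming a Node/Express server with a SQLite database; the table, column, and route names are purely illustrative and not from the original post.

```typescript
// Minimal self-rolled analytics sketch (Express + better-sqlite3 assumed).
import express from "express";
import Database from "better-sqlite3";

const db = new Database("analytics.db");
db.exec(`CREATE TABLE IF NOT EXISTS page_hits (
  url        TEXT,
  ip         TEXT,
  visited_at INTEGER,  -- unix ms of the request
  seconds    INTEGER   -- time on page, reported by a beacon (may be NULL)
)`);

const app = express();
app.use(express.json());

// Step 1: log the page URL and visitor IP for every request.
app.use((req, _res, next) => {
  db.prepare("INSERT INTO page_hits (url, ip, visited_at) VALUES (?, ?, ?)")
    .run(req.originalUrl, req.ip, Date.now());
  next();
});

// Step 2: the page reports how long the visitor stayed (e.g. via
// navigator.sendBeacon on unload); store it alongside the URL.
app.post("/beacon", (req, res) => {
  db.prepare("INSERT INTO page_hits (url, ip, visited_at, seconds) VALUES (?, ?, ?, ?)")
    .run(req.body.url, req.ip, Date.now(), req.body.seconds);
  res.sendStatus(204);
});

// Step 3: a simple dashboard query, e.g. visits per URL.
app.get("/dashboard/visits", (_req, res) => {
  res.json(db.prepare("SELECT url, COUNT(*) AS visits FROM page_hits GROUP BY url").all());
});

app.listen(3000);
```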
By the way, if we have a free-of-cost solution, then why are you spending time on it?
Paperclip is free and ethical: Website. Users can see all the data collected about them. I'm in their beta program, where you can suggest things to them and they'll add it.
I have a website forum where users exchange photos and text with one another on the home page. The home page shows the 20 latest objects, be they photos or text. The 21st object is pushed out of view. A new photo is uploaded every 5 seconds. A new text string is posted every second. In around 20 seconds, a photo that appeared at the top has disappeared at the bottom.
My question is: would I get a performance improvement if I introduced a CDN in the mix?
Since the content is changing, it seems I shouldn't be doing it. However, when I think about it logically, it does seem I'll get a performance improvement from introducing a CDN for my photos. Here's how. Imagine a photo is posted, appearing on the page at t=1 and remaining there till t=20. The first person to access the page (close to t=1) will cause the photo to be pulled to an edge server. Thereafter, anyone accessing the photo will receive it from the CDN; this will last till t=20, after which the photo disappears. This is a veritable performance boost.
Can anyone comment on the flaws in my reasoning, and/or what I am failing to consider? It would also be good to know what alternative performance optimizations I can make for a website like mine. Thanks in advance.
You've got it right. As long as someone accesses the photo within the 20 seconds that the image is within view it will be pulled to an edge server. Then upon subsequent requests, other visitors will receive a cached response from the nearest edge server.
As long as you're using the CDN for delivering just your static assets, there should be no issues with your setup.
Additionally, you may want to look into a CDN which supports HTTP/2. This will provide you with improved performance. Check out cdncomparison.com for a comparison between popular CDN providers.
You need to consider all requests hitting your server, which includes the primary dynamically generated HTML document, but also all static assets like CSS files, Javascript files and, yes, image files (both static and user uploaded content). An HTML document will reference several other assets, each of which needs to be downloaded separately and thus incurs a server hit. Assuming for the sake of argument that each visitor has an empty local cache, a single page load may incur, say, ~50 resource hits for your server.
Probably the only request which actually needs to be handled by your server is the dynamically generated HTML document, if it's specific to the user (because they're logged in). All other 49 resource requests are identical for all visitors and can easily be shunted off to a CDN. Those will just hit your server once [per region], and then be cached by the CDN and rarely bother your server again. You can even have the CDN cache public HTML documents, e.g. for non-logged in users, you can let the CDN cache HTML documents for ~5 seconds, depending on how up-to-date you want your site to appear; so the CDN can handle an entire browsing session without hitting your server at all.
If you have roughly one new upload per second, that means there is likely a magnitude more passive visitors per second. If you can let a CDN handle ~99% of requests, that's a dramatic reduction in actual hits to your server. If you are clever with what you cache and for how long and depending on your particular mix of anonymous and authenticated users, you can easily reduce server loads by a magnitude or two. On the other side, you're speeding up page load times accordingly for your visitors.
For every single HTML document and other asset, really think whether this can be cached and for how long:
For HTML documents, is the user logged in? If no, and there's no other specific cookie tracking or similar things going on, then the asset is static and public for all intents and purposes and can be cached. Decide on a maximum age for the document and let the CDN cache it. Even caching it for just a second makes a giant difference when you get 1000 hits per second.
If the user is logged in, set the Cache-Control header to private, but still let the visitor's browser cache the document for a few seconds. These headers must be decided upon by your forum software while it is generating the document.
For all other assets which aren't access restricted: let the CDN cache them for a long time, and you can practically forget about ever having to serve those particular files again. These headers can be statically configured for entire directories in the web server (see the header sketch below).
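A rough sketch of the two caching policies described above, using Express for illustration; the framework choice, the max-age values, and the isLoggedIn / renderForumPage helpers are assumptions, not anything prescribed by the answer.

```typescript
import express, { Request } from "express";

const app = express();

// Placeholder helpers so the sketch is self-contained (both are invented).
const isLoggedIn = (req: Request) => Boolean(req.headers.cookie?.includes("session="));
const renderForumPage = (_req: Request) => "<html>...forum front page...</html>";

app.get("/forum", (req, res) => {
  if (isLoggedIn(req)) {
    // Personalised HTML: only the visitor's own browser may cache it, briefly.
    res.set("Cache-Control", "private, max-age=5");
  } else {
    // Public HTML: s-maxage is what the CDN honours; even a few seconds of
    // shared caching absorbs most of the anonymous traffic.
    res.set("Cache-Control", "public, max-age=5, s-maxage=5");
  }
  res.send(renderForumPage(req));
});

// Static assets (CSS, JS, uploaded photos): cache for a very long time.
app.use("/static", express.static("public", { maxAge: "365d", immutable: true }));

app.listen(3000);
```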
I want to use the Google Cache to visit pages of other websites without going to those sites directly.
If I fire a query like http://webcache.googleusercontent.com/search?q=cache:<URL without SCHEME>, I can get the data.
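For example, a small sketch of that query (Node 18+ with global fetch assumed; the target URL is just an example, and Google may refuse automated requests, so treat this as illustrative only):

```typescript
const target = "www.example.com/some/page";          // URL without the scheme
const cacheUrl = `http://webcache.googleusercontent.com/search?q=cache:${target}`;

const response = await fetch(cacheUrl);
console.log(response.status);                        // 200 if a cached copy exists
console.log((await response.text()).slice(0, 200));  // beginning of the cached HTML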
I found/assume the following things (Ques 0: please correct me if any of them are wrong):
Google may or may not have cached information depending on the site's policy.
Google will go to the website anyway if any JavaScript has to be run.
Google just stores first 101 KB of the text.
Ques 1. I know Google cache only shows the recently crawled page but any idea of how old this data could be?
Ques 2. Is there any issue if I plan to go to Google cache for all the hits I make to that website (assuming that the website is cached and I am fine with little old page)?
Ques 3. Wayback Machine provides the data but it has huge delay between crawling and showing that data. Is there any directory where we can get recently archived data (like Wayback machine and Google cache)?
I know Google cache only shows the recently crawled page but any idea of how old this data could be?
Use the cache: operator in the URL
Is there any issue if I plan to go to Google cache for all the hits I make to that website (assuming that the website is cached and I am fine with little old page)?
Owners may request removal of content from the cache
Is there any directory where we can get recently archived data?
Use the tbs=qdr: query parameter in the URL
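A sketch of the two operators mentioned above; the qdr values shown (h = hour, d = day, w = week, m = month, y = year) are the commonly documented ones and are not guaranteed by Google.

```typescript
const site = "www.example.com";

// cache: operator, asking Google for its cached copy of a specific page.
const cacheQuery = `https://www.google.com/search?q=cache:${site}`;

// tbs=qdr: parameter, restricting results to pages indexed recently,
// e.g. qdr:d for roughly the past day.
const recentQuery = `https://www.google.com/search?q=site:${site}&tbs=qdr:d`;

console.log(cacheQuery, recentQuery);
```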
For Question 3, while it used to be the case that all Wayback Machine web captures were 6 months old, that was already becoming untrue in 2012, and is very untrue now in 2016. We have a ton of fresh content.
Looking at a lot of web applications (websites/services/whatever) that have a 'streaming' component (typically this is a 'Social' app): Think: Facebook's 'Wall', Twitter 'Feed', LinkedIn's 'News Feed'.
They have a pretty similar characteristic: a notice of new items is added to the page (automatically, presumably via a background Ajax call), but the new HTML representing the newest feed items isn't loaded onto the page until the user clicks this update link.
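A rough client-side sketch of that pattern, for concreteness; the endpoint names, element ids, and poll interval are invented for illustration and not taken from any of these sites.

```typescript
let newestSeenId = 0;
const notice = document.querySelector<HTMLElement>("#new-items-notice")!;
const feed = document.querySelector<HTMLElement>("#feed")!;

// Cheap background poll: only ask "how many new items?" every few seconds.
setInterval(async () => {
  const res = await fetch(`/feed/count-since/${newestSeenId}`);
  const { count } = await res.json();
  if (count > 0) notice.textContent = `${count} new items, click to load`;
}, 10_000);

// The heavier HTML payload is only fetched when the user actually clicks.
notice.addEventListener("click", async () => {
  const res = await fetch(`/feed/items-since/${newestSeenId}`);
  feed.insertAdjacentHTML("afterbegin", await res.text());
  // In a real app, newestSeenId would be updated from the response here.
});
```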
I guess I'm curious whether this design decision is for any of the following reasons, and if so, could anyone who has worked on one of these types of apps explain the reasoning they found for doing it this way:
User experience: updates for a large number of 'Facebook Friends' or 'Pages' or 'Tweets' would move too quickly for one to absorb and read with any real intent, so the page isn't refreshed automatically.
Client-side performance: fetching a simple 'count' of updates requires less bandwidth (less load time) and less JS running to update the page for anyone who has the site open, and thus a lighter-weight feel on the client side.
Server-side performance: fewer requests coming into the server to gather more information about recent updates (less outgoing bandwidth, more free cycles to be grabbing information for those who do request it by clicking the link). While I'm sure the owners of these websites aren't 'short on resources', if everyone who had Twitter or Facebook open in the browser got a full update fetched from the server every time one was created, I'm sure it would be a much more significant drag on resources.
They are actually trying to save resources (it takes a cup of coffee to perform a Google search (haha)), and sending a few bytes of data to the page representing the count of new updates is a lot lighter of a load on applications that are being used simultaneously in hundreds and thousands of browser windows (not to mention API requests).
I have a few more questions depending on the answer to this first question as well...so I'll probably add those here or ask another question!!
Thanks
P.S. This question got trolled off of the 'Web Applications' site -- so I brought my questions here, where they're not too 'broad' or 'off-topic' (-8
Until the recent UI changes to Facebook, they did auto-load new content. It was extremely frustrating from a user perspective, as you'd be reading through the list of your friends' posts and all of a sudden everything would shift and you'd have no idea where the post you were just reading went.
I'd imagine this is the main reason.
Situation: Google has indexed a page in a forum. The thread has now been deleted. How, if at all, can I make Google and other search engines delete the cached copy? I doubt they would have anything against that, since the linked page does not exist anymore and keeping the index updated and valid should be in their best interests.
Is this possible or do I have to wait months for an index update? Or will the page now stay there forever?
I am not the owner of the respective site so I can't change robots.txt for example. I would like to force the update as the "third party".
I also noticed that a new page I created on that site two days ago is already in the cache. Given that, can I estimate how long it will take for an invalid page on this domain to be dropped?
EDIT: So I did the test. It took Google just under 2 months to drop the page. Quite a long time...
It's damn near impossible to get it removed; however, replacing the page with entirely blank content will ensure that you nuke the page's ranking when it is re-spidered.
You can't really make Google delete anything, except perhaps in extreme circumstances. You can adjust your robots.txt file to promote a revisit interval that might update things sooner, but if it is a low traffic site, you might not get a revisit very soon.
EDIT:
Since you are not the site owner, you would need the owner to modify the meta tags on the page with "revisit-after" tags, as discussed here.
You can't make search engines remove the link, but don't worry: the link will be removed soon, since it is no longer active. You need not wait months for this to happen.
If your site is registered with Google Webmaster Tools, you can request removal of pages from the index. It works; I tried and used it in the past.
EDIT: Since you are not the owner, I am afraid that this solution would not work.