I'm trying to make an accurate estimate of CDN usage on Google Cloud Platform but am not sure about the fill costs.
Fill costs are incurred when there is a cache miss and the data has to be fetched from the origin or from another cache. What's not specifically mentioned is how granular a "cache miss" is. That is: is it a cache miss for the region? The zone? The POP? The node?
With an international distribution this could make a huge difference in the estimate.
According to the official documentation, cache fill charges vary based on source and destination. Source is the region of the origin server or, in the case of cache-to-cache cache fill, the region of the source cache. Destination is a geographic area determined by client IP address.
I asked Google support directly on this one and got back that cache fills occur in each "cache site." Or as they put it:
Cache fill is counted for each caching site, since cache fill occurred from one cache location to another cache location.
The updated list of cache sites/locations is in their documentation.
At the time of writing that means a hypothetical maximum of 81 cache fills for a given result (not counting objects expiring or being evicted from the cache and re-filling, etc.), presuming your content is requested from each of these locations, since the cache is only filled when content is requested.
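For a rough back-of-the-envelope estimate you can multiply the object size by the number of cache sites it could be filled into. The site count, object size, and per-GB price below are illustrative placeholders, not Google's actual rates:

```python
# Rough worst-case cache-fill estimate: assume every cache site fills once per object.
# All numbers are illustrative placeholders, not actual GCP pricing.
num_cache_sites = 81        # hypothetical maximum from the cache locations list
object_size_gb = 0.5        # size of the cached object in GB (assumption)
fill_price_per_gb = 0.04    # placeholder cache-fill price in USD per GB

worst_case_fill_gb = num_cache_sites * object_size_gb
worst_case_cost = worst_case_fill_gb * fill_price_per_gb
print(f"Worst-case fill: {worst_case_fill_gb:.1f} GB, roughly ${worst_case_cost:.2f}")
```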
I'm trying to come up with a somewhat simple solution to do a two-week cache on millions of pages of content. The site in question hits MySQL every time a page is requested, and some of the pages with more complex queries take 2-3 seconds to load. My end goal is to get load times under a second. I was thinking about using Memcached, but I would like to avoid that approach if possible. I would basically prefer a solution that crawls all the pages in question and automatically creates a fresh cache every two weeks. I'm open to all approaches, including using a service.
Memcached or Redis are perfectly viable solutions to store "projected formats of data" that would require lots of JOINs, GROUP BYs, or ORDER BYs in MySQL.
However, even when reading from caches like Memcached or Redis, your application code still has to run. At high scale, with large amounts of data, the PHP runtime and your webserver can become a bottleneck.
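As a minimal sketch of caching a "projected format", assuming the redis-py client and a hypothetical load_report_from_mysql() helper standing in for the expensive query:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_report(report_id, ttl=1209600):  # two weeks in seconds
    """Return a pre-joined/aggregated result, served from Redis when possible."""
    key = f"report:{report_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit: no MySQL work at all
    data = load_report_from_mysql(report_id)   # hypothetical slow-query helper
    r.setex(key, ttl, json.dumps(data))        # store the projected format
    return data
```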
Varnish to the rescue
You did mention the term pages, which implies you're actually trying to cache full pages instead of just data sets. In that case I would advise you to have a look at Varnish.
Varnish is a reverse caching proxy that is purpose-built to cache pages at enormous scale. You can use a crawler to warm up the cache, and you can leverage Cache-Control headers to control the time to live (TTL) of objects in the cache.
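A warm-up crawler can be as simple as fetching every page through Varnish after a purge or restart. Here's a minimal sketch using the requests library; the host name and URL list are assumptions:

```python
import requests

def warm_cache(paths, varnish_host="http://www.example.com"):
    """Fetch each page through Varnish so it is cached before real traffic hits it."""
    with requests.Session() as session:
        for path in paths:
            resp = session.get(varnish_host + path, timeout=10)
            # The Age header tells you whether the object was already cached.
            print(path, resp.status_code, resp.headers.get("Age", "0"))

# e.g. warm_cache(["/", "/products/1", "/products/2"])
```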
Here's an example that sets the TTL for an HTTP response to 2 weeks:
Cache-Control: public, s-maxage=1209600
You can also set the TTL much higher, and then invalidate specific objects in the cache by purging them.
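Purging is typically done with an HTTP PURGE request, which your VCL must be configured to accept from trusted clients. A hedged sketch, assuming such a VCL rule is in place:

```python
import requests

def purge(path, varnish_host="http://www.example.com"):
    """Invalidate one object in Varnish; assumes the VCL allows the PURGE method."""
    resp = requests.request("PURGE", varnish_host + path)
    resp.raise_for_status()
    return resp.status_code

# e.g. purge("/products/1") after the underlying content changes
```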
Caching millions of objects
Varnish is perfectly able to cache millions of objects, maybe even billions. The feasibility primarily depends on the size of your HTTP responses, and the amount of memory your system has.
By default Varnish stores its objects in memory, and the amount of memory that is allocated is a configurable parameter. You can easily allocate 80% of your system's memory to the Varnish process. The overhead of storing an object in cache is roughly 1 KB per object.
If your cached objects are just plain text, there should be no issue. If it's binary data (e.g.: images), then you can run out of memory quite quickly.
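To get a feel for the numbers, here is a back-of-the-envelope sizing calculation; the page count, average response size, and the roughly 1 KB per-object overhead are all assumptions:

```python
# Illustrative memory sizing for an in-memory Varnish cache.
num_objects = 5_000_000    # pages you expect to keep cached (assumption)
avg_body_kb = 40           # average HTML response size in KB (assumption)
overhead_kb = 1            # rough per-object overhead

total_gb = num_objects * (avg_body_kb + overhead_kb) / 1024 / 1024
print(f"~{total_gb:.0f} GB of cache storage needed")   # about 196 GB
```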
Running out of memory is not disastrous: an LRU mechanism will ensure that when the cache is full, the Least Recently Used objects are removed to clear space.
Conclusion
Varnish has become the de facto standard for page caching. The user guide on the website is a great resource to quickly learn how to set up and configure Varnish.
Assume some distributed CRUD service that uses a distributed cache which is not read-through (just some key-value store, agnostic of the DB). So there are n server nodes connected to m cache nodes (with round-robin routing). The cache is supposed to cache data stored in a DB layer.
So the default retrieval sequence seems to be:
check if data is in cache, if so return data
else fetch from DB
send data to cache (cache does eviction)
return data
The question is whether the individual service nodes can be smarter about what data to send to the cache, to reduce cache capacity costs (achieve similar hit ratio with less required cache storage space).
Given recent benchmarks on optimal eviction/admission strategies (in particular LFU), some newer caches might not even store data if it is deemed too infrequently used; maybe the application nodes can make a similar best-effort guess.
So my idea is that the individual service nodes could evaluate whether data fetched from the DB should be sent to the distributed cache or not, based on an algorithm like LFU, thus reducing the network traffic between service and cache. I am thinking of local checks (which suffer from reduced effectiveness on cold startups), but checks against a shared list of cached keys could also be considered.
So the sequence would be:
check if data is in cache, if so return data
else fetch from DB
check if data key is frequently used
if yes, send data to cache (cache does eviction). Else not.
return data
Is this possible, reasonable, has it already been done?
It is common in databases, search, and analytical products to guard their LRU caches with filters to avoid pollution caused by scans. For example see Postgres' Buffer Ring Replacement Strategy and ElasticSearch's filter cache. These are admission policies detached from the cache itself, which could be replaced if their caching algorithm was more intelligent. It sounds like your idea is similar, except a distributed version.
Most remote / distributed caches use classic eviction policies (LRU, LFU). That is okay because they are often excessively large, e.g. Twitter requires a 99.9% hit rate for their SLA targets. This means they likely won't drop recent items: the miss penalty is too high, so the cache is oversized to the point that the eviction victim is ancient.
However, that breaks down when batch jobs run and pollute the remote caching tier. In those cases, it's not uncommon to see cache population disabled to avoid impacting user requests. This is then a distributed variant of Postgres' problem described above.
The largest drawback of your idea is checking the item's popularity. This might be local only, which has a frequent cold-start problem, or a remote call, which adds a network hop. That remote call would be cheaper than the traffic of shipping the item, but you are unlikely to be bandwidth limited. More likely your goal would be to reduce capacity costs through a higher hit rate, but if your SLA requires a nearly perfect hit rate then you'll over-provision anyway. It all depends on whether the gains from reducing cache-aside population operations are worth the implementation effort. I suspect that for most they haven't been.
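As a minimal sketch of local, best-effort admission in a cache-aside flow, assuming a Redis-like remote cache client and a hypothetical fetch_from_db() helper. The frequency counter is purely local, so it has exactly the cold-start weakness mentioned above:

```python
from collections import Counter
import redis

remote_cache = redis.Redis(host="cache-node", port=6379)
local_freq = Counter()      # per-service-node popularity estimate
ADMIT_THRESHOLD = 3         # only populate the remote cache after N local misses

def get(key):
    cached = remote_cache.get(key)
    if cached is not None:
        return cached                        # remote cache hit
    value = fetch_from_db(key)               # hypothetical DB read
    local_freq[key] += 1
    if local_freq[key] >= ADMIT_THRESHOLD:   # best-effort admission check
        remote_cache.set(key, value)         # only ship items that look popular
    return value
```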
I know that CloudFront caches data in edge locations after the first "miss", but is there a way to avoid that first miss by forcefully caching my content on all edge servers?
Even if there's no official AWS solution, any nice workaround would do. My current strategy is usually browsing the content through a VPN, but that's not very convenient.
Let's take the example of Twitter. There is a huge cache which gets updated frequently. For example: if a person Foo tweets and has followers all across the globe, ideally all the caches across all PoPs need to get updated, i.e. they should remain in sync.
How does replication across datacenters (PoPs) work for realtime caches?
What tools/technologies are preferred ?
What are potential issues here in this system design ?
I am not sure there is a right or wrong answer to this, but here are my two pennies' worth.
I would tackle the problem from a slightly different angle: when a user posts something, that something goes into distributed storage (not necessarily a cache) that is already redundant across multiple geographies. I would also presume that, in the interest of performance, these nodes are eventually consistent.
Now the caching. I would not design a system that takes care of synchronising all the caches each time someone does something. I would rather implement caching at the service level. Imagine a small service residing in a geographically distributed cluster. Each time a user tries to fetch data, the service checks its local cache - if it is a miss, it reads the tweets from the storage and puts a portion of them in a cache (subject to eviction policies). All subsequent accesses, if any, would be cached at a local level.
In terms of design precautions:
Carefully consider the DC / AZ topology in order to ensure sufficient bandwidth and low latency
Cache at the local level in order to avoid useless network trips
Cache updates don't happen from the centre to the periphery; cache is created when a cache miss happens
I am stating the obvious here, but implement the right eviction policies in order to keep only the right objects in cache
The only message that should go from the centre to the periphery is a cache flush broadcast (tell all the nodes to get rid of their cache)
I am certainly missing many other things here, but hopefully this is good food for thought.
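As a minimal sketch of the "cache locally, only broadcast flushes" idea, using Redis pub/sub as the broadcast channel. The channel name and the read_tweets_from_storage() helper are assumptions:

```python
import threading
import redis

broker = redis.Redis(host="localhost", port=6379)
local_cache = {}   # per-PoP, per-service-instance cache

def get_timeline(user_id):
    if user_id in local_cache:
        return local_cache[user_id]              # served locally, no sync needed
    data = read_tweets_from_storage(user_id)     # hypothetical storage read
    local_cache[user_id] = data                  # cache is filled on a miss
    return data

def listen_for_flush():
    # The only centre-to-periphery message: a cache flush broadcast.
    pubsub = broker.pubsub()
    pubsub.subscribe("cache-flush")
    for message in pubsub.listen():
        if message["type"] == "message":
            local_cache.clear()

threading.Thread(target=listen_for_flush, daemon=True).start()
```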
What is the maximum number of URLs (keys) that can be stored per cache in Cache Storage? By default Chrome only shows up to 49 URLs. Is 49 the maximum?
There isn't a hard maximum that I'm aware of. There are overall limits on the amount of storage space that a single origin has access to, and those limits apply to all forms of storage (Cache Storage API, IndexedDB, local storage, etc.), rather than applying to the number of keys in a single Cache instance.
Chrome's Cache Storage viewer in the Application panel does show only 50 entries per page, but you can navigate forwards and backwards to see additional entries beyond the first 50.