Distributed cache with huge objects (~1-2GB)

I need to stream a huge dataset, around 1-2GB, but only on demand as users explore the data. For example, if they don't explore parts of the data, I don't want to send those parts out.
So now I have a solution that effectively returns JSON only for the things they need. The need for a cache arises because these 1-2GB objects are constructed in memory from a file or files on disk, so the latency is ~30 seconds if you have to read the files to return the data.
How do I manage such a cache? Basically I think the solution is something like ZooKeeper, where I store the name of the physical machine that holds the cache and then forward my REST request to it.
Would you also consider this to be the right model? I wonder what kind of checks I will have to do so that, if the node that holds the cache goes down, I can still fulfil the request without an error, just with higher latency.
Has anybody developed such a system? All the solutions out there seem to target small rows or objects.

https://github.com/golang/groupcache is used for bigger things, but although it's used by http://dl.google.com, I'm not sure how it would do with multi-gigabyte objects.
On the other hand, HTTP can do partial transfers and will be very efficient at that. Take a look at Varnish or Nginx.
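To illustrate the partial-transfer idea, here is a minimal sketch (assumptions: a single `bytes=start-end` range and a blob already in memory; real servers also handle suffix and multi-part ranges) of slicing a cached blob according to an HTTP Range header:

```python
def slice_range(blob: bytes, range_header: str) -> bytes:
    """Return the slice requested by a 'bytes=start-end' Range header value.

    Toy sketch: handles a single 'bytes=start-end' range only, not
    suffix ranges ('bytes=-500') or multiple ranges.
    """
    unit, _, spec = range_header.partition("=")
    if unit != "bytes":
        raise ValueError("only byte ranges supported in this sketch")
    start_s, _, end_s = spec.partition("-")
    start = int(start_s)
    end = int(end_s) if end_s else len(blob) - 1  # open-ended range
    return blob[start:end + 1]

# A client exploring the data asks only for the slice it needs:
chunk = slice_range(b"0123456789", "bytes=2-5")  # b"2345"
```

This is why a cache that speaks HTTP (Varnish, Nginx) can serve a 2GB object efficiently: clients that only explore part of the data only ever pull the ranges they touch.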

Related

Best way to construct a cache key whose uniqueness is defined by 6 properties

Currently I am tasked with fixing the cache for an e-commerce-like system whose prices depend on many factors. The cache backend is Redis. For a given product, the factors that influence the price are:
sku
channel
sub channel
plan
date
Currently the cache is structured like this in redis:
product1_channel1_subchannel1: {sku_1: {plan1: {2019-03-18: 2000}}}
The API caters to requests for multiple products, SKUs and all the factors above. So they decided to query all the data at the product_channel_subchannel level and filter it in the app, which is very slow. They also decided that, on a cache miss, they will construct the cache for all SKUs with 90 days of data. This way only one request faces the wrath while the others benefit from it (the catch is that now we are busting the cache more often, which is also dragging the system down).
The downside of including all these factors in the keys is that there will be too many keys. To ballpark it: there are 400 products, each made up of 20 SKUs, with 20 channels, 200 subchannels, 3 types of plans and 400 days of pricing. To avoid that many keys, at some point we must group the data.
The system currently receives about 10 rps and has to respond within 100ms.
Question is:
Is the above cache structure fine? Or how do we go about flattening this structure?
How are caches stored in pricing systems in general? I feel like this is a very trivial task; nonetheless I find it very hard to justify my approaches.
Is it okay to sacrifice one request to warm cache for bulk of the data? Or is it better to have a cache warming strategy?
Any sort of caching strategy will be an exercise in trade-offs. And the precise trade-offs you need to make will be dependent upon complex domain logic that you can't predict until you try it out.
What this means is that whatever you implement should be based on data and should be flexible enough to change over time as the business changes. In particular the answer to these questions:
Is it okay to sacrifice one request to warm cache for bulk of the data? Or is it better to have a cache warming strategy?
depend on how the data will be queried by your users and how long a cache miss will take. If queries tend to be clustered around certain skus, or certain dates in a predictable manner, then you should use that information to help guide cache hits and misses.
There is no way I, or anyone else, can give you a correct answer without doing proper experimentation, but we can give you some guidelines.
Here are some best practices that I would recommend when using redis for caching:
If the bottleneck is sending data from Redis to the API, then consider using Lua scripts to do the simple processing before any data leaves Redis. But be careful not to make the scripts too complex, since a long-running Lua script can block all other parts of Redis.
It looks like you are using simple get/set keys to store your data. Consider using something more complex:
a. use sorted sets (zsets) if you want to have better access to data by date (use the date as the score).
b. use hash sets to get more fine-grained access to skus
Based on your question, it looks like you will have about 1.6M keys. This is not a huge amount, but you need to make sure that Redis has enough memory to store everything in RAM without swapping anything to disk. This is something we had to learn the hard way. If you are running your Redis instance on Linux, you must set the system's swappiness to 0 to ensure swap is never used.
But, most importantly, you need to experiment with everything until you find a good solution.
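The hash-set suggestion above can be sketched concretely: one hash per product/channel/subchannel, with the remaining factors encoded in the field name. In this minimal sketch the key layout is illustrative and a plain dict stands in for Redis HSET/HGET, so it runs self-contained:

```python
from datetime import date

def price_key(product: str, channel: str, subchannel: str) -> str:
    """One Redis hash per product/channel/subchannel."""
    return f"price:{product}:{channel}:{subchannel}"

def price_field(sku: str, plan: str, day: date) -> str:
    """Hash field encoding the remaining pricing factors."""
    return f"{sku}:{plan}:{day.isoformat()}"

# Stand-in for Redis: store[key][field] = price in cents (HSET/HGET).
store = {}

def set_price(product, channel, subchannel, sku, plan, day, cents):
    store.setdefault(price_key(product, channel, subchannel), {})[
        price_field(sku, plan, day)] = cents

def get_price(product, channel, subchannel, sku, plan, day):
    return store.get(price_key(product, channel, subchannel), {}).get(
        price_field(sku, plan, day))

set_price("product1", "channel1", "subchannel1",
          "sku_1", "plan1", date(2019, 3, 18), 2000)
```

With this layout a bulk request for one product/channel/subchannel is a single HGETALL (or HMGET for specific fields), and warming or busting the cache is scoped to one hash rather than the whole key space.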

Growing hash-of-queues beyond main memory limits

I have a cluster application, which is divided into a controller and a bunch of workers. The controller runs on a dedicated host, the workers phone in over the network and get handed jobs, so far so normal. (Basically the "divide-and-conquer pipeline" from the zeromq manual, with job-specific wrinkles. That's not important right now.)
The controller's core data structure is unordered_map<string, queue<string>> in pseudo-C++ (the controller is actually implemented in Python, but I am open to the possibility of rewriting it in something else). The strings in the queues define jobs, and the keys of the map are a categorization of the jobs. The controller is seeded with a set of jobs; when a worker starts up, the controller removes one string from one of the queues and hands it out as the worker's first job. The worker may crash during the run, in which case the job gets put back on the appropriate queue (there is an ancillary table of outstanding jobs). If it completes the job successfully, it will send back a list of new job-strings, which the controller will sort into the appropriate queues. Then it will pull another string off some queue and send it to the worker as its next job; usually, but not always, it will pick the same queue as the previous job for that worker.
Now, the question. This data structure currently sits entirely in main memory, which was fine for small-scale test runs, but at full scale is eating all available RAM on the controller, all by itself. And the controller has several other tasks to accomplish, so that's no good.
What approach should I take? So far, I have considered:
a) to convert this to a primarily-on-disk data structure. It could be cached in RAM to some extent for efficiency, but jobs take tens of seconds to complete, so it's okay if it's not that efficient,
b) using a relational database - e.g. SQLite, (but SQL schemas are a very poor fit AFAICT),
c) using a NoSQL database with persistency support, e.g. Redis (the data structure maps over trivially, but this still appears too RAM-centric for me to feel confident that the memory-hog problem will actually go away)
Concrete numbers: For a full-scale run, there will be between one and ten million keys in the hash, and less than 100 entries in each queue. String length varies wildly but is unlikely to be more than 250-ish bytes. So, a hypothetical (impossible) zero-overhead data structure would require 2^34 – 2^37 bytes of storage.
Ultimately, it all boils down to how you define the efficiency needed on the part of the controller -- e.g. response times, throughput, memory consumption, disk consumption, scalability... These properties are directly or indirectly related to:
number of requests the controller needs to handle per second (throughput)
acceptable response times
future growth expectations
From your options, here's how I'd evaluate each option:
a) to convert this to a primarily-on-disk data structure. It could be
cached in RAM to some extent for efficiency, but jobs take tens of
seconds to complete, so it's okay if it's not that efficient,
Given the current memory-hog problem, some form of persistent storage seems a reasonable choice. Caching comes into play if there is a repeatable access pattern, say the same queue is accessed over and over again -- otherwise, caching is likely not to help.
This option makes sense if 1) you cannot find a database that maps trivially to your data structure (unlikely), 2) for some other reason you want to have your own on-disk format, e.g. you find that converting to a database is too much overhead (again, unlikely).
One alternative to databases is to look at persistent queues (e.g. using a RabbitMQ backing store), but I'm not sure what the per-queue or overall size limits are.
b) using a relational database - e.g. SQLite, (but SQL schemas are a
very poor fit AFAICT),
As you mention, SQL is probably not a good fit for your requirements, even though you could surely map your data structure to a relational model somehow.
However, NoSQL databases like MongoDB or CouchDB seem much more appropriate. Either way, a database of some sort seems viable as long as they can meet your throughput requirement. Many if not most NoSQL databases are also a good choice from a scalability perspective, as they include support for sharding data across multiple machines.
c) using a NoSQL database with persistency support, e.g. Redis (data
structure maps over trivially, but this still appears very RAM-centric
to make me feel confident that the memory-hog problem will actually go
away)
An in-memory database like Redis doesn't solve the memory-hog problem, unless you set up a cluster of machines that each holds a part of the overall data. This makes sense only if keeping all data in memory is needed due to low response-time requirements. Yet, given the nature of your jobs, which take tens of seconds to complete, response times as seen by the workers hardly matter.
If you find, however, that response times do matter, Redis would be a good choice, as it handles partitioning trivially using either client-side consistent-hashing or at the cluster level, thus also supporting scalability scenarios.
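Client-side consistent hashing can be sketched as a toy hash ring (illustrative only; production clients use the same idea with more virtual nodes and handle node failures):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring mapping keys to nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node appears at `replicas` virtual points on the ring,
        # which evens out the key distribution across nodes.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(replicas))
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
```

Because only the keys near a removed node's ring points move, adding or removing one Redis instance remaps roughly 1/N of the keys instead of nearly all of them.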
In any case
Before you choose a solution, be sure to clarify your requirements. You mention you want an efficient solution. Since efficiency can only be gauged against some set of requirements, here's the list of questions I would try to answer first:
Requirements:
how many jobs are expected to complete, say per minute or per hour?
how many workers are needed to do so?
concluding from that:
what is the expected load in requests per second, and
what response times are expected on part of the controller (handing out jobs, receiving results)?
And looking into the future:
will the workload increase, i.e. does your solution need to scale up (more jobs per time unit, more data per job)?
will there be a need for persistency of jobs and results, e.g. for auditing purposes?
Again, concluding from that,
how will this influence the number of workers?
what effect will it have on the number of requests/second on part of the controller?
With these answers, you will find yourself in a better position to choose a solution.
I would look into a message queue like RabbitMQ. This way it will first fill up the RAM and then use the disk. I have up to 500,000,000 objects in queues on a single server and it's just plugging away.
RabbitMQ works on Windows and Linux and has simple connectors/SDKs to about any kind of language.
https://www.rabbitmq.com/

Serving millions of routes with good performance

I'm doing some research for a new project, for which the constraints and specifications have yet to be set. One thing that is wanted is a large number of paths, directly under the root domain. This could ramp up to millions of paths. The paths don't have a common structure or unique parts, so I have to look for exact matches.
Now I know it's more efficient to break up those paths, which would also help in the path lookup. However I'm researching the possibility here, so bear with me.
I'm evaluating methods to accomplish this, while maintaining excellent performance. I thought of the following methods:
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
Storing the paths in a key-value store like Redis. This would be a lot better, and perform quite well I think (have to benchmark it though).
Doing string/regex matching - like many frameworks do out of the box - for this number of possible matches is nuts, and thus not really an option. But I could see how some algorithm that compares letter by letter, combined with some smart optimizations, could work.
But maybe there are tools/methods I don't know about that are far more suited for this type of problem. I could use any tips on how to accomplish this though.
Oh and in case anyone is wondering, no this isn't homework.
UPDATE
I've tested the Redis approach. Based on two sets of keywords, I got 150 million paths. I've added each of them using the set command, with the value being a serialized string of id's I can use to identify the actual keywords in the request. (SET 'keyword1-keyword2' '<serialized_string>')
A quick test in a local VM with a data set of one million records returned promising results: benchmarking 1000 requests took 2ms on average. And this was on my laptop, which runs tons of other stuff.
Next I did a complete test on a VPS with 4 cores and 8GB of RAM, with the complete set of 150 million records. This yielded a database of 3.1G in file size, and around 9GB in memory. Since the database could not be loaded in memory entirely, Redis started swapping, which caused terrible results: around 100ms on average.
Obviously this will not work or scale nicely. Either each web server needs to have an enormous amount of RAM for this, or we'll have to use a dedicated Redis routing server. I've read an article from the engineers at Instagram, who came up with a trick to decrease the database size dramatically, but I haven't tried it yet. Either way, this does not seem like the right way to do it. Back to the drawing board.
Storing the paths in an SQL database and doing a lookup on every request. This seems like the worst option and will definitely not be used.
You're probably underestimating what a database can do. Can I invite you to reconsider your position there?
For Postgres (or MySQL w/ InnoDB), a million entries is a notch above tiny. Store the whole path in a field, add an index on it, vacuum, analyze. Don't do nutty joins until you've identified the ID of your key object, and you'll be fine in terms of lookup speeds. Say a few ms when running your query from psql.
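To make that concrete, a minimal sketch of the exact-match lookup, with SQLite standing in for Postgres/MySQL and illustrative table/column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# PRIMARY KEY on path gives us the unique index the lookup will use.
con.execute("CREATE TABLE routes (path TEXT PRIMARY KEY, object_id INTEGER)")
con.executemany(
    "INSERT INTO routes VALUES (?, ?)",
    ((f"/keyword{a}-keyword{b}", a * 1000 + b)   # 10k toy rows
     for a in range(100) for b in range(100)))

def lookup(path: str):
    """Exact-match route lookup: a single indexed point query."""
    row = con.execute("SELECT object_id FROM routes WHERE path = ?",
                      (path,)).fetchone()
    return row[0] if row else None
```

The query plan is a single B-tree probe, so with the index in RAM this stays in the low-millisecond range even at millions of rows.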
Your real issue will be the bottleneck related to disk IO if you get material amounts of traffic. The operating motto here is: the less, the better. Besides the basics such as installing APC on your PHP server, using Passenger if you're using Ruby, etc.:
Make sure the server has plenty of RAM to fit that index.
Cache a reference to the object related to each path in memcached.
If you can categorize all routes in a dozen or so regex, they might help by allowing the use of smaller, more targeted indexes that are easier to keep in memory. If not, just stick to storing the (possibly trailing-slashed) entire path and move on.
Worry about misses. If you've a non-canonical URL that redirects to a canonical one, store the redirect in memcached without any expiration date and begone with it.
Did I mention lots of RAM and memcached?
Oh, and don't overrate that ORM you're using, either. Chances are it's taking more time to build your query than your data store is taking to parse, retrieve and return the results.
RAM... Memcached...
To be very honest, Redis isn't so different from an SQL + memcached option, except when it comes to memory management (as you found out), sharding, replication, and syntax. And familiarity, of course.
Your key decision point (besides excluding iterating over more than a few regexes) ought to be how your data is structured. If it's highly structured with critical needs for atomicity, SQL + memcached ought to be your preferred option. If you've got custom fields all over and obese EAV tables, then playing with Redis or CouchDB or another NoSQL store ought to be on your radar.
In either case, it'll help to have lots of RAM to keep those indexes in memory, and a memcached cluster in front of the whole thing will never hurt if you need to scale.
Redis is your best bet, I think. SQL would be slow, and regular expressions are, in my experience, always painfully slow in queries.
I would do the following steps to test Redis:
Fire up a Redis instance either with a local VM or in the cloud on something like EC2.
Download a dictionary or two and pump this data into Redis. For example, something from here: http://wordlist.sourceforge.net/. Make sure you normalize the data, for example always lower-casing the strings and removing whitespace at the start/end of the string, etc.
I would ignore the hash. I don't see the reason you need to hash the URL? It would be impossible to read later if you wanted to debug things and it doesn't seem to "buy" you anything. I went to http://www.sha1-online.com/, and entered ryan and got ea3cd978650417470535f3a4725b6b5042a6ab59 as the hash. The original text would be much smaller to put in RAM which will help Redis. Obviously for longer paths, the hash would be better, but your examples were very small. =)
Write a tool to read from Redis and see how well it performs.
Profit!
Keep in mind that Redis needs to keep the entire data set in RAM, so plan accordingly.
I would suggest using some kind of key-value store (i.e. a hashing store), possibly along with hashing the key so it is shorter (something like SHA-1 would be OK IMHO).

Duplicates from a stream

We have an external service that continuously sends us data. For the sake of simplicity, let's say this data has three strings in tab-delimited fashion.
datapointA datapointB datapointC
This data is received by one of our servers and then is forwarded to a processing engine where something meaningful is done with this dataset.
One of the requirements of the processing engine is that duplicate records will not be processed. So for instance on day 1 the processing engine received A B C, and on day 243 the same A B C was received by the server. In this situation, the processing engine will spit out a warning, "record already processed", and not process that particular record.
There may be a few ways to solve this issue:
Store the incoming data in an in-memory HashSet; set exclusion will indicate the processing status of the particular record. Problems will arise when we run this service with zero downtime: depending on the surge of data, this collection can exceed the bounds of memory. Also, in case of system outages, this data needs to be persisted someplace.
Store the incoming data in a database, and only process the next set of data if it is not already present in the database. This helps with the durability of the history in case of some catastrophe, but there's the overhead of maintaining proper indexes and aggressive sharding in case of performance-related issues.
....or some other technique
Can somebody point out some case-studies or established patterns or practices to solve this particular issue?
Thanks
You need some kind of backing store for persistence, whatever the solution, so that much has to be implemented no matter what. But it doesn't have to be an SQL database for something so simple - see, for example, alternatives to memcached that can persist to disk.
In addition to that, you could consider Bloom filters to reduce the in-memory footprint. These can give false positives, so you would then need to fall back to a second (slower but reliable) layer, which could be the disk store.
And finally, the need for idempotent behaviour is really common in messaging/enterprise systems, so searching for that turns up more papers/ideas (not sure if you're aware that "idempotent" is a useful search term).
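A Bloom filter for this can be sketched in a few lines (toy parameters; in practice, size the bit array from the expected record count and target false-positive rate):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent bit positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()
seen.add("datapointA\tdatapointB\tdatapointC")
```

If `might_contain` returns True, fall back to the authoritative disk store to rule out a false positive; if it returns False, the record is definitely new and can be processed immediately.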
You could create a hash of the data and store that in a backing store; it would be smaller than the actual data (provided your data isn't smaller than a hash).
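For example, a sketch assuming SHA-256 (a fixed 32 bytes per record regardless of record size), with an in-memory set standing in for the backing store:

```python
import hashlib

seen_digests = set()  # in practice, a persistent key-value store

def is_duplicate(record: str) -> bool:
    """Record a 32-byte digest instead of the full record."""
    digest = hashlib.sha256(record.encode()).digest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

A cryptographic hash makes accidental collisions between distinct records negligible, so a digest match can safely be treated as "already processed".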

What's the best way to cache binary data?

I pre-generate 20+ million gzipped html pages, store them on disk, and serve them with a web server. Now I need this data to be accessible by multiple web servers. Rsync-ing the files takes too long. NFS seems like it may take too long.
I considered using a key/value store like Redis, but Redis only stores strings as values, and I suspect it will choke on gzipped files.
My current thinking is to use a simple MySQL/Postgres table with a string key and a binary value. Before I implement this solution, I wanted to see if anyone else had experience in this area and could offer advice.
I've heard good things about Redis, that's one option.
I've also heard extremely positive things about memcached. It is suitable for binary data as well.
Take Facebook for example: These guys use memcached, also for the images!
As you know, images are in binary.
So, get memcached, get a machine to utilize it, a binder for PHP or whatever you use for your sites, and off you go! Good luck!
First off, why cache the gzips? Network latency and transmission time are orders of magnitude higher than the CPU time spent compressing the file, so doing it on the fly may be the simplest solution.
However, if you definitely have a need, then I'm not sure a central database is going to be any quicker than a file share (of course you should be measuring, not guessing, these things!). A simple approach could be to host the original files on an NFS share and let each web server gzip and cache them locally on demand. memcached (as Poni suggests) is also a good alternative, but it adds a layer of complexity.
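The gzip-and-cache-locally-on-demand idea can be sketched like this (`gzip` from the standard library; the dict cache and the `render` callback are illustrative stand-ins for a per-server on-disk cache and the page source):

```python
import gzip

cache = {}  # in practice, an on-disk cache local to each web server

def get_gzipped(page_id: str, render) -> bytes:
    """Return the gzipped page, compressing on first access only."""
    if page_id not in cache:
        # Cache miss: fetch/render the original bytes and compress once.
        cache[page_id] = gzip.compress(render(page_id))
    return cache[page_id]

body = get_gzipped("home", lambda _: b"<html>hello</html>" * 100)
assert gzip.decompress(body) == b"<html>hello</html>" * 100
```

Each server fills its own cache from the shared originals, so nothing needs to be rsync-ed and a cold cache costs only one compression per page per server.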
