My application needs to handle a lot of entities (100,000 or more), each with a location, and should display only those within a given radius. I basically store everything in SQL but use Redis for caching and optimization (mainly GEORADIUS).
I am adding the entities as in the following example (not exactly this; I use the Laravel framework with the built-in Redis facade, but it does the same thing in the background):
GEOADD k 19.059982 47.494338 '{"id":1,"name":"Foo","address":"Budapest, Astoria","lat":47.494338,"lon":19.059982}'
Is this bad practice? Will it have a negative impact on performance? Should I store only IDs as members and make a follow-up query to fetch the corresponding entities?
This is a matter of the requirements. There's nothing wrong with storing the raw data as members as long as it is unique (and it is unique, given the "id" field). In fact, this is both simple and performant, as all the data is returned with a single query (assuming that's what's actually needed).
That said, there are at least two considerations for storing the data outside the Geoset, and just "referencing" it by having members reflect some form of their key names:
A single data structure, such as a Geoset, is limited by the resources of a single Redis server. Storing a lot of data and members can require more memory than a single server can provide, which would limit the scalability of this approach.
Unless each entry's data is small, it is unlikely that all query types would require all data returned. In such cases, keeping the raw data in the Geoset generates a lot of wasted bandwidth and ultimately degrades performance.
When data needs to be updated, it can become too expensive to try and update small parts of it (i.e. ZREM and then GEOADD). Having everything outside, perhaps in a Hash (or maybe something like RedisJSON), makes more sense then; a sketch of that layout follows below.
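As a rough illustration of that layout (not the actual Laravel code; a minimal redis-py sketch with made-up key names "places" and "place:{id}"), the geoset holds only the ID and a Hash holds the full record:

    import redis

    r = redis.Redis(decode_responses=True)

    entity = {"id": 1, "name": "Foo", "address": "Budapest, Astoria",
              "lat": 47.494338, "lon": 19.059982}

    # The geoset member is just the ID; the full record lives in a Hash.
    r.geoadd("places", (entity["lon"], entity["lat"], str(entity["id"])))
    r.hset(f"place:{entity['id']}", mapping=entity)

    # GEORADIUS-style query returns IDs; fetch only what the caller needs.
    ids = r.georadius("places", 19.06, 47.49, 5, unit="km")
    results = [r.hgetall(f"place:{pid}") for pid in ids]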
Related
We have an existing API with a very simple cache-hit/cache-miss system using Redis. It supports being searched by key, so a query that translates to the following is easily cached based on its primary key.
SELECT * FROM [Entities] WHERE PrimaryKeyCol = #p1
Any subsequent request can look up the entity in Redis by its primary key or fall back to the database, and then populate the cache with that result.
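A rough sketch of that cache-hit/cache-miss flow, assuming redis-py and a hypothetical query_db_by_pk helper standing in for the SQL query above (the key scheme and TTL are made up for the example):

    import json
    import redis

    r = redis.Redis(decode_responses=True)
    CACHE_TTL = 300  # seconds; an assumed expiry


    def get_entity(pk, query_db_by_pk):
        """Look the entity up in Redis by primary key, falling back to the DB."""
        key = f"entity:{pk}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)

        # Cache miss: hit the database and populate the cache with the result.
        entity = query_db_by_pk(pk)
        if entity is not None:
            r.set(key, json.dumps(entity), ex=CACHE_TTL)
        return entity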
We're in the process of building a new API that will allow searches by a lot more params, will return multiple entries in the results, and will be under fairly high request volume (enough so that it will impact our existing DTU utilization in SQL Azure).
Queries will be searchable by several other terms: multiple PKs in one search, various other FK lookup columns, LIKE/CONTAINS statements on text, etc.
In this scenario, are there any design patterns or cache strategies we could consider? Redis doesn't seem to lend itself particularly well to these types of queries. I'm considering simply hashing the query params, caching that hash as the key, and the entire result set as the value.
But this feels like a bit of a naive approach given the key-value nature of Redis, and the fact that one entity might be contained within multiple result sets under multiple query hashes.
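For what it's worth, a minimal sketch of that hashed-query-key idea (the "search:" prefix and parameter names are invented for the example); normalizing the params keeps logically identical queries on the same key:

    import hashlib
    import json


    def query_cache_key(params: dict) -> str:
        # Sort keys so the same parameters in a different order hash identically.
        canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
        return "search:" + hashlib.sha256(canonical.encode()).hexdigest()


    # Two equivalent searches map to a single cache entry.
    assert query_cache_key({"status": "active", "country": "US"}) == \
           query_cache_key({"country": "US", "status": "active"})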
(For reference, the source of this data is currently SQL Azure, and we're using Azure's hosted Redis service. We're also looking at alternative approaches to hitting the DB, including denormalizing the data, ETLing the data to Cosmos DB, and hosting the data in Azure Search, but there are other implications to these, including implementation time, "freshness" of data, etc.)
Personally, I wouldn't try and cache the results, just the individual entities. When I've done things like this in the past, I return a list of IDs from live queries, and retrieve individual entities from my cache layer. That way the ID list is always "fresh", and you don't have nasty cache invalidation logic issues.
If you really do have commonly recurring searches, you can cache the results (of IDs), but you will likely run into issues with pagination and such. Caching query results can be tricky, as you generally need to cache all the results, not just the first "page" worth. This is generally very expensive and has high transfer costs that exceed the value of the caching.
Additionally, you will absolutely have freshness issues with caching query results. As new records show up, they won't be in the cached list. This is avoided with the entity-only cache, as the list of IDs is always fresh, just the entities themselves can be stale (but that has a much easier cache-expiration methodology).
If you are worried about the staleness of the entities, you can return not only an ID, but also a "Last updated date", which allows you to compare the freshness of each entity to the cache.
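A hedged sketch of that pattern with redis-py, where run_live_query and load_entities_from_db are hypothetical stand-ins for the SQL calls and the "entity:{id}" key scheme is assumed:

    import json
    import redis

    r = redis.Redis(decode_responses=True)


    def search(run_live_query, load_entities_from_db, **criteria):
        """The ID list is always fresh; only the entity bodies come from cache."""
        ids = run_live_query(**criteria)  # e.g. SELECT Id FROM ... WHERE ...

        keys = [f"entity:{i}" for i in ids]
        cached = r.mget(keys) if keys else []
        entities, missing = {}, []
        for i, raw in zip(ids, cached):
            if raw is None:
                missing.append(i)
            else:
                entities[i] = json.loads(raw)

        # Fetch only the cache misses from the database and backfill the cache.
        if missing:
            for e in load_entities_from_db(missing):
                entities[e["id"]] = e
                r.set(f"entity:{e['id']}", json.dumps(e), ex=600)

        return [entities[i] for i in ids if i in entities]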
Is there an advantage to storing the metadata (or indexing data) for a document/*LOB separately from the raw data?
For instance, having a table/collection/bucket with an index on (name, school):
ID: 123
name: Johny
School: Harvard
Transcript: /*2MB text/binary*/
vs
Metadata
ID: 123
name: Johny
School: Harvard
Data
ID: 123
Transcripts: /*2MB text/binary*/
Let's assume MongoDB, although it's perhaps really DB-agnostic.
db.firstModel.find({},{transcripts:0}) vs db.secondModel.find()
Additionally, if we have aggregation/grouping on the metadata, would the heavy payload in the transcripts weigh it down (even though the aggregation is on other fields)? Is it better to aggregate on the metadata collection separately and then retrieve by ID from the data collection, or is it better to respect the database design and keep everything coupled in a single document?
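A quick pymongo sketch of the two shapes, assuming collection names students (single collection), metadata and transcripts (split), which are made up for the example:

    from pymongo import MongoClient

    db = MongoClient()["school"]  # database name assumed for the example

    # Single-collection model: exclude the heavy field with a projection.
    light_docs = db.students.find({}, {"transcript": 0})

    # Split model: aggregate on the small metadata collection only...
    per_school = db.metadata.aggregate([
        {"$group": {"_id": "$school", "count": {"$sum": 1}}}
    ])

    # ...and pull the heavy payload by ID only when it is actually needed.
    meta = db.metadata.find_one({"name": "Johny", "school": "Harvard"})
    transcript = db.transcripts.find_one({"_id": meta["_id"]}) if meta else None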
In Couchbase, if it works for your use case, an option might be to give the object ID for your 2MB document a form like harvard::johny::123. Every object would follow such a pattern for its object ID, used consistently throughout your application. Your application can then easily piece the object ID together, and you do not have to query or use views. You know it is harvard and johny and his 123rd object, so you can just get it by ID. Since you already know the answer, there is no querying, and Couchbase will be very fast.
That being said, there may be other metadata that you want to keep in that metadata object and index on, and then yes, in Couchbase it might be better to break out the documents like you suggest. It might even be better to put them in separate buckets so the indexers only look at things they will index.
For an example that may not be entirely applicable to your use case, but should give you an idea of what is possible, go here.
All of that being said, from experience I do not like keeping larger objects like you suggest in a DB long term, regardless of the DB. From an operational perspective it is terrible. You are storing what amounts to static data in a layer that needs to be very performant, usually on expensive storage, and you have to back those objects up over time. They become a boat anchor around your neck after a few months/years. I suggest keeping the metadata in a fast-performing system like Couchbase (cache + persistence with replication, etc.) that also holds a pointer to the large objects in something that is best at dishing out large static objects, like HDFS, Amazon S3, etc.
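As a rough sketch of that split (bucket, collection and field names are invented; boto3 and pymongo stand in for whatever object store and metadata store you actually use):

    import boto3
    from pymongo import MongoClient

    s3 = boto3.client("s3")
    metadata = MongoClient()["school"]["metadata"]


    def store_transcript(student_id, name, school, transcript_bytes):
        """Small, indexable metadata goes in the database; the large static
        blob goes to object storage, with only a pointer kept alongside."""
        s3_key = f"transcripts/{student_id}"
        s3.put_object(Bucket="my-transcripts-bucket", Key=s3_key, Body=transcript_bytes)

        metadata.update_one(
            {"_id": student_id},
            {"$set": {"name": name, "school": school, "transcript_s3_key": s3_key}},
            upsert=True,
        )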
I have a web application that stores projects in the database.
I have decided to use AppFabric Caching to improve performance.
What would be the best pattern regarding the options below (or on what criteria should I decide):
store each project separately in the cache.
OR store the whole list in the cache (i.e. one key which represents the list of items)?
Many Thanks,
Joseph
It depends. There are a couple of considerations.
If the list is potentially enormous, the content of the individual cache key could get very large (obviously this could be mitigated by enabling local caching). Serializing and deserializing a large object graph like this is going to consume time and resources on your client.
You may, however, want to do this anyway, as your application may need to execute a LINQ-to-objects query against the list after it has been deserialized back from the cache.
If the queries you execute against the list are well defined, you could cache multiple flavors of the list under different cache keys - instead of people, you could have PeopleMale, PeopleFemale, PeopleAmerican, PeopleIrish, PeopleFrench etc.
If you do this you could potentially have the same person appearing under multiple cached person lists and you would have to manage this.
For example, say I have a female person with dual American and Irish citizenship. If I edit that person so the gender changes from female to male and the citizenship changes to Dutch, it would be necessary to ensure that four keys are invalidated: PeopleMale, PeopleFemale, PeopleAmerican and PeopleIrish.
The example I've given above could get tricky to manage - whether it's worth it or not really depends on your exact use case.
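A small sketch of that invalidation bookkeeping (the cache client and key names are assumed; the point is to union the list keys derived from the old and the new state):

    def list_keys_for(person):
        """All cached list keys this person can appear in."""
        keys = {f"People{person['gender']}"}                    # PeopleMale / PeopleFemale
        keys |= {f"People{c}" for c in person["citizenships"]}  # PeopleAmerican, ...
        return keys


    def update_person(cache, old, new):
        # Invalidate every list the person was in or is now in, so neither
        # side of the change serves a stale copy.
        for key in list_keys_for(old) | list_keys_for(new):
            cache.delete(key)


    # The dual-citizenship example above: PeopleFemale, PeopleAmerican,
    # PeopleIrish and PeopleMale are invalidated (plus PeopleDutch, if that
    # list is also cached).
    old = {"gender": "Female", "citizenships": ["American", "Irish"]}
    new = {"gender": "Male", "citizenships": ["Dutch"]}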
In general, where possible, I'd advise you to only use cache keys containing lists for relatively non-volatile reference data (countries, status types, nationalities etc).
Hope this helps.
I'm working on a new Joomla! module where I need to store read-only data of about 40 key/value pairs, each with a keyword and a corresponding URL. There are several options, but I'm not sure which one would be convenient for the programmer and fast-loading for the user. Or maybe, because the amount of data is so small, it doesn't really matter what method is used.
I could hardcode the values into an array as part of the module code. Not convenient to update but it does load fast.
I could store the data in a flat file or XML file. This would require additional code to implement and would be convenient for updating the list, but doesn't load as fast as being hardcoded.
I could create a table in the database. The Joomla API makes this a no-brainer to use, but I'm not sure how much overhead there would be, with everything else also being loaded from the database.
How do I logically evaluate which one works best without trying out each of the options?
Your two opposing concerns are:
frequency with which the programmer updates these key value pairs
frequency with which the application queries them
If they're updated more than occasionally, your best bet is to have them in the database and then cache the data at some desirable interval if you're worried about it.
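To make that concrete, a minimal sketch of "database plus interval cache" (Python pseudocode rather than the actual Joomla/PHP module; load_from_db and the interval are placeholders):

    import time

    _cache = {"data": None, "loaded_at": 0.0}
    CACHE_SECONDS = 3600  # assumed refresh interval


    def get_links(load_from_db):
        """Return the ~40 keyword -> URL pairs, hitting the DB at most
        once per CACHE_SECONDS."""
        now = time.time()
        if _cache["data"] is None or now - _cache["loaded_at"] > CACHE_SECONDS:
            _cache["data"] = load_from_db()  # e.g. SELECT keyword, url FROM the module's table
            _cache["loaded_at"] = now
        return _cache["data"]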
Most of the time, we just get the result from the database and then save it in a cache server with an expiration time.
When do we need to persist that key/value pair, and what's the significant benefit of doing so?
If you need to persist the data, then you would want a key/value database. In particular, as part of the NoSQL movement, many people have suggested replacing traditional SQL databases with key/value databases - but ultimately, the choice of which paradigm is a better fit for your application remains with you.
Use a key/value database when you are using a key/value cache and you don't need a sql database.
When you use memcached/mysql or similar, you need to write two sets of data access code - one for getting objects from the cache, and another from the database. If the cache is your database, you only need the one method, and it is usually simpler code.
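For example, a minimal sketch with redis-py where Redis is the only store (key scheme assumed), so there is exactly one read path and one write path:

    import redis

    r = redis.Redis(decode_responses=True)


    def save_user(user):
        # One write path: the key/value store is the system of record,
        # so there is no separate "update the cache" step.
        r.hset(f"user:{user['id']}", mapping=user)


    def get_user(user_id):
        # One read path: no cache-miss fallback to a second database.
        data = r.hgetall(f"user:{user_id}")
        return data or None


    save_user({"id": 1, "name": "Foo"})
    print(get_user(1))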
You do lose some functionality by not using SQL, but in a lot of cases you don't need it. Only the worst applications actually leave constraint checking to the database. Ad-hoc queries become impractical at scale. The occasional lost or inconsistent record simply doesn't matter if you are working with tweets rather than financial data. How do you justify the added complexity of using a SQL database?