We have around 100 GB of data from different entities.
There are some dimension entities, some fact entities, and some transaction entities. The transaction entities run to roughly 50 million rows. All of this data is stored in blob storage, with one CSV file per entity.
There will be a copy in In-Role Cache for quick access.
In addition, incoming transactions must be processed using LINQ joins against some of the other entities, and the resulting updates, inserts, and deletes must be applied both to the entities stored in the cache and to the data on Blob Storage.
Is it a good idea to store each entity as a single list (one cache object per entity), or to store each item (row) of an entity as a separate cache entry?
What is the usual practice in this kind of caching scenario?
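For concreteness, here is a minimal sketch of the two layouts under consideration, assuming the In-Role Cache DataCache API (Microsoft.ApplicationServer.Caching); the Product entity and key names are hypothetical:

using System.Collections.Generic;
using Microsoft.ApplicationServer.Caching;

public class Product { public int Id; public string Name; }

public class EntityCacheLayouts
{
    private readonly DataCache _cache = new DataCache("default");

    // Option 1: the whole entity as one cached list. Reading or changing
    // a single row means deserializing and re-putting all rows.
    public void PutWholeEntity(List<Product> products)
    {
        _cache.Put("Products", products);
    }

    // Option 2: one cache item per row. Individual rows can be read,
    // updated, or removed without touching the rest of the entity.
    public void PutPerItem(List<Product> products)
    {
        foreach (var p in products)
            _cache.Put("Product:" + p.Id, p);
    }

    public Product GetById(int id)
    {
        return (Product)_cache.Get("Product:" + id);
    }
}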
We have an existing API with a very simple cache-hit/cache-miss system using Redis. It supports being searched by key, so a query that translates to the following is easily cached based on its primary key.
SELECT * FROM [Entities] WHERE PrimaryKeyCol = #p1
Any subsequent requests can look up the entity in Redis by its primary key, or fall back to the database and then populate the cache with that result.
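For reference, a minimal sketch of that cache-hit/cache-miss flow, assuming StackExchange.Redis and System.Text.Json; the Entity type and LoadFromDatabase are hypothetical stand-ins for the real data access layer:

using System;
using System.Text.Json;
using StackExchange.Redis;

public class Entity { public int Id { get; set; } public string Name { get; set; } }

public class EntityRepository
{
    private readonly IDatabase _redis =
        ConnectionMultiplexer.Connect("localhost").GetDatabase();

    public Entity GetByPrimaryKey(int id)
    {
        string key = "Entity:" + id;

        // Cache hit: return the cached copy.
        RedisValue cached = _redis.StringGet(key);
        if (cached.HasValue)
            return JsonSerializer.Deserialize<Entity>(cached);

        // Cache miss: fall back to the database, then populate the cache.
        Entity entity = LoadFromDatabase(id);
        if (entity != null)
            _redis.StringSet(key, JsonSerializer.Serialize(entity),
                             TimeSpan.FromMinutes(10));
        return entity;
    }

    private Entity LoadFromDatabase(int id)
    {
        // SELECT * FROM [Entities] WHERE PrimaryKeyCol = @p1
        throw new NotImplementedException();
    }
}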
We're in the process of building a new API that will allow searches by a lot more params, will return multiple entries in the results, and will be under fairly high request volume (enough so that it will impact our existing DTU utilization in SQL Azure).
Queries will be searchable by several other terms: multiple PKs in one search, various other FK lookup columns, LIKE/CONTAINS statements on text, etc.
In this scenario, are there any design patterns or cache strategies we could consider? Redis doesn't seem to lend itself particularly well to these types of queries. I'm considering simply hashing the query params, then caching that hash as the key and the entire result set as the value.
But this feels like a bit of a naive approach given the key-value nature of Redis, and the fact that one entity might be contained within multiple result sets under multiple query hashes.
(For reference, the source of this data is currently SQL Azure, and we're using Azure's hosted Redis service. We're also looking at alternative approaches to hitting the DB, including denormalizing the data, ETLing the data to CosmosDB, and hosting the data in Azure Search, but there are other implications for doing these, including implementation time, "freshness" of data, etc.)
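If we did go the query-hash route, this is roughly what I have in mind (a minimal sketch; the parameter handling is hypothetical): normalize the search parameters, hash them, and use the hash as the Redis key for the whole serialized result set:

using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class QueryCacheKey
{
    public static string FromParams(params (string Name, string Value)[] searchParams)
    {
        // Sort by name so the same parameters in a different order
        // produce the same cache key.
        string normalized = string.Join("&",
            searchParams.OrderBy(p => p.Name)
                        .Select(p => p.Name + "=" + p.Value));

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(normalized));
            return "query:" + BitConverter.ToString(hash).Replace("-", "");
        }
    }
}

// Usage: the key for { type = "widget", region = "EU" } is stable
// regardless of parameter order:
// string key = QueryCacheKey.FromParams(("region", "EU"), ("type", "widget"));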
Personally, I wouldn't try and cache the results, just the individual entities. When I've done things like this in the past, I return a list of IDs from live queries, and retrieve individual entities from my cache layer. That way the ID list is always "fresh", and you don't have nasty cache invalidation logic issues.
If you really do have commonly reoccurring searches, you can cache the results (of ids), but you will likely run into issues of pagination and such. Caching query results can be tricky, as you generally need to cache all the results, not just the first "page" worth. This is generally very expensive, and has high transfer costs that exceed the value of the caching.
Additionally, you will absolutely have freshness issues with caching query results. As new records show up, they won't be in the cached list. This is avoided with the entity-only cache, as the list of IDs is always fresh, just the entities themselves can be stale (but that has a much easier cache-expiration methodology).
If you are worried about the staleness of the entities, you can return not only an ID, but also a "Last updated date", which allows you to compare the freshness of each entity to the cache.
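As a rough sketch of what I mean (an in-process dictionary stands in for the cache layer here): the live query returns (Id, LastUpdated) pairs, and each entity is served from the cache only if the cached copy is at least as fresh:

using System;
using System.Collections.Generic;

public class CachedEntity<T> { public T Value; public DateTime CachedAt; }

public class SearchService<T>
{
    private readonly Dictionary<int, CachedEntity<T>> _cache =
        new Dictionary<int, CachedEntity<T>>();

    public List<T> Search(IEnumerable<(int Id, DateTime LastUpdated)> liveIds,
                          Func<int, T> loadFromDatabase)
    {
        var results = new List<T>();
        foreach (var (id, lastUpdated) in liveIds)
        {
            // Use the cached entity only if it was cached after the row
            // was last updated; otherwise reload it and refresh the cache.
            if (!_cache.TryGetValue(id, out var hit) || hit.CachedAt < lastUpdated)
            {
                hit = new CachedEntity<T>
                {
                    Value = loadFromDatabase(id),
                    CachedAt = DateTime.UtcNow
                };
                _cache[id] = hit;
            }
            results.Add(hit.Value);
        }
        return results;
    }
}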
We're planning to use Azure blob storage to save processing log data for later analysis. Our systems are generating roughly 2000 events per minute, and each "event" is a json document. Looking at the pricing for blob storage, the sheer number of write operations would cost us tons of money if we take each event and simply write it to a blob.
My question is: Is it possible to create multiple blobs in a single write operation, or should I instead plan to create blobs containing multiple event data items (for example, one blob for each minute's worth of data)?
It is possible, but it isn't good practice: merging multipart files takes a long time. That's why we separate the upload action from the entity persist operation, passing the entity ID and updating the document/image name in another controller.
It also keeps your upload functionality clean.
It's impossible to create multiple blobs in a single write operation.
One feasible solution is to create blobs containing multiple event data items, as you planned (which is hard to implement and query, in my opinion); another is to store the event data in Azure Table storage rather than Blob storage, and leverage Entity Group Transactions to write table entities in one batch (which is billed as one transaction).
Please note that all table entities in one batch must have the same partition key, which should be considered when you're designing your table (see Azure Storage Table Design Guide for further information). If some of your events have large data size that exceeds the size limitation of Azure Storage Table (1MB per entity, 4MB per batch), you can save data of those events to Blob and store the blob links in Azure Storage Table.
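For illustration, here is a minimal sketch of batching events into one Entity Group Transaction with the classic WindowsAzure.Storage SDK; EventEntity and the partition-per-minute scheme are hypothetical:

using System;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage.Table;

public class EventEntity : TableEntity
{
    public string Json { get; set; } // the raw event document
}

public static class EventWriter
{
    public static void WriteBatch(CloudTable table, string partitionKey,
                                  IEnumerable<EventEntity> events)
    {
        var batch = new TableBatchOperation();
        foreach (var e in events)
        {
            e.PartitionKey = partitionKey; // e.g. one partition per minute
            if (string.IsNullOrEmpty(e.RowKey))
                e.RowKey = Guid.NewGuid().ToString();
            batch.Insert(e);
            if (batch.Count == 100) // at most 100 entities per batch
            {
                table.ExecuteBatch(batch); // billed as one transaction
                batch = new TableBatchOperation();
            }
        }
        if (batch.Count > 0)
            table.ExecuteBatch(batch);
    }
}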
I was wondering if I could get an explanation of the differences between in-memory caches (Redis, Memcached), in-memory data grids (GemFire), and in-memory databases (VoltDB). I'm having a hard time distinguishing the key characteristics of the three.
Cache - By definition, this means it is stored in memory. Any data stored in memory (RAM) for faster access is called a cache. Examples: Ehcache, Memcached. Typically you put an object in the cache with a String as the key and access it using that key. It is very straightforward. It is up to the application when to access the cache vs. the database, and no complex processing happens in the cache. If the cache spans multiple machines, it is called a distributed cache. For example, Netflix uses EVCache, which is built on top of Memcached, to store the movie recommendations you see on the home screen.
In-Memory Database - It has all the features of a cache plus some processing/querying capabilities. Redis falls under this category. Redis supports multiple data structures, and you can query the data in Redis (for example, get the last 10 accessed items, get the most used item, etc.). It can span multiple machines, is usually very performant, and also supports persistence to disk if needed. For example, Twitter uses Redis to store timeline information.
I don't know about GemFire and VoltDB, but even Memcached and Redis are very different. Memcached is really simple caching: a place to store variables in a very uncomplicated fashion and then retrieve them, so you don't have to go to a file or database lookup every time you need that data. The variable types are very simple. Redis, on the other hand, is actually an in-memory database with a very interesting selection of data types. It has a wonderful data type for sorted lists, which works great for applications such as leaderboards: you add your new record to the data, and it gets sorted automagically.
So I wouldn't get too hung up on the categories. You really need to examine each tool differently to see what it can do for you, and the application you're building. It's kind of like trying to draw comparisons on nosql databases - they are all very different, and do different things well.
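To make the sorted-list point concrete, here is a small sketch using StackExchange.Redis; the key and member names are made up:

using StackExchange.Redis;

public static class LeaderboardExample
{
    public static void Run()
    {
        IDatabase db = ConnectionMultiplexer.Connect("localhost").GetDatabase();

        // Each add keeps the set ordered by score automatically.
        db.SortedSetAdd("leaderboard", "alice", 4200);
        db.SortedSetAdd("leaderboard", "bob", 3100);
        db.SortedSetAdd("leaderboard", "carol", 5600);

        // Top 10 members, highest score first.
        RedisValue[] top = db.SortedSetRangeByRank("leaderboard", 0, 9,
                                                   Order.Descending);
    }
}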
I would add that things in the "database" category tend to have more features to protect and replicate your data than a simple "cache". A cache is (usually) temporary, whereas database data should be persistent. Many cache solutions I've seen do not persist to disk, so if you lost power to your whole cluster, you'd lose everything in the cache.
But there are some cache solutions that have persistence and replication features too, so the line is blurry.
An in-memory cache is a common query store and therefore relieves the DB of read workloads; Redis cache is a common example. For instance, a web site might store popular searches made by clients, thereby relieving the DB of some load.
An in-memory database provides query functionality on top of caching (storing data such as session data in RAM, i.e. temporary storage).
Memcached falls into the plain temporary-store caching category.
The official Spark documentation says:
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
What does caching tables using an in-memory columnar format really mean? Does it put the whole table into memory? As we know, caching is also lazy: the table is cached only after the first action on the query. Does it make any difference to the cached table which actions or queries are chosen? I've googled this caching topic several times but failed to find any detailed articles. I would really appreciate it if anyone could provide some links or articles on this topic.
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
Yes, caching a table puts the whole table in memory, compressed, if you use this setting: spark.sql.inMemoryColumnarStorage.compressed = true. Keep in mind that caching a DataFrame is lazy caching, which means it will only cache the rows that are used in the next processing event. So if you run a query on that DataFrame and only scan 100 rows, only those rows will be cached, not the entire table. If you run CACHE TABLE MyTableName in SQL, though, it defaults to eager caching and will cache the entire table. You can choose lazy caching in SQL like so:
CACHE LAZY TABLE MyTableName
I am planning to do some in-memory caching of my data for operations in my web service. This data would basically be lookup values that do not change frequently. I was planning to load all that data into DataSets (multiple tables) and keep them cached until the data changes on the DB side. This is because some of my data never changes, while some may change quite frequently. Any ideas?
I would probably cache it at the DataTable level; then each table could have its own caching rules (expiration time, last updated, etc.).
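Something along these lines, assuming System.Runtime.Caching.MemoryCache; loadTableFromDb is a hypothetical stand-in for the real lookup query:

using System;
using System.Data;
using System.Runtime.Caching;

public static class LookupCache
{
    private static readonly MemoryCache Cache = MemoryCache.Default;

    public static DataTable GetTable(string tableName, TimeSpan maxAge,
                                     Func<string, DataTable> loadTableFromDb)
    {
        var table = Cache.Get(tableName) as DataTable;
        if (table == null)
        {
            table = loadTableFromDb(tableName);
            // Each table gets its own expiration window, so rarely-changing
            // tables can be cached longer than volatile ones.
            Cache.Set(tableName, table, new CacheItemPolicy
            {
                AbsoluteExpiration = DateTimeOffset.Now.Add(maxAge)
            });
        }
        return table;
    }
}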