Need persistent in-memory cache with multi-key lookups

We have a requirement where we need to look up entries by multiple keys, so we are looking for multiple indexes.
For example:
Trade data contains the following parameters:
Date
Stock
Price
Quantity
Account
We will be storing each trade in a list with Stock as the key. This gives us the ability to query all the trades of a given stock. However, we also have queries such as the list of all trades in an account, and we would like to serve those from the same cache rather than a separate one. The requirement is for an in-memory cache (Java), as the latency requirement is very low. We also need a persistent cache, so that the cache is re-populated when the application is restarted.
Please let me know if there is any good solution available, as the only options for a persistent cache seem to be the distributed ones.

One way to make queries faster is to create a TradeMeta object with only the attributes you would like to query on, i.e.
Date, Stock, Price, Quantity, Account
The TradeMeta objects can be stored in a map with indexes on all the above fields. This ensures Hazelcast maintains the relevant buckets internally for fast lookup. Predicates can be run against this tradeMetaMap to fetch the keys. Once you have the keys, use getAsync to fetch the full trade objects from tradeMap.
To persist the cache you would require Hazelcast Enterprise HD, which has the High-Density Memory Store and the Hot Restart Store.
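As a rough illustration of the TradeMeta approach above, here is a minimal sketch against the open-source Hazelcast 3.x API; the Trade and TradeMeta classes, map names and key values are hypothetical, and the Hot Restart persistence itself is configured separately in Enterprise HD.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicates;
import java.util.Set;

public class TradeCache {
    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Lightweight metadata map, indexed on every attribute we query by.
        // TradeMeta is a hypothetical class holding date, stock, price, quantity, account.
        IMap<String, TradeMeta> tradeMetaMap = hz.getMap("tradeMeta");
        tradeMetaMap.addIndex("stock", false);   // unordered index for equality lookups
        tradeMetaMap.addIndex("account", false);
        tradeMetaMap.addIndex("date", true);     // ordered index for range queries

        // Full trade objects, keyed by trade id (Trade is also hypothetical).
        IMap<String, Trade> tradeMap = hz.getMap("trades");

        // Query the metadata map for keys, then fetch the full trades asynchronously.
        Set<String> tradeIds = tradeMetaMap.keySet(Predicates.equal("account", "ACC-123"));
        for (String id : tradeIds) {
            tradeMap.getAsync(id);               // ICompletableFuture<Trade> in Hazelcast 3.x
        }
    }
}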

Related

Elassandra data modeling: When to create a secondary index and when not to

I'm not an absolute expert on Cassandra, but what I know (correct me if I'm wrong) is that creating a secondary index for every field in a data model is an anti-pattern.
I'm using Elassandra and my data model looks like this:
A users object that represents a user, with: userID, name, phone, e-mail, and all kinds of info on users (say these users are selling things)
A sales object that represents a sale made by a user, with: saleID, userID, product name, price, etc. (there can be many more fields)
Given that I want to make complex searches on the user (search by phone, search by e-mail, etc.) only on name, e-mail and phone, is it a good idea to create the following 3 tables from this data model:
"User core" table with only userID, name, phone and e-mail (fields for search) [Table fully indexed and mapped in Elasticsearch]
"User info" table with userID + the other infos [Table not indexed or mapped in Elasticsearch]
"Sales" table with userID, saleID, product name, price, etc. [Table not indexed or mapped in Elasticsearch]
I see at least one advantage: any indexing (or re-indexing when changes happen) and its associated costs will only occur when there is a change in the "User core" table, which should not change too frequently.
Also, if I need to get the other data (user info or sales), I can just make 2 queries: one against "User core" to get the userID, and one against the other table (with the userID) to get the other data.
But I'm not sure this is a good pattern, or maybe I should not worry about secondary indexing and just index the other tables as well?
To summarize: what are the key reasons to choose a secondary index like Elasticsearch in Elassandra versus denormalizing tables and using partition and clustering keys?
Please feel free to ask if you need more examples on my use case.
You should not normalise the tables when you're using Cassandra. The most important aspect of data modelling for Cassandra is to design one table for each application query. To put it another way, you should always denormalise your tables.
After you've modelled a table for each query, use Elassandra to index the table which contains the most columns that you need to query.
It's important to note that Elassandra is not a magic bullet. In a lot of cases, you do not need to index the tables if you have modelled them against your application queries correctly.
The use case for Elassandra is to take advantage of features such as free-form text search, faceting, boosting, etc., but it will not be as performant as a native table. The fact is that index lookups require more "steps" than a straightforward single-partition Cassandra read. Of course, YMMV depending on your use case and access patterns. Cheers!
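As a minimal sketch of the "one table per application query" idea (plain Cassandra, nothing Elassandra-specific), assuming the DataStax Java driver 4.x and hypothetical keyspace, table and column names:
import com.datastax.oss.driver.api.core.CqlSession;

public class UserSchemaSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("shop").build()) {
            // The same user data denormalized into two query-specific tables,
            // each partitioned by the column the application searches on.
            session.execute("CREATE TABLE IF NOT EXISTS users_by_email ("
                    + " email text PRIMARY KEY, user_id uuid, name text, phone text)");
            session.execute("CREATE TABLE IF NOT EXISTS users_by_phone ("
                    + " phone text PRIMARY KEY, user_id uuid, name text, email text)");

            // Each lookup is now a single-partition read with no secondary index.
            session.execute("SELECT user_id, name FROM users_by_email WHERE email = 'alice@example.com'");
        }
    }
}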
I don't think Erick's answer is fully correct in the case of Elassandra.
It is correct that native Cassandra queries will outperform Elasticsearch, and that in pure Cassandra you should model your tables around your queries.
But if you prefer flexibility over performance (and this is mainly why you choose Elassandra), you can use Cassandra as primary storage, benefit from Cassandra's replication performance, and index the tables for search in Elasticsearch.
This enables you to be flexible on the search side and still be sure not to lose data in case something goes wrong on the Elasticsearch side.
In fact, in production we use a combination of both: tables have their partition/clustering keys and are indexed in Elasticsearch (when necessary). In the backend you can decide whether you can query by Cassandra keys or whether Elasticsearch is required.

Caching Strategy/Design Pattern for complex queries

We have an existing API with a very simple cache-hit/cache-miss system using Redis. It supports being searched by key, so a query that translates to the following is easily cached based on its primary key.
SELECT * FROM [Entities] WHERE PrimaryKeyCol = #p1
Any subsequent request can look up the entity in Redis by its primary key, or fall back to the database and then populate the cache with that result.
We're in the process of building a new API that will allow searches by a lot more params, will return multiple entries in the results, and will be under fairly high request volume (enough so that it will impact our existing DTU utilization in SQL Azure).
Queries will be searchable by several other terms: multiple PKs in one search, various other FK lookup columns, LIKE/CONTAINS statements on text, etc.
In this scenario, are there any design patterns or cache strategies that we could consider? Redis doesn't seem to lend itself particularly well to these types of queries. I'm considering simply hashing the query params and then caching that hash as the key, with the entire result set as the value.
But this feels like a bit of a naive approach given the key-value nature of Redis, and the fact that one entity might be contained within multiple result sets under multiple query hashes.
(For reference, the source of this data is currently SQL Azure, and we're using Azure's hosted Redis service. We're also looking at alternative approaches to hitting the DB, incl. denormalizing the data, ETLing the data to CosmosDB, and hosting the data in Azure Search, but there are other implications of doing these, including implementation time, "freshness" of data, etc...)
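For what it's worth, a minimal sketch of the "hash the query params" idea from the question, assuming the Jedis client; the parameter names, key prefix, TTL and the runDatabaseQuery helper are hypothetical:
import redis.clients.jedis.Jedis;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.SortedMap;

public abstract class QueryHashCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public String search(SortedMap<String, String> params) throws NoSuchAlgorithmException {
        // A sorted map gives a canonical ordering, so identical searches hash identically.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(params.toString().getBytes(StandardCharsets.UTF_8));
        String cacheKey = "search:" + Base64.getEncoder().encodeToString(digest);

        String cached = jedis.get(cacheKey);
        if (cached != null) {
            return cached;                            // cache hit: whole result set as one value
        }
        String resultJson = runDatabaseQuery(params); // hypothetical SQL Azure call
        jedis.setex(cacheKey, 300, resultJson);       // short TTL, since result sets go stale quickly
        return resultJson;
    }

    // Hypothetical database helper supplied by the application.
    protected abstract String runDatabaseQuery(SortedMap<String, String> params);
}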
Personally, I wouldn't try to cache the results, just the individual entities. When I've done things like this in the past, I return a list of IDs from live queries, and retrieve individual entities from my cache layer. That way the ID list is always "fresh", and you don't have nasty cache invalidation logic issues.
If you really do have commonly recurring searches, you can cache the results (of IDs), but you will likely run into issues with pagination and such. Caching query results can be tricky, as you generally need to cache all the results, not just the first "page" worth. This is generally very expensive, and has high transfer costs that exceed the value of the caching.
Additionally, you will absolutely have freshness issues with caching query results. As new records show up, they won't be in the cached list. This is avoided with the entity-only cache, as the list of IDs is always fresh, just the entities themselves can be stale (but that has a much easier cache-expiration methodology).
If you are worried about the staleness of the entities, you can return not only an ID, but also a "Last updated date", which allows you to compare the freshness of each entity to the cache.
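A minimal sketch of that entity-only caching pattern, again assuming Jedis; the Redis key prefix, TTL and the two database helpers are hypothetical:
import redis.clients.jedis.Jedis;
import java.util.ArrayList;
import java.util.List;

public abstract class EntityCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    public List<String> search(String whereClause) {
        // 1. Always run the search against the live database, but only for IDs,
        //    so the result list itself is never stale.
        List<String> ids = queryIdsFromDatabase(whereClause);

        // 2. Hydrate each ID from Redis, falling back to the database on a miss.
        List<String> entities = new ArrayList<>();
        for (String id : ids) {
            String json = jedis.get("entity:" + id);
            if (json == null) {
                json = loadEntityFromDatabase(id);
                jedis.setex("entity:" + id, 3600, json);  // 1h TTL as a simple staleness bound
            }
            entities.add(json);
        }
        return entities;
    }

    // Hypothetical database helpers supplied by the application.
    protected abstract List<String> queryIdsFromDatabase(String whereClause);
    protected abstract String loadEntityFromDatabase(String id);
}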

Elasticsearch index design

I am maintaining a year of users' activity, including browse and purchase data. Each entry in browse/purchase is a JSON object: {item_id: id1, item_name: name1, category: c1, brand: b1, event_time: t1}.
I would like to compose different queries, such as getting all customers who browsed item A and/or purchased item B within time range t1 to t2. There are tens of millions of customers.
My current design is to use a nested object for each customer:
customer1:
customer_id: id1,
name: name1,
country: US,
browse: [{browseentry1_json},{browseentry2_json},...],
purchase: [{purchase entry1_json},{purchase entry2_json},...]
With this design, I can easily compose all kinds of queries with nested queries. The only problem is that it is hard to expire older browse/purchase data: I only want to keep, for example, one year of browse/purchase data. In this design, I will have to, at some point, read the entire index out, delete the expired browse/purchase entries, and write them back.
Another design is to use parent/child structure.
type: user is the parent of type browse and purchase.
type browse will contain each browse entry.
Although deleting old data seems easier with delete-by-query, for the above query I would have to do multiple and/or has_child queries, and it would be much less performant. In fact, I was initially using the parent/child structure, but the query time seemed really long. I thus gave it up and tried to switch to nested objects.
I am also thinking about using nested objects, but breaking the data into different indices (like monthly indices) so that I can easily expire old data. The problem with this approach is that I have to query across those multiple indices and do an aggregation on that to get the distinct users, which I assume will be much slower (haven't tried yet). One requirement of this project is to be able to return the count for these queries in an acceptable time frame (like seconds), and I am afraid this approach may not be acceptable.
The ES cluster has 7 machines, each with 8 cores and 32 GB of memory.
Any suggestions?
Thanks in advance!
Chen
Instead of creating a customers index, I would create "browsing" and "purchasing" indices, separated by a timespan (e.g. monthly, as you mentioned in your last paragraph).
In each document I would add the customer fields. Now you are facing two different approaches:
1. You can add only a reference to the customer (such as an id) and make another query to get their details.
2. If you don't have any storage problem, you can keep all the customer's data in each document.
If this isn't enough for performance, you can combine it with "routing" and store all of a specific user's data on the same shard, so Elasticsearch won't need to fetch data across shards (you can watch the talk where Shay Banon explains the "user data flow" pattern).
Niv
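A minimal sketch of the monthly-index plus routing idea above, assuming the Elasticsearch 7.x high-level REST client; the index naming scheme and the pre-built JSON document (browse entry plus customer fields, i.e. approach 2) are hypothetical:
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import java.io.IOException;

public class BrowseIndexer {
    private final RestHighLevelClient client =
            new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")));

    public void indexBrowseEvent(String customerId, String browseEventJson, String yearMonth) throws IOException {
        // One index per month ("browse-2020-01"), so expiring a month of data
        // is a single index deletion rather than a delete-by-query.
        IndexRequest request = new IndexRequest("browse-" + yearMonth)
                .routing(customerId)                 // keep all of a customer's events on one shard
                .source(browseEventJson, XContentType.JSON);
        client.index(request, RequestOptions.DEFAULT);
    }
}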

Compound rowkey in Azure Table storage

I want to move some of my Azure SQL tables to Table storage. As far as I understand, I can save everything in the same table, separating it using PartitionKey and keeping it unique within each partition using RowKey.
Now, I have a table with a compound key:
ParentId: (uniqueidentifier)
ReportTime: (datetime)
I also understand RowKeys have to be strings. Will I need to combine these in a single string? Or can I combine multiple keys some other way? Do I need to make a new key perhaps?
Any help is appreciated.
UPDATE
My idea is to take data from several (three for now) database tables and put it in the same storage table, separating them with the partition key.
I will query using the ParentId and a WeekNumber (another column). This table has about 1 million rows that are deleted weekly from the DB. My two other tables have about 6 million and 3.5 million rows.
This question is pretty broad and there is no right answer.
The specific question - can you use compound keys with Azure Table Storage? Yes, you can do that. But this involves manually serializing/deserializing your object's properties. You can achieve that by overriding the TableEntity's ReadEntity and WriteEntity methods. Check this detailed blog post on how you can override these methods to use your own custom serialization/deserialization.
I will further discuss my view on your broader question.
First of all, why do you want to put data from 3 (SQL) tables into one (Azure Table)? Just have 3 Azure tables.
Second, as Fabrizio points out, the question is how you are going to query the records. The Windows Azure Table service has only one index, and that is the PartitionKey + RowKey properties (columns). If you are pretty sure you will mostly query data by a known PartitionKey and RowKey, then Azure Tables suits you perfectly! However, you say that your combination for the RowKey is ParentId + WeekNumber! That means that a record is uniquely identified by this combination! If that is true, then you are even more ready to go.
Next, you say you are going to delete records every week! You should know that the DELETE operation acts on a single entity. You can use Entity Group Transactions to DELETE multiple entities at once, but there are limits: (a) all entities in a batch operation must have the same PartitionKey, (b) the maximum number of entities per batch is 100, and (c) the maximum size of a batch operation is 4MB. Say you have 1M records, like you say. In order to delete them, you first have to retrieve them in groups of 100, then delete them in groups of 100. That is, in the best possible case, 10k operations for retrieval and 10k operations for deletion. Even if it will only cost 0.002 USD, think about the time taken to execute 10k operations against a REST API.
Since you have to delete entities on a regular basis, which is tied to a WeekNumber let's say, I can suggest that you dynamically create your tables and include the week number in their names. Thus you will achieve:
Even better partitioning of information
Easier and more granular information backup / delete
Deleting millions of entities requires just one operation - delete table.
There is no unique solution to your problem. Yes, you can use ParentID as the PartitionKey and ReportTime as the RowKey (or invert the assignment). But there are 2 big questions: how do you query your data, and with what conditions? And how much data do you store: 1,000 items, 1 million items, 1,000 million items? The total storage usage is important, but it's also very important to consider the number of transactions you will generate against the storage.
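A minimal sketch of one possible layout following the answers above (source table name as PartitionKey, ParentId plus ReportTime combined into the RowKey), assuming the legacy Azure Storage Java SDK; the entity class and payload column are hypothetical:
import com.microsoft.azure.storage.table.TableServiceEntity;
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

public class ReportEntity extends TableServiceEntity {
    private double value;      // hypothetical payload column

    public ReportEntity() { }  // no-arg constructor required by the SDK for deserialization

    public ReportEntity(String sourceTable, UUID parentId, Instant reportTime, double value) {
        // PartitionKey separates the three source tables inside one storage table.
        setPartitionKey(sourceTable);
        // Compound RowKey: ParentId + ISO-8601 timestamp, which is unique per
        // (ParentId, ReportTime) and sorts lexically in chronological order.
        setRowKey(parentId + "_" + DateTimeFormatter.ISO_INSTANT.format(reportTime));
        this.value = value;
    }

    public double getValue() { return value; }
    public void setValue(double value) { this.value = value; }
}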

Remote key-value storage allowing indexes?

In our project we already have an embedded in-memory key-value storage for objects, and it is very usable, because it allows us to make indexes for it and query the storage based on it. So, if we have a collection of "Student"s, and a compound index on student.group and student.sex, then we can find all male students from group "ABC". Same for deletion and so on.
Now we have to adapt our service to work in a cloud, so that there will be multiple servers processing user requests, and they have a shared state stored in this indexed key-value storage. We tried to adopt memcached for our needs, and it's almost ideal -- it is a fast, simple and proven solution, but it doesn't have indexes, so we can't use it to search our temporary data.
Is there any other way to have a remote cache, just like memcached, but with indexes?
Thank you.
Try Hazelcast. It is an in-memory data grid that distributes the data among servers. You can have indexes just like you described in your question and query on them.
Usage is very simple. Just add Hazelcast.jar and start coding. It can be both embedded and remote.
Here is the index and query usage.
Add the indexes:
IMap<Integer, Student> myDistributedMap = Hazelcast.getMap("students");
myDistributedMap.addIndex("group", false);
myDistributedMap.addIndex("sex", false);
Store in the IMDG:
myDistributedMap.put(student.id, student);
Query:
Collection<Student> result = myDistributedMap.values(new SqlPredicate("sex = 'male' AND group = 'ABC'"));
Finally, it works fine in the cloud, e.g. on EC2.
