Remote key-value storage allowing indexes? - performance

In our project we already have an embedded in-memory key-value store for objects, and it is very convenient because it lets us define indexes and query the storage against them. So if we have a collection of Students and a compound index on student.group and student.sex, we can find all male students from group "ABC". The same goes for deletion and so on.
Now we have to adapt our service to run in the cloud, so that there will be multiple servers processing user requests that share state stored in this indexed key-value storage. We tried to adopt memcached for our needs, and it is almost ideal -- fast, simple and a proven solution -- but it doesn't have indexes, so we can't use it to search our temporary data.
Is there any other way to have a remote cache, just like memcached, but with indexes?
Thank you.

Try Hazelcast. It is an in-memory data grid that distributes data among servers. You can have indexes just like you described in your question and query on them.
Usage is very simple. Just add Hazelcast.jar and start coding. It can be both embedded and remote.
Here is the index and query usage.
Add the index:
IMap<Integer, Student> myDistributedMap = Hazelcast.getMap("students");
myDistributedMap.addIndex("group", false);
myDistributedMap.addIndex("sex", false);
Store in the IMDG:
myDistributedMap.put(student.id, student);
Query:
Collection<Student> result = myDistributedMap.values(new SqlPredicate("sex = 'male' AND group = 'ABC'"));
Finally, it works fine in the cloud, e.g. on EC2.
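For the index attributes above to resolve, the map value needs matching accessors and has to be serializable so it can be distributed across the cluster. A minimal sketch of what the Student entry could look like (the exact fields and the Serializable-based serialization are assumptions, not part of the original answer):

import java.io.Serializable;

// Hypothetical entry class: Hazelcast resolves the "group" and "sex" index attributes via the getters.
public class Student implements Serializable {
    public int id;
    private String group;
    private String sex;

    public Student(int id, String group, String sex) {
        this.id = id;
        this.group = group;
        this.sex = sex;
    }

    public String getGroup() { return group; }
    public String getSex() { return sex; }
}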

Related

Is it OK to have a single index for multiple tenants? (Azure Search)

Is it OK for multiple tenants' QnAs to be stored in a single data source? For example, in Azure Table Storage, all QnAs could be stored in a single table, with each tenant's data differentiated by a unique key, and results then filtered based on that unique key. This would help me reduce the Azure service cost, but are there any drawbacks to using this method?
Sharing a service/index in developer/test environments is fine, but there are additional concerns for production environments. These are some drawbacks, though you might not care about some of them:
competing queries: high traffic volume for one tenant can affect query latency/throughput for another tenant
harder to manage data for individual tenants: can you easily delete all documents for a particular tenant? Would the whole index need to be deleted or recreated for any reason, which would affect all tenants?
flexibility in location: multiple services allow you to put data physically closer to where the queries will be issued. There can also be legal requirements for where data is stored.
susceptible to bugs/human error: people make mistakes; how bad is it to return data for the wrong tenant? How would you guard against that?
permission management: do you need to grant permissions to view data for only a subset of the tenants?

ELK index design according to development goals

I have used Elastic in the past to analyze logs, but I don't have any experience with Elastic "architecture". I have an application that I deployed to multiple machines (200+). I want to connect to each machine and gather metadata like logs, metrics, DB stats and so on.
With that data I want to be able to:
Find problems on each machine and notify about them (finding problems requires joining data from different sources; for example, finding an exception in log1 requires me to go check the DB)
Analyze issues common to all machines and implement an ML model that will be able to predict issues.
I need to create indexes, and I thought about 2 options:
Create one index per machine, so that all the data related to each machine is available in its own index.
Create an index per data source. For example, all DB logs from all machines would go to one dedicated index, while another index would contain only data related to machine metrics (CPU/RAM usage, etc.).
What would be the best way to create those indexes?
OK, now that I have a better understanding of your needs, here's my suggestion:
I strongly recommend not creating an index per machine. I don't know much about your use case(s), but I assume you want to search the data either in Kibana or by implementing search requests in your application.
Let's say you are interested in the RAM usage of every machine. You would need to execute 200 search requests against Elasticsearch, since the data (RAM usage) is spread over 200 indices (of course one could create aliases, but these would have to be updated for every new machine). Furthermore, you wouldn't be able to do basic aggregations like "which machine has the highest RAM usage?" in a convenient way. In my opinion there are plenty more disadvantages, like index management, shard allocation, etc.
So what's a better solution?
As you have already suggested, you should create an index per data source. That way, your indices have a dedicated "purpose", e.g. one index stores database data, another stores system metrics, and so on. Referring to my examples above, you would only need to execute one search request to determine a) the RAM usage of every machine and b) which machine has the highest RAM usage. However, this requires that every document contains a field that references the particular host, like so:
PUT metrics/_doc/1
{
  "system": {
    "ram": {
      "usage": "45%",
      "free": "55%"
    }
  },
  "host": {
    "name": "YOUR HOSTNAME",
    "ip": "192.168.17.100"
  }
}
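With every document carrying that host field, question b) becomes a single terms aggregation over host.name with a max sub-aggregation. A rough sketch with the Elasticsearch Java REST high-level client, assuming the RAM usage is also stored as a numeric field (here called system.ram.usage_pct) and that host.name is mapped as a keyword -- both assumptions, not part of the original answer:

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class RamUsageQuery {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // One request over the shared metrics index: group by host, take the max RAM usage per host.
        SearchSourceBuilder source = new SearchSourceBuilder()
                .size(0)
                .aggregation(AggregationBuilders.terms("by_host").field("host.name")   // assumes a keyword field
                        .subAggregation(AggregationBuilders.max("max_ram").field("system.ram.usage_pct")));

        SearchResponse response = client.search(new SearchRequest("metrics").source(source), RequestOptions.DEFAULT);
        System.out.println(response);  // inspect the by_host buckets to find the highest max_ram
        client.close();
    }
}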
In addition to that I recommend using daily indices. So instead of creating one huge index for the system metrics you would create an index for every day like metrics-2020.01.01, metrics-2020.01.02 and so on. This approach has the following advantages:
your indices will be much smaller, making them easier to manage and (re-)allocate.
after some time you can roughly estimate the data volume and define the number of shards much better. With only one huge index, you would constantly need to adjust the number of shards to keep handling your requests quickly.
furthermore, you can search your data on a day-by-day basis in a convenient way.
you are able to set up ILM policies to automate the maintenance of your indices, e.g. delete metrics indices that are older than X days.
...
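Putting the per-datasource index and the daily naming scheme together, shipping one metrics document from a machine could look roughly like this with the Java REST high-level client (the client setup is illustrative; the document body mirrors the example above):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class MetricsIndexer {
    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // Daily index: metrics-2020.01.01, metrics-2020.01.02, ...
        String index = "metrics-" + LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy.MM.dd"));

        String doc = "{"
                + "\"system\": {\"ram\": {\"usage\": \"45%\", \"free\": \"55%\"}},"
                + "\"host\": {\"name\": \"YOUR HOSTNAME\", \"ip\": \"192.168.17.100\"}"
                + "}";

        // Every document carries the host field, so one index per data source can serve all 200+ machines.
        client.index(new IndexRequest(index).source(doc, XContentType.JSON), RequestOptions.DEFAULT);
        client.close();
    }
}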
I hope I could help you!

Caching Strategy/Design Pattern for complex queries

We have an existing API with a very simple cache-hit/cache-miss system using Redis. It supports being searched by key, so a query that translates to the following is easily cached based on its primary key.
SELECT * FROM [Entities] WHERE PrimaryKeyCol = #p1
Any subsequent request can look the entity up in Redis by its primary key, or fall back to the database and then populate the cache with that result.
We're in the process of building a new API that will allow searches by a lot more params, will return multiple entries in the results, and will be under fairly high request volume (enough so that it will impact our existing DTU utilization in SQL Azure).
Queries will be searchable by several other terms: multiple PKs in one search, various other FK lookup columns, LIKE/CONTAINS statements on text, etc.
In this scenario, are there any design patterns or cache strategies that we could consider? Redis doesn't seem to lend itself particularly well to these types of queries. I'm considering simply hashing the query params and then caching the entire result set as the value under that hash key.
But this feels like a bit of a naive approach given the key-value nature of Redis, and the fact that one entity might be contained within multiple result sets under multiple query hashes.
(For reference, the source of this data is currently SQL Azure, and we're using Azure's hosted Redis service. We're also looking at alternative approaches to hitting the DB, including denormalizing the data, ETLing the data to CosmosDB, and hosting the data in Azure Search, but there are other implications to doing these, including implementation time, "freshness" of data, etc.)
Personally, I wouldn't try and cache the results, just the individual entities. When I've done things like this in the past, I return a list of IDs from live queries, and retrieve individual entities from my cache layer. That way the ID list is always "fresh", and you don't have nasty cache invalidation logic issues.
If you really do have commonly reoccurring searches, you can cache the results (of ids), but you will likely run into issues of pagination and such. Caching query results can be tricky, as you generally need to cache all the results, not just the first "page" worth. This is generally very expensive, and has high transfer costs that exceed the value of the caching.
Additionally, you will absolutely have freshness issues with caching query results. As new records show up, they won't be in the cached list. This is avoided with the entity-only cache, as the list of IDs is always fresh, just the entities themselves can be stale (but that has a much easier cache-expiration methodology).
If you are worried about the staleness of the entities, you can return not only an ID, but also a "Last updated date", which allows you to compare the freshness of each entity to the cache.
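A minimal sketch of that entity-only pattern with the Jedis client: the live query returns just the IDs, the cache is consulted per entity, and misses are loaded from the database and written back with an expiry. The key prefix, TTL and the queryIdsFromDatabase/loadFromDatabase helpers are hypothetical placeholders, not part of the original answer.

import java.util.ArrayList;
import java.util.List;
import redis.clients.jedis.Jedis;

public class EntityCache {
    private static final int TTL_SECONDS = 300;  // assumed expiry, tune to your freshness needs

    private final Jedis jedis = new Jedis("localhost", 6379);

    public List<String> search(String queryParams) {
        // 1. The "fresh" part: run the real query, but only fetch the IDs.
        List<String> ids = queryIdsFromDatabase(queryParams);

        // 2. Hydrate each entity from the cache, falling back to the database on a miss.
        List<String> entities = new ArrayList<>();
        for (String id : ids) {
            String key = "entity:" + id;           // hypothetical key scheme
            String json = jedis.get(key);
            if (json == null) {
                json = loadFromDatabase(id);       // hypothetical DB lookup returning the serialized entity
                jedis.setex(key, TTL_SECONDS, json);
            }
            entities.add(json);
        }
        return entities;
    }

    private List<String> queryIdsFromDatabase(String queryParams) { /* run the SQL search, SELECT only the PKs */ return new ArrayList<>(); }
    private String loadFromDatabase(String id) { /* SELECT * FROM [Entities] WHERE PrimaryKeyCol = id */ return "{}"; }
}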

Searching/selecting query in cache

I have been using a cache for a long time. We store data against some key and fetch it from the cache whenever required. I know that Stack Overflow and many other sites rely heavily on caching. My question is: do they always use a key-value mechanism for caching, or do they form some SQL-like query within the cache? For instance, I want to view last week's report, and this report's contents will vary each day. Do I need to store a different report against each day (with the day as the key), or can I get this result by forming some query that aggregates results across different keys? Does any caching product (like Redis) provide this functionality?
Thanks In Advance
A cache is always a key-value hash table; that is how it stays so fast. If you're doing querying, then you're not doing caching.
What you may be trying to ask is... you could have in your database a table that contains aggregated report data, and you could query against that pre-calculated table.
One of the reasons a cache (e.g. memcached) is fast is the simplicity of its data access and querying protocol.
The more functionality you add, the more trade-offs you will have to make on the efficiency side. A full-fledged SQL engine in a "caching" database is not a good design. However, you can use a data-structure-oriented database like Redis to shape your cached data to suit your querying needs, for example one set or one hash per date.
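As a rough sketch of the "one hash per date" idea with the Jedis client: each day's report counters live in their own hash, and a "last week" view is assembled client-side by reading the seven daily hashes. The key names and fields are made up for illustration.

import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;
import redis.clients.jedis.Jedis;

public class DailyReportCache {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Called whenever an event happens: bump the counter in today's hash.
    public void recordPageView(String page) {
        String key = "report:" + LocalDate.now();   // e.g. report:2020-01-01 (hypothetical key scheme)
        jedis.hincrBy(key, page, 1);
    }

    // "Last week" report: merge the seven daily hashes client-side.
    public Map<String, Long> lastWeekReport() {
        Map<String, Long> totals = new HashMap<>();
        for (int daysAgo = 1; daysAgo <= 7; daysAgo++) {
            String key = "report:" + LocalDate.now().minusDays(daysAgo);
            for (Map.Entry<String, String> e : jedis.hgetAll(key).entrySet()) {
                totals.merge(e.getKey(), Long.parseLong(e.getValue()), Long::sum);
            }
        }
        return totals;
    }
}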
A step further, you can use databases like MongoDB or MemSQL, which are pretty fast and have rich querying support, so an occasional aggregation report won't be an issue.
However, as a design decision, you will have to accept that their caching throughput will not be as high as that of memcached or Redis.

Couchbase as a cache and cache invalidation

I'm thinking about using Couchbase as a cache layer. I'm aware of the many advantages provided by Couchbase, like easy scalability, but what interests me more is the rich document model of Couchbase, compared to the simple key-value one of memcached.
My RDBMS is SQL Server, and we use NHibernate. The queries and the database are already quite optimized and I think that caching is the best option for further scaling.
My plan is to implement a simple relational model between entities (much simpler than the one in the RDBMS) to handle invalidation. When an entity is invalidated (removed from the cache) by the application, all dependent entities would also be removed. The logic for defining the dependencies between entities would be handled at the application level by a dedicated component. There would be 10 or 12 different entity types (I don't want to cache my whole application domain).
My document model in Couchbase would look like this:
Key (the one generated by the application); the key format depends on the entity type
Hashed key (to have a uniform unique key across all entities)
Entity
Dependencies - a list of hashed keys of the entities that must be removed when the main entity is removed
So my questions are:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Any feedback on the general idea?
Maintaining the dependencies between entities can be quite simplified, and might not be such a big issue.
Pierre
I use Couchbase 2.2 in production as a persistent cache layer and am really happy with it (running about 2M documents). My app gets really fast reads (about 1 millisecond). Your idea is valid, and I don't see anything wrong with using Couchbase as entity storage for invalidation. It's a mature and very stable product.
You are correct in your entity design. You can have a main JSON doc that holds a list of references to other child documents, so that before deleting the main document you delete all the children first.
Also, not sure if it's applicable in your case, but you can take advantage of Couchbase's ability to expire documents. When you insert a key/value (JSON doc), you can specify a TTL (time to live) if you know it upfront. This way you don't need to explicitly delete entities from Couchbase.
The delete operation itself is fast (you can run it as an asynchronous operation), and 500K documents is a really small size for a Couchbase cluster. You should see get operations under 1 millisecond.
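A rough sketch of that layout with the Couchbase Java SDK (2.x API): the main document carries a dependencies array of hashed keys plus an optional TTL, and invalidation removes the children before the main document. The bucket name, key scheme and TTL are assumptions for illustration, not from the original answer.

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonArray;
import com.couchbase.client.java.document.json.JsonObject;

public class CacheInvalidation {
    private final Bucket bucket = CouchbaseCluster.create("localhost").openBucket("cache");

    // Store an entity with its dependency list and an optional TTL (in seconds, 0 = never expire).
    public void put(String hashedKey, JsonObject entity, JsonArray dependencyKeys, int ttlSeconds) {
        JsonObject doc = JsonObject.create()
                .put("entity", entity)
                .put("dependencies", dependencyKeys);   // hashed keys of entities to drop with this one
        bucket.upsert(JsonDocument.create(hashedKey, ttlSeconds, doc));
    }

    // Invalidate: remove the dependent documents first, then the main one.
    public void invalidate(String hashedKey) {
        JsonDocument main = bucket.get(hashedKey);
        if (main == null) {
            return;                                     // already gone (e.g. expired via TTL)
        }
        for (Object dep : main.content().getArray("dependencies")) {
            bucket.remove((String) dep);                // could also recurse or run asynchronously
        }
        bucket.remove(hashedKey);
    }
}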
But consider having a minimum of 3 Couchbase nodes in one cluster, so that you can take one node down at any given point in time without compromising the data stored in the cluster. See Sizing a Couchbase Server 2.0 cluster.
Some additional resources:
10 things developers should know about Couchbase
Top 10 things an Ops / Sys admin must know about Couchbase
App Development with Documents, their Schemas and Relationships
Couchbase Models
Here are my thoughts:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Are you looking for keys in your RDBMS or in CB? If in CB, you will need to use a view/index; views are disk-based, but stored in sorted order, so they are no slower than SQL indices. Accessing them in parallel will be faster than in series. It will be the slow point in your operation, though, if you use CB.
Continuing along with this thought, I have used CB successfully to store and navigate a hierarchical data structure with 500k+ nodes in it. CB performs well, but does take a few seconds to spit out the whole index if I need it (which I do if I need to do a mass-update operation).
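For reference, looking up dependent keys through a view with the 2.x Java SDK looks roughly like this (the design document and view names are made up; the view itself would map each document's dependency entries to its id):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.view.ViewQuery;
import com.couchbase.client.java.view.ViewResult;
import com.couchbase.client.java.view.ViewRow;

public class DependencyLookup {
    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("localhost").openBucket("cache");

        // Query a (hypothetical) "dependencies" view in the "entities" design document,
        // keyed by the hashed key of the entity being invalidated.
        ViewResult result = bucket.query(ViewQuery.from("entities", "dependencies")
                .key("HASHED-KEY-TO-INVALIDATE"));

        for (ViewRow row : result) {
            System.out.println("dependent document id: " + row.id());
        }
    }
}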
Any feedback on the general idea?
The idea is sound. In fact, I'm seeing 10x the performance of SQL with hierarchical queries when I run them on my Couchbase cluster. I also found that a single Couchbase instance outperforms multiple instances when doing an index lookup - I do not know why that is (the 2-instance CB index is 5x faster than my SQL setup). To speed things up further, you can parallelize the queries to the CB index.
