Storing metadata and raw data separately - oracle

Is there an advantage to storing the metadata (or indexing data) for a document/*LOB separately from the raw data?
For instance having a table/collection/bucket with index on (name,school)
ID: 123
name: Johny
School: Harvard
Transcript: /*2MB text/binary*/
vs
Metadata
ID: 123
name: Johny
School: Harvard
Data
ID: 123
Transcripts: /*2MB text/binary*/
Let's assume mongodb, although it's really db agnostic perhaps.
db.firstModel.find({},{transcripts:0}) vs db.secondModel.find()
Additionally, if we have aggregation/grouping on the metadata, would the heavy payload in transcripts weigh it down (even though the aggregation is on other fields)? Is it better to aggregate on the metadata collection separately, then retrieve by id from the data collection? Or is it better to respect the database design (keeping everything coupled in a single document)?

In Couchbase, if it works for your use case, an option might be to give your 2MB document an object ID like harvard::johny::123. Every object ID would follow the same pattern, used consistently across your application, so the application can easily piece the ID together. Then you do not have to query or use views. You know it is harvard and johny and his 123rd object, so you can just get it by ID. You already know the answer, there is no querying, and so Couchbase will be very fast.
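A minimal sketch of the key pattern described above. The helper name `make_object_id`, the lower-casing, and the validation rules are assumptions for illustration; only the `harvard::johny::123` shape comes from the answer.

```python
def make_object_id(school: str, name: str, seq: int, sep: str = "::") -> str:
    """Build a deterministic document ID such as 'harvard::johny::123'."""
    # Normalize so the same logical object always maps to the same key.
    parts = [school.strip().lower(), name.strip().lower(), str(seq)]
    # Reject parts that would make the key ambiguous to build or read back.
    if any(not part or sep in part for part in parts):
        raise ValueError("key parts must be non-empty and must not contain the separator")
    return sep.join(parts)


print(make_object_id("Harvard", "Johny", 123))
```

Because the ID is computed rather than looked up, the application can go straight to a get-by-key call without a query or a view.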
That being said, there may be other metadata that you want to keep in that metadata object and index on, and then yes, in Couchbase it might be better to break out the documents like you suggest. In Couchbase it might even be better to put them in separate buckets so the indexers only look at the things they will index.
For an example that may not be entirely applicable to your use case, but should give you an idea of what is possible go here
All of that being said, from experience I do not like keeping large objects like the ones you describe in a DB long term, regardless of the DB. From an operational perspective it is terrible. You are storing what amounts to static data in a layer that needs to be very performant, usually on expensive storage, and you have to back up those objects over time. They become a boat anchor around your neck after a few months/years. I suggest keeping the metadata in a fast system like Couchbase (cache+persistence with replication, etc.), together with a pointer to the large objects, and keeping the large objects themselves in something that is best at dishing out large static objects, like HDFS, Amazon S3, etc.
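The split suggested above can be sketched roughly as follows. Everything here is a stand-in: two plain dicts play the roles of the metadata database and the object store, and the `blob://` URI scheme, `TranscriptMeta`, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranscriptMeta:
    doc_id: int
    name: str
    school: str
    blob_uri: str  # pointer to the large payload, which lives elsewhere


# Stand-ins: a real setup would use a database and an object store here.
metadata_store = {}  # doc_id -> TranscriptMeta
blob_store = {}      # blob_uri -> bytes


def save_transcript(doc_id, name, school, payload):
    uri = f"blob://transcripts/{doc_id}"  # hypothetical URI scheme
    blob_store[uri] = payload
    meta = TranscriptMeta(doc_id, name, school, uri)
    metadata_store[doc_id] = meta
    return meta


def find_by_school(school):
    # Queries and aggregations touch only the small metadata records.
    return [m for m in metadata_store.values() if m.school == school]


def load_payload(meta):
    # The heavy payload is fetched only when it is actually needed.
    return blob_store[meta.blob_uri]


save_transcript(123, "Johny", "Harvard", b"...2MB of transcript data...")
for meta in find_by_school("Harvard"):
    print(meta.name, len(load_payload(meta)))
```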

Related

What are the typical ways to cache the result of a relational database query using Redis?

What do developers commonly use as the key and value to cache the result from a SQL query into Redis? For example, if I have a Users table, and I want to cache the results from the query:
SELECT name, age FROM Users
1) Which Redis data structure should I use? Should I just have a single Key for the query and store the entire object returned by the database as the Value as such:
{ key: { object returned by database } }
Or should I use Redis' List data structure and loop through the rows individually and push them into the List as such:
{ key: [ ... ]}
Wouldn't this add computation time of O(N)? How is this more effective than just simply storing the object returned by the database?
Or should I use Redis' Hash Map data structure and loop through the rows individually and set a unique Key for each row with its corresponding attributes as such:
{ key1: {name: 'Bob', age: 25} }, { key2: {name: 'Sally', age: 15} }, ...
2) What would be a good rule of thumb with regards to the Key? From my understanding, some people just use the SQL query as the Key? But if you do so, does that mean you would have to store the entire object returned by the database as the Value (as per question 1)? Is this the best way to do it? If you are using an ORM, do you still use the SQL query as the key?
This is nicely analyzed in the Database Caching Strategies Using Redis whitepaper, by AWS.
Here are the options discussed in the document. Which one is best is really a design decision based on the tradeoffs you have to make for your specific use case.
Cache the Database SQL ResultSet
Cache a serialized ResultSet object that contains the fetched database
row.
Pro: When data retrieval logic is abstracted (e.g., as in a Data Access Object or DAO layer), the consuming code expects only a
ResultSet object and does not need to be made aware of its
origination. A ResultSet object can be iterated over, regardless of
whether it originated from the database or was deserialized from the
cache, which greatly reduces integration logic. This pattern can be
applied to any relational database.
Con: Data retrieval still requires extracting values from the ResultSet object cursor and does not further simplify data access; it
only reduces data retrieval latency.
Cache Select Fields and Values in a Custom Format
Cache a subset of a fetched database row into a custom structure that
can be consumed by your applications.
Pro: This approach is easy to implement. You essentially store specific retrieved fields and values into a structure such as JSON or
XML and then SET that structure into a Redis string. The format you
choose should be something that conforms to your application’s data
access pattern.
Con: Your application is using different types of objects when querying for particular data (e.g., Redis string and database
results). In addition, you are required to parse through the entire
structure to retrieve the individual attributes associated with it.
Cache Select Fields and Values into an Aggregate Redis Data Structure
Cache the fetched database row into a specific data structure that can
simplify the application’s data access.
Pro: When converting the ResultSet object into a format that simplifies access, such as a Redis Hash, your application is able to
use that data more effectively. This technique simplifies your data
access pattern by reducing the need to iterate over a ResultSet object
or by parsing a structure like a JSON object stored in a string. In
addition, working with aggregate data structures, such as Redis Lists,
Sets, and Hashes provide various attribute level commands associated
with setting and getting data, eliminating the overhead associated
with processing the data before being able to leverage it.
Con: Your application is using different types of objects when querying for particular data (e.g., Redis Hash and database results).
Cache Serialized Application Object Entities
Cache a subset of a fetched database row into a custom structure that
can be consumed by your applications.
Pro: Use application objects in their native application state with simple serializing and deserializing techniques. This can
rapidly accelerate application performance by minimizing data
transformation logic.
Con: Advanced application development use case
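To make the difference between the "custom format" and "aggregate data structure" options concrete, here is a small sketch. A plain dict stands in for Redis, and the key names `users:all` and `user:<id>` are assumptions; with a real Redis client the two writes would correspond roughly to SET and HSET.

```python
import json

rows = [
    {"id": 1, "name": "Bob", "age": 25},
    {"id": 2, "name": "Sally", "age": 15},
]

cache = {}  # stand-in for Redis

# Custom format: the whole result set as one JSON string (roughly a Redis SET).
cache["users:all"] = json.dumps(rows)

# Aggregate structure: one hash per row (roughly HSET user:<id> name ... age ...).
for row in rows:
    cache[f"user:{row['id']}"] = {"name": row["name"], "age": str(row["age"])}


def age_from_string(user_id):
    # The entire structure must be parsed to read a single attribute.
    for row in json.loads(cache["users:all"]):
        if row["id"] == user_id:
            return row["age"]
    return None


def age_from_hash(user_id):
    # Field-level access (roughly HGET user:<id> age); hash values are strings.
    row_hash = cache.get(f"user:{user_id}")
    return int(row_hash["age"]) if row_hash else None


print(age_from_string(2), age_from_hash(2))
```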
Regarding 2)
What would be a good rule of thumb with regards to the Key?
Using the SQL query as the key is OK as long as you are sure it is unique. Add prefixes if there is a risk of non-uniqueness: you may have other databases with the same table names, leading to the same queries. Also normalize the keys to all lower case or all upper case, because Redis keys are case-sensitive.
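One way to follow that advice is sketched below. The function name `query_cache_key`, the `namespace:query:digest` layout, and the choice of SHA-256 are assumptions, not something the whitepaper or Redis prescribes.

```python
import hashlib
import json


def query_cache_key(namespace, sql, params=()):
    """Derive a stable cache key from a query and its bound parameters."""
    # Collapse whitespace and lower-case the statement so trivially different
    # spellings of the same query share one key. Lower-casing would also alter
    # string literals embedded in the SQL, so values should be bound parameters.
    normalized_sql = " ".join(sql.split()).lower()
    payload = json.dumps([normalized_sql, list(params)], default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # The namespace prefix keeps identical queries against different databases apart.
    return f"{namespace}:query:{digest}"


print(query_cache_key("appdb", "SELECT name, age FROM Users"))
```

Hashing keeps keys short and uniform in length; the cost is that the key no longer shows which query it belongs to.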
But if you do so, does that mean you would have to store the entire object returned by the database as the Value (as per question 1)?
Not necessarily, it comes down to what processing you are doing with the query. Chances are some are best stored as raw entire object for processing, some as JSON-stringified object to return quickly to the client, some as rows, etc. The best is to adapt accordingly.
Is this the best way to do it?
Not necessarily.
If you are using an ORM, do you still use the SQL query as the key?
You may if your ORM easily exposes the SQL Query programmatically, and it is consistent.
I wouldn't get fixated on the idea of using the SQL query as the key. Use something you can be sure is consistent, that keeps your processing efficient, and that gives you clear rules for invalidation. It could be the method call with its parameters, the web API call, etc.

Is it bad practice to store JSON members with Redis GEOADD?

My application should handle a lot of entities (100.000 or more) with location and needs to display them only within a given radius. I basically store everything in SQL but using Redis for caching and optimization (mainly GEORADIUS).
I am adding the entities like the following example (not exactly this, I use Laravel framework with the built-in Redis facade but it does the same as here in the background):
GEOADD k 19.059982 47.494338 {\"id\":1,\"name\":\"Foo\",\"address\":\"Budapest, Astoria\",\"lat\":47.494338,\"lon\":19.059982}
Is it bad practice? Or will it make a negative impact on performance? Should I store only ID-s as member and make a following query to get the corresponding entities?
This is a matter of the requirements. There's nothing wrong with storing the raw data as members as long as each member is unique (and it is unique, given the "id" field). In fact, this is both simple and performant, as all data is returned with a single query (assuming that's what is actually needed).
That said, there are at least three considerations for storing the data outside the Geoset, and just "referencing" it by having members reflect some form of their key names:
A single data structure, such as a Geoset, is limited by the resources of a single Redis server. Storing a lot of data and members can require more memory than a single server can provide, which would limit the scalability of this approach.
Unless each entry's data is small, it is unlikely that all query types would require all data returned. In such cases, keeping the raw data in the Geoset generates a lot of wasted bandwidth and ultimately degrades performance.
When data needs to be updated, it can become too expensive to try and update (i.e. ZREM and then GEOADD) small parts of it. Having everything outside, perhaps in a Hash (or maybe something like RedisJSON), makes more sense then.
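The "reference by member name" layout from the list above might look like this sketch. Two dicts stand in for the Geoset and for per-entity Hashes, the radius search is a naive haversine scan rather than GEORADIUS, and the `entity:<id>` naming is an assumption.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

geo_index = {}  # member name -> (lon, lat); stand-in for the Geoset
details = {}    # member name -> entity fields; stand-in for one Hash per entity


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def add_entity(entity):
    member = f"entity:{entity['id']}"
    geo_index[member] = (entity["lon"], entity["lat"])  # only the key is a geo member
    details[member] = dict(entity)                      # the payload lives outside


def rename(entity_id, name):
    # Updating a field never touches the geo index.
    details[f"entity:{entity_id}"]["name"] = name


def within_radius(lon, lat, radius_m):
    return [m for m, (mlon, mlat) in geo_index.items()
            if haversine_m(lon, lat, mlon, mlat) <= radius_m]


add_entity({"id": 1, "name": "Foo", "address": "Budapest, Astoria",
            "lat": 47.494338, "lon": 19.059982})
rename(1, "Bar")
for member in within_radius(19.06, 47.49, 1000):
    print(member, details[member]["name"])
```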

Searching/selecting query in cache

I have been using caches for a long time. We store data against some key and fetch it from the cache whenever required. I know that StackOverflow and many other sites rely heavily on caching. My question is: do they always use a key-value mechanism for caching, or do they form some SQL-like query within the cache? For instance, I want to view last week's report. This report's content will vary each day. Do I need to store a different report for each day (with the day as the key), or can I get this result by forming some query that aggregates results across different keys? Does any caching product (like Redis) provide this functionality?
Thanks In Advance
Cache is always done as a key-value hash table. This is how it stays so fast. If you're doing querying then you're not doing cache.
What you may be trying to ask is... you could have in your database a table that contains aggregated report data. And you could query against that pre-calculated table.
One of the reasons for cache (e.g. memcached ) being fast is its simplicity of data access and querying protocol.
The more functionality you add, the more you trade off in efficiency. A full-fledged SQL engine in a "caching" database is not a good design. You can, however, use a data-structure-oriented database like Redis and shape your cached data to suit your querying needs. For example: one set or one hash for each date.
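A sketch of the "one hash for each date" idea, with a dict standing in for Redis. The `report:<date>` key format and the function names are assumptions. The weekly report is assembled by reading seven known keys, so no query engine is needed.

```python
from datetime import date, timedelta

cache = {}  # stand-in for Redis: key -> {metric: count}


def day_key(day):
    return f"report:{day.isoformat()}"


def record(day, metric, amount=1):
    # Roughly HINCRBY report:<date> <metric> <amount> against a real Redis.
    bucket = cache.setdefault(day_key(day), {})
    bucket[metric] = bucket.get(metric, 0) + amount


def last_n_days_report(end, n=7):
    # The keys are computable from the dates, so this is n plain lookups.
    totals = {}
    for i in range(n):
        bucket = cache.get(day_key(end - timedelta(days=i)), {})
        for metric, value in bucket.items():
            totals[metric] = totals.get(metric, 0) + value
    return totals


today = date(2024, 1, 14)
record(today, "signups", 3)
record(today - timedelta(days=2), "signups", 2)
record(today - timedelta(days=10), "signups", 100)  # outside the 7-day window
print(last_n_days_report(today))
```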
Going a step further, you can use databases like MongoDB or MemSQL, which are pretty fast and have rich querying support, so an aggregation report once in a while won't be an issue.
However, as a design decision, you will have to accept that their caching throughput will not be as high as that of memcached or Redis.

Couchbase as a cache and cache invalidation

I'm thinking about using Couchbase as a cache layer. I'm aware of the many advantages provided by Couchbase, like the easy scalability. But what interests me more is the rich document model of couchbase, compared to the simple key-value one of memcached.
My RDBMS is SQL Server, and we use NHibernate. The queries and the database are already quite optimized and I think that caching is the best option for further scaling.
My project is to implement a simple relational model between entities (much simpler than the one in the RDBMS), to handle invalidation. When an entity is invalidated (removed from cache) by the application, all dependent entities could also be removed. The logic of defining the dependencies between entities would be handled at the application level by a dedicated component. There would be 10 or 12 different entities (I don't want to cache all my application domain).
My document model in Couchbase would look like this:
Key (the one generated by the application), keys' format depends on entity type
Hashed key (to have a uniform unique key across all entities)
Entity
Dependencies - list of hashed keys of the entities that must be removed when main entity is removed
So my questions are:
On invalidation, we would need to resolve a graph of dependencies (asynchronously). Is it fast to look for specific keys with around 500k entities?
Any feedback on the general idea?
Maintaining the dependencies between entities can be quite simplified, and might not be such a big issue.
Pierre
I use Couchbase 2.2 in production as a persistent cache layer and am really happy with it (running about 2M documents). My app is getting really fast gets (1 millisecond). Your idea is valid and I don't see anything wrong with using Couchbase as an entity store for invalidation. It's a mature and very stable product.
You are correct in your entity design. You can have a main JSON doc that holds a list of references to other child documents, so that before deleting the main document you delete all of its children first.
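A rough sketch of resolving such a dependency list on invalidation. A dict stands in for the Couchbase bucket, and the document shape (`entity` plus `dependencies`) follows the question. The breadth-first traversal, which here removes the parent before its children, and the function names are assumptions.

```python
from collections import deque

cache = {}  # stand-in for the bucket: hashed key -> document


def put(key, entity, dependencies=()):
    cache[key] = {"entity": entity, "dependencies": list(dependencies)}


def invalidate(key):
    """Remove a document and, transitively, everything listed as depending on it."""
    removed = []
    seen = {key}  # guards against cycles in the dependency graph
    queue = deque([key])
    while queue:
        current = queue.popleft()
        doc = cache.pop(current, None)  # each step is a plain get/delete by key
        if doc is None:
            continue  # already expired or removed
        removed.append(current)
        for dep in doc["dependencies"]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return removed


put("order:1", {"total": 10}, dependencies=["invoice:1", "shipment:1"])
put("invoice:1", {"paid": False}, dependencies=["shipment:1"])
put("shipment:1", {"sent": False}, dependencies=["order:1"])  # cycle on purpose
put("customer:9", {"name": "Pierre"})
print(invalidate("order:1"), sorted(cache))
```

Every step is a lookup by key, so no view or index is involved in walking the graph.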
Also, not sure if it's applicable in your case, but you can take advantage of Couchbase's ability to expire documents. When you insert a key/value (JSON doc) you can specify a TTL (time to live) if you know it upfront. This way you don't need to explicitly delete entities from Couchbase.
The delete operation itself is fast (you can run it as an asynchronous operation), and 500K documents is a really small size for a Couchbase cluster. You should see get operations under 1 millisecond.
But consider having minimum 3 Couchbase nodes in one cluster, so that you can take one node down at any given point of time without compromising data stored in the cluster. See Sizing a Couchbase Server 2.0 cluster
Some additional resources:
10 things developers should know about Couchbase
Top 10 things an Ops / Sys admin must know about Couchbase
App Development with Documents, their Schemas and Relationships
Couchbase Models
Here are my thoughts:
On invalidation, we would need to resolve a graph of dependencies
(asynchronously). Is it fast to look for specific keys with around
500k entities?
Are you looking for keys in your RDBMS or in CB? If in CB, you will need to use a view/index; now, views are disk-based, but stored in sorted order so they are no slower than SQL indices. Accessing them in parallel will be faster than in series. It will be the slow point in your operation though if you use CB.
Continuing along with this thought, I have used CB successfully to store and navigate a hierarchical data structure with 500k+ nodes in it. CB performs well, but does take a few seconds to spit out the whole index if I need it (which I do if I need to do a mass-update operation).
Any feedback on the general idea?
The idea is sound. In fact, I'm seeing 10x the performance of SQL with hierarchical queries when I run them on my Couchbase cluster. I also found that a single Couchbase instance outperforms multiple instances when doing an index lookup - I do not know why that is (the 2-instance CB index is 5x faster than my SQL setup). To speed things up further, you can parallelize the queries to the CB index.

How is wordweb english dictionary implemented?

We need to keep some in-memory data structure to keep english word dictionary in memory.
When the computer/wordweb starts, we need to read the dictionary from disk into an in-memory data structure.
This question asks how the in-memory data structure is populated from disk in typical real-world dictionaries, say wordweb.
Ideally we would like to keep the dictionary on disk in the same form we need it in memory, so that we don't have to spend time building the in-memory data structure and can just read it off the disk. But for linked lists, pointers etc., how do we store the same image on disk? Would some relative addresses etc. help here?
Typically, is the entire dictionary read and stored in memory, or is only part of it (handles) kept in memory, with leaf page IOs done when searching for a specific word?
If somebody wants to help with what that in-memory data structure is typically, please go ahead.
Thanks,
You mentioned pointers, so I'm assuming you're using C++; if that's the case and you want to read directly from disk into memory without having to "rebuild" your data structure, then you might want to look into serialization: How do you serialize an object in C++?
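To illustrate the "relative addresses" idea from the question, here is a small sketch (in Python rather than C++, purely for brevity) of a flat layout that is searchable as loaded, with no rebuild step. The layout itself (a count, an offset table, then NUL-terminated words) is an assumption made up for this example, not a format any real dictionary is known to use.

```python
import struct


def pack_words(words):
    """Layout: uint32 count | count x uint32 offsets | NUL-terminated UTF-8 words."""
    encoded = [w.encode("utf-8") for w in sorted(set(words))]
    header_size = 4 + 4 * len(encoded)
    offsets, pos = [], header_size
    for item in encoded:
        offsets.append(pos)  # offset relative to the start of the blob
        pos += len(item) + 1
    blob = struct.pack("<I", len(encoded))
    blob += struct.pack(f"<{len(encoded)}I", *offsets)
    blob += b"".join(item + b"\x00" for item in encoded)
    return blob


def word_at(blob, index):
    # Follow a stored offset instead of a pointer.
    (offset,) = struct.unpack_from("<I", blob, 4 + 4 * index)
    end = blob.index(b"\x00", offset)
    return blob[offset:end].decode("utf-8")


def contains(blob, word):
    # Binary search over the offset table, directly on the loaded bytes.
    (count,) = struct.unpack_from("<I", blob, 0)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = word_at(blob, mid)
        if candidate == word:
            return True
        if candidate < word:
            lo = mid + 1
        else:
            hi = mid
    return False


blob = pack_words(["pear", "apple", "fig"])
print(contains(blob, "fig"), contains(blob, "kiwi"))
```

Because every reference is an offset from the start of the blob, the same bytes can be written to a file and read back unchanged.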
However, you generally don't want to load the entire dictionary anyway, especially if it's a user application. If the user is looking up dictionary words, then reading from disk happens so fast that the user will never notice the "delay." If you're servicing hundreds or thousands of requests, then it might make sense to cache the dictionary into memory.
So how many users do you have?
What kind of load are you expecting to have on the application?
Wordweb uses an SQLite database as its backend. It makes sense to me to use a database system to store the content, so it's easier to quickly GET the content the user is looking for.
Wordweb has Word prediction as well... so it will be a query to database like
select word from table where word like 'ab%';
on the other hand, when the user presses enter for the word
select meaning from table where word='abandon'
You do not want to be deserializing the content from disk into memory while the user is typing or after they have pressed Enter to search. Since the data will be large (a dictionary), that will probably take more time than the user will tolerate for every word search.
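The two lookups above can be tried with Python's built-in sqlite3 module. This is only a sketch: the table name `dictionary`, its columns, and the sample rows are made up for illustration and say nothing about Wordweb's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real application would open a file on disk
conn.execute("CREATE TABLE dictionary (word TEXT PRIMARY KEY, meaning TEXT)")
conn.executemany(
    "INSERT INTO dictionary VALUES (?, ?)",
    [
        ("abandon", "to leave behind"),
        ("abacus", "a counting frame"),
        ("able", "having the power to do something"),
        ("zebra", "a striped African equine"),
    ],
)


def suggest(prefix, limit=10):
    """Word prediction: prefix match with LIKE, wildcards in the input escaped."""
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT word FROM dictionary WHERE word LIKE ? ESCAPE '\\' ORDER BY word LIMIT ?",
        (escaped + "%", limit),
    ).fetchall()
    return [row[0] for row in rows]


def lookup(word):
    """Exact match once the user presses Enter."""
    row = conn.execute(
        "SELECT meaning FROM dictionary WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None


print(suggest("ab"))
print(lookup("abandon"))
```

Only the rows that match are read, so nothing has to be loaded up front.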
Otherwise, why don't you create a JSON file containing all the meanings, as a short form of the dictionary?
