Best-performing method for associating arbitrary key/value pairs with a table row in a Postgres DB?

I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data on a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?

That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
Indexes support the "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.

hstore is generally a good solution for that, but personally I prefer to use plain key:value tables: one table with definitions, another table with values, a relation to bind values to definitions, and a relation to bind values to a particular record in the other table.
Why am I against hstore? Because it's like the registry pattern, which is often mentioned as an example of an anti-pattern. You can put anything in there, and it's hard to validate whether an entry is still needed. When loading a whole row (especially through an ORM), the whole hstore is loaded, which can contain a lot of junk and make very little sense. Not to mention that the hstore data type has to be converted into a type of your language and converted back again when saved, so you get some type-conversion overhead.
So I'm actually trying to convert all hstores at the company I work for into simple key:value tables. It's not that hard a task, though. The structures kept in hstore here are huge (or at least big), and reading/writing an object creates a huge overhead of function calls. As a result, a simple query like "select * from base_product where id = 1;" makes the server sweat and hits performance badly. I want to point out that the performance issue is not caused by the db, but by Python having to convert the results received from Postgres several times. Key:value tables do not require such conversion.

Since you do not control the data, do not overcomplicate this.
create table sometable_attributes (
  sometable_id int not null references sometable(sometable_id),
  attribute_key varchar(50) not null check (length(attribute_key) > 0),
  attribute_value varchar(5000) not null,
  primary key (sometable_id, attribute_key)
);
This is like EAV, but without an attribute_keys table, which adds no value if you do not control what will be in there.
For speed you should periodically run "cluster sometable_attributes using sometable_attributes_idx" (where sometable_attributes_idx is an index on (sometable_id, attribute_key)), so that all attributes for one row are physically close together.
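As a rough illustration of how such a side table is read and written, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made only so the example runs without a server). The helper names set_attribute and get_attributes are hypothetical, and the column types are simplified to SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE sometable (sometable_id INTEGER PRIMARY KEY);
    CREATE TABLE sometable_attributes (
        sometable_id    INTEGER NOT NULL REFERENCES sometable(sometable_id),
        attribute_key   TEXT NOT NULL CHECK (length(attribute_key) > 0),
        attribute_value TEXT NOT NULL,
        PRIMARY KEY (sometable_id, attribute_key)
    );
""")

def set_attribute(row_id, key, value):
    """Insert or overwrite one key/value pair for a row (hypothetical helper)."""
    conn.execute(
        "INSERT OR REPLACE INTO sometable_attributes VALUES (?, ?, ?)",
        (row_id, key, value),
    )

def get_attributes(row_id):
    """Read all key/value pairs of one row as a dict.

    The lookup uses the leading column of the composite primary key.
    """
    rows = conn.execute(
        "SELECT attribute_key, attribute_value FROM sometable_attributes "
        "WHERE sometable_id = ?",
        (row_id,),
    )
    return dict(rows.fetchall())

conn.execute("INSERT INTO sometable VALUES (1)")
set_attribute(1, "color", "red")
set_attribute(1, "size", "XL")
set_attribute(1, "color", "blue")  # overwrites the earlier value
attrs = get_attributes(1)
print(attrs)
```

The read is a single query per parent row, which is what keeps this layout reasonably fast for the "read often, write occasionally" requirement.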

Related

Using AWS Appsync with DynamoDB, should you model relationships by storing "redundant copies" of related data on the same table (denormalization)?

I was recently reading through this section in the ElasticSearch documentation (or the guide, to be more precise). It says that you should try to use a non-relational database the intended way, meaning you should avoid joins between different tables because they are not designed to handle those well. This also reminds me of the section in the DynamoDB docs stating that most well-designed DynamoDB backends only require one table.
Let's take as an example a recipes database where each recipe is using several ingredients. Every ingredient can be used in many different recipes.
Option 1: The obvious way to me to model this in AppSync and DynamoDB would be to start with an ingredients table which has one item per ingredient storing all the ingredient data, with the ingredient id as partition key. Then I have another recipes table with the recipe id as partition key and an ingredients field storing all the ingredient ids in an array. In AppSync I could then query a recipe by doing a GetItem request by recipe id and then resolving the ingredients field with a BatchGetItem on the ingredients table. Let's say a recipe contains 10 ingredients on average, so this would mean 11 GetItem requests sent to the DynamoDB tables.
Option 2: I would consider this a "join like" operation which is apparently not the ideal way to use non-relational databases. So, alternatively I could do the following: Make "redundant copies" of all the ingredient data on the recipes table and not only save the ingredient id there, but also all the other data from the ingredients table. This could drastically increase disk space usage, but apparently disk space is cheap and the increase in performance by only doing 1 GetItem request (instead of 11) could be worth it. As discussed later in the ElasticSearch guide this would also require some extra work to ensure concurrency when ingredient data is updated. So I would probably have to use a DynamoDB stream to update all the data in the recipes table as well when an ingredient is updated. This would require an expensive Scan to find all the recipes using the updated ingredient and a BatchWrite to update all these items. (An ingredient update might be rare though, so the increase in read performance might be worth that.)
I would be interested in hearing your thoughts on this:
Which option would you choose and why?
The second "more non-relational way" to do this seems painful and I am worried that with more levels/relations appearing (for example if users can create menus out of recipes), the resulting complexity could get out of hand quickly when I have to save "redundant copies" of the same data multiple times. I don't know much about relational databases, but these things seem much simpler there when every data has its unique location and that's it (I guess that's what "normalization" means).
Is getRecipe in Option 1 really 11 times more expensive (performance and cost wise) than in Option 2? Or do I misunderstand something?
Would Option 1 be a cheaper operation in a relational database (e.g. MySQL) than in DynamoDB? Even though it's a join if I understand correctly, it's also just 11 ("NoSQL intended way") GetItem operations. Could this still be faster than 1 SQL query?
If I have a very relational data structure could a non-relational database like DynamoDB be a bad choice? Or is AppSync/GraphQL a way to still make it a viable choice (by allowing Option 1 which is really easy to build)? I read some opinions that constantly working around the missing join capability when querying NoSQL databases and having to do this on the application side is the main reason why it's not a good fit. But AppSync might be a way to solve this problem. Other opinions (including the DynamoDB docs) mention performance issues as the main reason why you should always query just one table.
This is quite late, I know, but might help someone down the road.
Start with an entity relationship diagram as this will help determine your options. Even in NoSQL, there are standard ways of modeling relationships.
Next, define your access patterns. Go through all the CRUDL operations and make sure that for each operation, you can access the specific data for that operation. For example, in your option 1 where ingredients are stored in an array in a field: think through an access pattern where you might need to delete an ingredient in a recipe. To do this, you need to know the index of the item in the array. Therefore, you have to obtain the entire array, find the index of the item, and then issue another call to update the array, taking into account possible race conditions.
Doing this in your application, while possible, is not efficient. You can also code this up in your resolver, but attempting to do so using velocity template language is not worth the headache, trust me.
The TL;DR is to model your entire application's entity relationship diagram, and think through all the access patterns. If the relationship is one-to-many, you can either denormalize the data, use a composite sort key, or use secondary indexes. If many-to-many, you start getting into adjacency lists and other advanced strategies. Alex DeBrie has some great resources here and here.
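To make the adjacency-list idea a little more concrete, here is a small in-memory sketch in Python. It does not call DynamoDB at all: the two dictionaries merely stand in for a table's primary index and an inverted secondary index, and the key prefixes (RECIPE#, INGREDIENT#) and item contents are made up for illustration.

```python
from collections import defaultdict

# Each item carries a partition key (pk) and a sort key (sk), as in a single-table design.
items = [
    {"pk": "RECIPE#pancakes", "sk": "RECIPE#pancakes", "title": "Pancakes"},
    {"pk": "RECIPE#pancakes", "sk": "INGREDIENT#flour", "amount": "200 g"},
    {"pk": "RECIPE#pancakes", "sk": "INGREDIENT#egg", "amount": "2"},
    {"pk": "RECIPE#omelette", "sk": "RECIPE#omelette", "title": "Omelette"},
    {"pk": "RECIPE#omelette", "sk": "INGREDIENT#egg", "amount": "3"},
]

by_pk = defaultdict(list)  # stands in for the table's primary index
by_sk = defaultdict(list)  # stands in for an inverted secondary index (sk used as partition key)
for item in items:
    by_pk[item["pk"]].append(item)
    by_sk[item["sk"]].append(item)

def get_recipe_with_ingredients(recipe_id):
    """One partition lookup returns the recipe item plus all of its ingredient edges."""
    return by_pk[f"RECIPE#{recipe_id}"]

def recipes_using(ingredient_id):
    """The inverted index answers 'which recipes use X?' without scanning every item."""
    return sorted(item["pk"] for item in by_sk[f"INGREDIENT#{ingredient_id}"])

print(get_recipe_with_ingredients("pancakes"))
print(recipes_using("egg"))
```

The second lookup is the access pattern that would otherwise need the expensive Scan mentioned in the question when an ingredient changes.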

Efficient sqlite query based on list of primary keys

For querying an sqlite table based on a list of IDs (i.e. distinct primary keys) I am using the following statement (example based on the Chinook Database):
SELECT * FROM Customer WHERE CustomerId IN (1,2,3,8,20,35)
However, my actual list of IDs might become rather large (>1000). Thus, I was wondering if this approach using the IN statement is the most efficient or if there is a better/optimized way to query an sqlite table based on a list of primary keys.
If the number of elements in the IN is large enough, SQLite constructs a temporary index for them. This is likely to be more efficient than creating a temporary table manually.
The length of the IN list is limited only by the maximum length of an SQL statement, and by memory.
Because the statement you wrote does not include any instructions to SQLite about how to find the rows you want the concept of "optimizing" doesn't really exist -- there's nothing to optimize. The job of planning the best algorithm to retrieve the data belongs to the SQLite query optimizer.
Some databases do have idiosyncrasies in their query optimizers which can lead to performance issues but I wouldn't expect SQLite to have any trouble finding the correct algorithm for this simple query, even with lots of values in the IN list. I would only worry about trying to guide the query optimizer to another execution plan if and when you find that there's a performance problem.
SQLite Optimizer Overview
IN (expression-list) does use an index if available.
Beyond that, I can't glean any guarantees from it, so the following is subject to a performance measurement.
Axis 1: how to pass the expression-list
hardcode as a string. Overhead for int-to-string conversion and string-to-int parsing
bind parameters (i.e. the statement is ... WHERE CustomerID in (?,?,?,?,?,?,?,?,?,?....), which is easier to build from a predefined string than hardcoded values). Avoids the int → string → int conversion, but the default limit on the number of parameters is 999 (32766 since SQLite 3.32.0). The limit is set at compile time by SQLITE_MAX_VARIABLE_NUMBER and can only be lowered at run time via SQLITE_LIMIT_VARIABLE_NUMBER; raising it might lead to excessive allocations.
Temporary table. Possibly less efficient than any of the above methods after the statement is prepared, but that doesn't help if most time is spent preparing the statement
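One way to stay under the parameter limit while still using bind parameters is to split the ID list into chunks. The following is only a sketch using Python's sqlite3 module; the function name fetch_by_ids, the chunk size of 900, and the simplified Customer table with a single Name column are assumptions for the example, not the Chinook schema.

```python
import sqlite3

def fetch_by_ids(conn, ids, chunk_size=900):
    """Fetch Customer rows for many primary keys, at most chunk_size per statement."""
    ids = list(ids)
    rows = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # Build "?,?,?" with one placeholder per id in this chunk.
        placeholders = ",".join("?" * len(chunk))
        sql = (
            "SELECT CustomerId, Name FROM Customer "
            f"WHERE CustomerId IN ({placeholders})"
        )
        rows.extend(conn.execute(sql, chunk).fetchall())
    return rows

# Stand-in data so the sketch is runnable on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 5001)],
)

wanted = list(range(1, 5001, 2))  # 2500 ids, more than one chunk
result = fetch_by_ids(conn, wanted)
print(len(result))
```

Whether this beats a single long hardcoded IN list is something to measure; the sketch only shows that the limit itself need not be an obstacle.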
Axis 2: Statement optimization
If the same expression-list is used in multiple queries against changing CustomerIDs, one of the following may help:
reusing a prepared statement with hardcoded values (i.e. don't pass 1001 parameters)
create a temporary table for the CustomerIDs with index (so the index is created once, not on the fly for every query)
If the expression-list is different with every query, it is probably best to let SQLite do its job. The following might be an improvement
create a temp table for the expression-list
bulk-insert expression-list elements using union all
use a sub query
(from my experience with SQLite, I'd expect it to be on par or slightly worse)
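The temp-table variant can be sketched as follows with Python's sqlite3 module. Note one substitution: instead of a bulk insert built with union all, this uses executemany, which is the idiomatic way to insert many rows from Python. The table name wanted and the simplified Customer table are assumptions for the example.

```python
import sqlite3

# Stand-in data so the sketch is runnable on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 3001)],
)

wanted_ids = [1, 2, 3, 8, 20, 35] + list(range(1000, 2500))

# The INTEGER PRIMARY KEY makes the temp table itself the lookup structure,
# so it is built once and can be reused by several queries.
conn.execute("CREATE TEMP TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(i,) for i in wanted_ids])

matched = conn.execute(
    "SELECT CustomerId, Name FROM Customer "
    "WHERE CustomerId IN (SELECT id FROM wanted) "
    "ORDER BY CustomerId"
).fetchall()
print(len(matched))
```

As the answer says, this is not guaranteed to be faster than a plain IN list; it mainly pays off when the same ID set is reused across queries.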
Axis 3: Ask Richard
the sqlite mailing list (yeah I know, that technology is even older than rotary phones!) is pretty active, with often excellent advice, including from the author of SQLite. 90% chance someone will dismiss you with "Measure before asking such a question!", 10% chance someone gives you detailed insight.

Why no primary key

I have inherited a database with tables that lack primary keys. It's an OLTP database. One of the tables in question has ~300k records and no primary key implemented, even though the rest of the schema tells me one column is used as a primary key: it is replicated in another table under an identical name, etc. In other words, this is not an 'end of line' table.
This database also does not implement FKs.
My question is - is there ANY valid reason for a table (in Oracle for that matter) NOT to have a primary key?
I think a PK is mandatory in almost all cases. There are lots of reasons, but I'll cover just some of them:
It prevents duplicate rows from being inserted.
Rows will be referenced, so the table must have a key for that.
I have seen very few cases where tables were created without a PK (e.g. tables for logs).
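The duplicate-prevention point can be demonstrated with a short sketch. It uses Python's sqlite3 module rather than Oracle (an assumption made purely so the example is runnable), and the table names with_pk and without_pk are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE with_pk (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE without_pk (id INTEGER, payload TEXT)")

for table in ("with_pk", "without_pk"):
    conn.execute(f"INSERT INTO {table} VALUES (1, 'first')")

# With a primary key, the database itself rejects the second row with id 1.
try:
    conn.execute("INSERT INTO with_pk VALUES (1, 'duplicate')")
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

# Without one, the duplicate is silently accepted and id 1 is now ambiguous.
conn.execute("INSERT INTO without_pk VALUES (1, 'duplicate')")
dup_count = conn.execute(
    "SELECT COUNT(*) FROM without_pk WHERE id = 1"
).fetchone()[0]

print(pk_rejected, dup_count)
```

Once a duplicate like this exists, any other table that refers to id 1 can no longer tell which row it means.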
Not specific to Oracle, but I recall reading about one such use case where MySQL was highly customized for a dam (electricity generation) project, I think. The input data from sensors arrived at a rate on the order of 100-1000 records per second or so. They were using timestamps for each record, so they didn't need a primary key (as with the logs/logging mentioned in another answer here).
So good reasons would be:
Overhead, in the case of high frequency transactions
Necessity or Un-necessity in that case
"Uniqueness" maintained or inferred by application, not by db
In a normalized table, if every record needs to be unique and every field is referenced in other tables, then a PK adds index overhead on top of that, and it might never actually be used in any SQL query (I disagree with this, IMHO, but it's possible). The table should still have a unique index encompassing all the fields.
Bad reasons are infinite :-)
The most frequent bad reason which is actually responsible for the lack of a primary key is when DBs are designed by application/code-developers with little or no DB experience, who want to (or think they should) handle all data constraints in the application.
Any valid reason? I'd say "No"--I'm a database guy--but there are places that insist on using the database as a dumb data store. They usually implement all integrity "constraints" in application code.
Putting integrity constraints into application code isn't usually done to improve performance. In fact, if you built one database that enforces all the known constraints, and you built another with functionally identical constraints only in application code, the first one would almost certainly run rings around the second one.
Instead, application-level constraints usually hope to increase flexibility. (And, in the process, some of the known constraints are usually dropped, which appears to improve performance.) If it becomes inconvenient to enforce certain constraints in order to bulk load some scruffy data, an application programmer can just side-step the application-level constraints for a little while, then clean up the data when it's more convenient.
I'm not a db expert, but I remember a conversation with a friend who worked in the Oracle apps dept. who told me that this was done to handle emergencies. If there was a problem in some report being generated which you could fix by putting in a row, db-level constraints often stand in your way. They generally implemented things like unique primary keys in the application rather than the database. It was inefficient but good enough for them, and much more manageable in a disaster recovery scenario.
You need a primary key to enforce uniqueness for a subset of its columns (useful if you need to refer to individual rows). It also speeds up certain queries because of the index associated to it.
If you do not need that index, or that uniqueness constraint, then you may not need a primary key (the index does not come free).
An example that comes to mind are logging tables, that just record some data (that is never updated or queried for individual records).
There is a small overhead when inserting into a table with an index, and you need an index if you have a primary key. The downside of having no index, of course, is that finding a row is very costly.

When do we really need a key/value database instead of a key/value cache server?

Most of the time, we just get the result from the database and then save it in a cache server with an expiration time.
When do we need to persist that key/value pair, and what's the significant benefit of doing so?
If you need to persist the data, then you would want a key/value database. In particular, as part of the NoSQL movement, many people have suggested replacing traditional SQL databases with Key/Value pair databases - but ultimately, the choice remains with you which paradigm is a better fit for your application.
Use a key/value database when you are using a key/value cache and you don't need a sql database.
When you use memcached/mysql or similar, you need to write two sets of data access code - one for getting objects from the cache, and another from the database. If the cache is your database, you only need the one method, and it is usually simpler code.
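The "two sets of data access code" point can be sketched in plain Python, with dictionaries standing in for memcached and the database. Both class names and the db_reads counter are hypothetical and exist only to make the difference visible.

```python
class CacheAsideStore:
    """Cache in front of a database: reads and writes must handle both layers."""

    def __init__(self, database):
        self.cache = {}           # stands in for memcached
        self.database = database  # stands in for the SQL database
        self.db_reads = 0

    def get(self, key):
        if key in self.cache:             # path 1: cache hit
            return self.cache[key]
        self.db_reads += 1                # path 2: fall back to the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value       # populate the cache for next time
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)         # invalidate the stale cache entry


class KeyValueStore:
    """A persistent key/value store used directly: one code path per operation."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


layered = CacheAsideStore({"user:1": "alice"})
first = layered.get("user:1")   # goes to the database
second = layered.get("user:1")  # served from the cache

single = KeyValueStore()
single.put("user:1", "alice")
print(first, second, layered.db_reads, single.get("user:1"))
```

The layered version also needs an invalidation rule on every write, which is where much of the extra complexity tends to come from.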
You do lose some functionality by not using SQL, but in a lot of cases you don't need it. Only the worst applications actually leave constraint checking to the database. Ad-hoc queries become impractical at scale. The occasional lost or inconsistent record simply doesn't matter if you are working with tweets rather than financial data. How do you justify the added complexity of using a SQL database?

Serializing objects as BLOBs in Oracle

I have a HashMap that I am serializing and deserializing to an Oracle db, in a BLOB data type field.
I want to perform a query, using this field.
Example, the application will make a new HashMap, and have some key-value pairs.
I want to query the db to see if a HashMap with this data already exists in the db.
I do not know how to do this. It seems strange if I have to go to every record in the db, deserialize it, then compare. Does SQL handle comparing BLOBs, so I could have ... select * from PROCESSES where foo = ? ... where foo is a BLOB type and the ? is an instance of the new HashMap?
Thanks
Here's an article for you to read: Pounding a Nail: Old Shoe or Glass Bottle
I haven't heard much about your application's underlying architecture, but I can tell you immediately that there is never a reason why you should need to use a HashMap in this way. It's a bad technique, plain and simple.
The answer to your question is not a clever Oracle query, it's a redesign of your application's architecture.
For a start, you should not serialize a HashMap to a database (more generally, you shouldn't serialize anything that you need to query against). It's much easier to create a table to represent hashmaps in your application as follows:
HashMaps
--------
MapID (pk int)
Key (pk varchar)
Value
Once you have the content of your hashmaps in your database, it's trivial to query the database to see if the data already exists or produce any other kind of aggregate data:
SELECT Count(*) FROM HashMaps where MapID = ? AND Key = ?
Storing serialized objects in a database is almost always a bad idea, unless you know ahead of time that you don't need to query against them.
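The query above checks a single key. To check whether a whole map with exactly the same pairs already exists, one possible approach is to load the candidate pairs into a temporary table and compare counts per map. The sketch below uses Python's sqlite3 module instead of Oracle so that it runs standalone; the table and column names (hash_maps, map_id, map_key, map_value, candidate) are renamed from the schema above to avoid reserved words, and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hash_maps (
        map_id    INTEGER NOT NULL,
        map_key   TEXT NOT NULL,
        map_value TEXT NOT NULL,
        PRIMARY KEY (map_id, map_key)
    );
    CREATE TEMP TABLE candidate (
        map_key   TEXT PRIMARY KEY,
        map_value TEXT NOT NULL
    );
""")

def store_map(map_id, mapping):
    """Store one map as one row per key/value pair."""
    conn.executemany(
        "INSERT INTO hash_maps VALUES (?, ?, ?)",
        [(map_id, k, v) for k, v in mapping.items()],
    )

def find_identical_maps(mapping):
    """Return ids of stored maps whose key/value pairs equal `mapping` exactly."""
    conn.execute("DELETE FROM candidate")
    conn.executemany("INSERT INTO candidate VALUES (?, ?)", list(mapping.items()))
    n = len(mapping)
    rows = conn.execute(
        """
        SELECT h.map_id
        FROM hash_maps AS h
        LEFT JOIN candidate AS c
               ON c.map_key = h.map_key AND c.map_value = h.map_value
        GROUP BY h.map_id
        -- same number of pairs, and every stored pair has a matching candidate pair
        HAVING COUNT(*) = ? AND COUNT(c.map_key) = ?
        ORDER BY h.map_id
        """,
        (n, n),
    )
    return [row[0] for row in rows]

store_map(1, {"a": "1", "b": "2"})
store_map(2, {"a": "1", "b": "2", "c": "3"})
store_map(3, {"a": "1", "b": "9"})
found = find_identical_maps({"b": "2", "a": "1"})
print(found)
```

Because (map_id, map_key) is the primary key, each stored row can match at most one candidate row, so equal counts imply the two maps are identical.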
How are you serializing the HashMap? There are lots of ways to serialize data and an object like a HashMap. Comparing two maps, especially in serialized form, is not trivial, unless your serialization technique guarantees that two equivalent maps always serialize the same way.
One way you can get around this mess is to use XML serialization for some objects that rarely need to be queried. For example, where I work we have a log table where a certain log message is stored as an XML file in a CLOB field. This xml data represents a serialized Java object. Normally we query against other columns in the record, and only read/write the blob in single atomic steps. However once or twice it was necessary to do some deep inspection of the blob, and using XML allowed this to happen (Oracle supports querying XML in varchar2 or CLOB fields as well as native XML objects). It's a useful technique if used sparingly.
Look into dbms_crypto.hash to make a hash of your blob. Store the hash alongside the blob and it will give you something to narrow down the search to something manageable. I'm not recommending storing the hash map, but this is a general technique for searching for an exact match between blobs.
See also SQL - How do you compare a CLOB
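The hash idea only works if equal maps always produce the same bytes, which ties in with the earlier point about serialization order. Here is a minimal sketch of that principle in Python, using sorted-key JSON and SHA-256 as assumed stand-ins for whatever serialization and hash function the real application would use (the answer itself suggests dbms_crypto.hash on the Oracle side). The function name map_fingerprint is hypothetical, and the sketch assumes string keys and JSON-serializable values.

```python
import hashlib
import json

def map_fingerprint(mapping):
    """Hash a canonical serialization so that equal maps give equal digests."""
    # sort_keys makes the output independent of insertion order;
    # fixed separators avoid whitespace differences.
    canonical = json.dumps(mapping, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"host": "db1", "port": "1521"}
b = {"port": "1521", "host": "db1"}  # same pairs, different insertion order
c = {"host": "db2", "port": "1521"}

print(map_fingerprint(a) == map_fingerprint(b))
print(map_fingerprint(a) == map_fingerprint(c))
```

The digest would be stored in an indexed column next to the blob; a matching digest narrows the search, and the full content can still be compared afterwards to rule out a collision.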
I cannot disagree, but I'm being told to do so.
I appreciate your solution, and that's sort of what I had previously.
Thanks
I haven't had the need to compare BLOBs, but it appears that it's supported through the dbms_lob package.
See dbms_lob.compare() at http://www.psoug.org/reference/dbms_lob.html
Oracle lets you define new data types with Java (or .NET on Windows), so you could define a data type for your serialized object and define how queries work on it.
Good luck if you try this...
If you serialize your data to xml, and store the data in an xml you can then use xpaths within your sql query. (Sorry as I am more of a SqlServer person, I don’t know the details of how to do this in Oracle.)
If you EVER need to update only part of the serialized data, don’t do this.
Likewise, if any of the data is pointed to by other data or points to other data, don’t do this.

Resources