Is there any way in Redis to put keys without any common regex pattern in the same hash slot? - caching

I have the following data model for which I want to use redis as cache.
Employee: With a unique Employee_Id.
Department: With a unique Department_Id.
An employee can be part of only one department; a department can have many employees. The system should support the following operations:
Given an Employee_Id, find the department it is part of.
Given a Department_Id, find the list of all its employees.
Merge two departments: the employees of one department move to the other, chosen so as to minimize the number of DB operations.
I'm using DynamoDB as the persistent storage with two tables representing Employee and Department. I'm performing merge operations using dynamodb transactions, to ensure ACID.
Now, I'm planning to use Redis as a cache between the service and the DB. For each Employee_Id as key, I'll store the department it is part of. For each Department_Id as key, I'll store the list of members of the department. For the merge use case I'll have to update the values of a number of employee -> department mappings. For this I want to use Redis transactions or multi-key operations such as MSET and MGET.
For transactions in Redis, we need to ensure that all keys are in the same hash slot. However, in our case the Employee_Id keys are randomly generated UUIDs, so they have no common pattern to use as a hash tag. But the value they point to, i.e. the Department_Id, will be common to them.
Is there any way in Redis to put keys (Employee_Id) without any common regex pattern in the same hash slot?
I'll put all such entries (those on which I might want to run a transaction in future) into Redis at the same time, so I was thinking of appending a random string as a hash tag (between '{' and '}') to the keys. But when getting the value for a key I won't know the random string that was added; I need to fetch values based on the original keys only.
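For reference, Redis Cluster assigns a key to slot CRC16(key) mod 16384, and when the key contains a non-empty section between the first '{' and the next '}', only that section is hashed. The following is a minimal Python sketch of that rule, written from the published description rather than taken from the server. The key layout emp:{department}:employee and the sample IDs are assumptions made up for illustration.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hashed_part(key: str) -> str:
    """Return the hash tag if the key has a non-empty one, else the whole key."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:  # a '}' exists and the tag is not empty
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    return crc16_xmodem(hashed_part(key).encode("utf-8")) % 16384


# Hypothetical keys: the department ID is used as the hash tag.
dept = "dept-42"
k1 = "emp:{" + dept + "}:0b7e6a1c"
k2 = "emp:{" + dept + "}:f3c9d2aa"
same_slot = key_slot(k1) == key_slot(k2)  # only "dept-42" is hashed for both keys
```

A tag derived from the department does co-locate the employee keys, but it also shows the difficulty raised in the question: whoever reads the key has to know the tag in order to rebuild the full key name.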

Related

Database: Storing multiple Types in single table or multiple intermediate tables for Delta Tables

Using Java and Oracle.
We need to update changes in Email, UserID of employee to third party.
Actual table is Employee and intermediate table we keep which we will use for comparison of changes before sending to third party.
Following are database designs coming in mind for intermediate table:
Only Single table:
EmployeeiD|Value|Type|UpdateDate
Value is the userid or the email; Type will be 'email' or 'userid'. The update date is kept so we can work out which of the two (email or userid) changed and send that change to the third party.
Multiple Table:
Employee_EmailID
EmpId|EmailID|Updatedate
Employee_UserID
EmpId|UserID|Updatedate
Java flow will be:
Pick employee from actual table.
Pick employee from above intermediate table.
Compare differences. Update difference to third party.
Update above table with updated value and last update date.
Which one is considered the best way: the single-table approach, the multiple-table approach, or is there a standard way to implement this? There are 10,000 employees in the system.
The intermediate table just stores delta records, i.e. the records transferred to the third party, so that they can be compared the next day.
Good database design has separate tables for different concepts. Using the same database column to hold different types of data will lead to code which is harder to understand, prone to data corruption and less performant.
You may think it's only two tables and a few tens of thousands of rows, so does it matter? But that is only your current requirement. What you choose now will set the template for what happens when (say) you need to add telephone numbers to the process.
"Now in future if we get 5 more entities to update"
Do you mean "entities", like say Customers rather than Employees? Or do you really mean "attributes" as in my example of Employee Telephone Number?
Generally speaking we have a separate table for distinct entities, and all the attributes of that entity are grouped at the same cardinality. To take your example, I would expect an Employee to have one UserID and one Email Address so I would design the table like this:
Employee_audit
EmpId|UserID|EmailID|Updatedate
That is, I have one record which stores the complete state of the Employee record at the Updatedate.
If we add a new entity, say Customers, then we have a new table. Simple. But a new attribute like Employee Phone Number offers a choice, because an employee can have more than one: work landline, mobile, fax, home, etc. So we could represent this in three ways: a child table with a type column, multiple child tables (one for each type), or distinct columns on the Employee record.
For the main Employee table I would choose the separate table (or tables, depending on whether I'm shooting for 6NF). But for an audit table I would choose one record per Employee and pivot the phone numbers like this:
Employee_audit
EmpId|UserID|EmailID|Landline|Mobile|Fax|Home|Updatedate
The one thing I would never do is have a single table with type and value columns. It seems attractive because it means we could track additional entities without any further DDL. But in fact it becomes harder to re-assemble the complete state of an Employee at any given time with each attribute we add. Also it means the auditing process itself is more complicated (because it needs to determine which attributes have changed and whether it needs to audit the change) and more expensive (because changing three attributes on the same record entails inserting three audit records).
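As a rough illustration of why one audit row per employee keeps the comparison simple, here is a minimal Python sketch of the "compare, send, update" step from the question. The dictionary shapes and the names changed_attributes and next_snapshot are assumptions for this example, not part of any existing code.

```python
from datetime import date
from typing import Optional

# Attributes tracked in the audit row (hypothetical column names).
AUDITED = ("user_id", "email_id")


def changed_attributes(current: dict, snapshot: Optional[dict]) -> dict:
    """Return {attribute: new_value} for audited attributes that differ from the snapshot."""
    if snapshot is None:  # employee has never been sent to the third party
        return {name: current[name] for name in AUDITED}
    return {
        name: current[name]
        for name in AUDITED
        if current[name] != snapshot.get(name)
    }


def next_snapshot(current: dict, today: date) -> dict:
    """Build the single audit row that replaces the previous one for this employee."""
    row = {"emp_id": current["emp_id"], "update_date": today}
    row.update({name: current[name] for name in AUDITED})
    return row


employee = {"emp_id": 7, "user_id": "jdoe", "email_id": "jdoe@new.example"}
audit_row = {"emp_id": 7, "user_id": "jdoe", "email_id": "jdoe@old.example",
             "update_date": date(2020, 1, 1)}

delta = changed_attributes(employee, audit_row)      # what to send to the third party
audit_row = next_snapshot(employee, date(2020, 1, 2))  # one row written per employee
```

With a type/value table the same step would have to read and write one row per attribute, which is the extra cost described above.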

DynamoDB Throughput vs Search time

I've just figured out a big mistake I had while creating the dynamodb structure.
I've created 11 tables, one of which is the table most often referred to, while the others are complementary tables.
For example, I have a table called "Names" where I hold names (together with other info), and another table called "NamesMappings" holding all the names added to the "Names" table. Each time a user wants to add a name to "Names", he first tries to put the name into "NamesMappings", and only if that succeeds (meaning the name doesn't exist yet) can he add the name to "Names". This procedure helps when the name is not the primary key of the "Names" table and so is not forced to be unique: I don't have to search "Names" to see whether the name exists; instead I try to add it to "NamesMappings", and only if that succeeds do I know the name is unique.
First of all, I would like to ask you if this is a common approach or there is a better one?
Next, I figured out that with this design I soon reached 11 tables, each with a provisioned read and write capacity of 5, which adds up to 55 units each of provisioned read and write capacity on a free-tier account. Then I understood why I am charged every month: as the number of tables grows and I leave the provisioned capacity at the default (5 for both read and write), the total provisioned capacity keeps growing.
So, what should I conclude from this? Should I try to reduce the number of tables, even if it takes more effort to perform scans and queries inside a table? Or should I keep splitting the tables as I do now, but reduce the capacity of the mapping tables that are used only to indicate whether an item exists in another table?
If I understand your problem correctly you're missing the whole concept of NoSQL Databases.
Your Names table should have a Hash key (which is similar to a Primary key) that has a uniformly generated identifier (a UUID is a great candidate). This would automatically make the table queryable by this unique identifier. You said, however, that you don't know the ID but only know the Name. This leads me to think you could create a Global Secondary Index (GSI) on the Name attribute inside the Names table so you can also query by Name. Up to this point, your table structure should look like this:
id | name
Both of them are independently queryable, which gives you a lot of flexibility already.
Now, let's say you want to add the NameMapping attribute (I don't know what it looks like): you can simply add it to the Names table and get rid of the NamesMappings table, greatly reducing the number of WCUs and RCUs across your account. Your table structure should now look like this:
id | name | mappings
where mappings is, let's say, a JSON object.
Since you can only query on top level attributes in DynamoDB, you can now perform a query against the name attribute which has a GSI configured. If the query returns nothing, then name is unique. But let's say you still need some data inside the mappings object, then you could query by name and, in your code, you could apply a map/filter/reduce operation on the mappings attribute and decide what to do next.
Remember that duplication is perfectly OK in a NoSQL world. This may look scary if you come from a purely SQL background, but in NoSQL databases data should be stored so that you can fetch all the needed information in one go, therefore avoiding "joins" (joins are still possible in a NoSQL database, but since there are no strong relationships between entities, you need to perform them manually at the code level). To give you some real context, imagine you have an Orders table where you keep track of the ordered Products and the Store that the Order belongs to: you'd save both the Product and the Store objects (and not their IDs, as you would in the SQL way) inside the Order object, so if you want to query for a given OrderId in the future, you don't need to make extra calls (aka "joins") to the Product/Store tables to fetch the information, since everything is already stored inside the Order object.
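The Orders example can be sketched in plain Python, with dictionaries standing in for the tables. The field names and sample records are assumptions made up for illustration; the point is only that the order carries copies of the product and store objects rather than their IDs.

```python
import copy

# Hypothetical source records.
products = {
    "p1": {"id": "p1", "title": "Pen", "price": 2},
    "p2": {"id": "p2", "title": "Pad", "price": 5},
}
stores = {"s1": {"id": "s1", "city": "Lisbon"}}


def build_order(order_id, product_ids, store_id):
    """Embed copies of the product and store objects instead of their IDs."""
    return {
        "orderId": order_id,
        "products": [copy.deepcopy(products[pid]) for pid in product_ids],
        "store": copy.deepcopy(stores[store_id]),
    }


orders = {"o1": build_order("o1", ["p1", "p2"], "s1")}


def order_total(order_id):
    # One lookup by OrderId is enough; no follow-up reads of products or stores.
    order = orders[order_id]
    return sum(product["price"] for product in order["products"])
```

The trade-off is that the embedded copies do not follow later changes to the source records, which is usually what you want for an order that has already been placed.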

Storing items in a map or in rows in Cassandra

I need to store users lists by customer in cassandra. There are two basic approaches I see:
A: create table users ( // one row per user
customer int, userId int, primary key (customer, userId),
login text, name text, email text
);
or
B: create table users ( // one row per customer
customer int primary key, users map<int, text>
);
where in the second approach I would store a JSON representation of the user data as "text".
I will have the following operations on the table:
insert / update / delete single user
read all users for a customer
read a single user by id and customer
Here are the questions:
1) For large users lists, B is a bad idea. What order of magnitude would "large" be?
2) Would you expect B to have better performance for small users lists? What order of magnitude would "small" be?
3) What other advantages / disadvantages do you see for A or B?
(For those who need to know: I'm using scala / datastax driver / phantom to access the database.)
I would stick with A, definitely.
Collections can have at most 64k queryable elements, so that's your hard limit. And C* reads the whole collection during queries, so you want to keep collections as small as possible to avoid huge read penalties.
I expect the performance to be of the same order of magnitude, because both are sequential reads.
In B you will use non-idempotent queries to update the collection. (My mistake: it's a map, not a list.)
A makes it very easy to update your schema. In B you'd need to read-modify-write your records.
Stick with A.
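To make the read-modify-write point concrete, here is a small in-memory Python model of the two layouts. It does not talk to Cassandra; the dictionaries and sample users are assumptions that only mirror the shape of tables A and B from the question.

```python
import json

# Layout A: one row per user, keyed by (customer, userId).
table_a = {
    (1, 10): {"login": "ann", "name": "Ann", "email": "ann@example.com"},
    (1, 11): {"login": "bob", "name": "Bob", "email": "bob@example.com"},
}

# Layout B: one row per customer; each map value is an opaque JSON string.
table_b = {
    1: {
        10: json.dumps({"login": "ann", "name": "Ann", "email": "ann@example.com"}),
        11: json.dumps({"login": "bob", "name": "Bob", "email": "bob@example.com"}),
    }
}


def set_email_a(customer, user_id, email):
    # Touches one column of one row; nothing has to be read first.
    table_a[(customer, user_id)]["email"] = email


def set_email_b(customer, user_id, email):
    # The store only sees a string, so the client must read the JSON,
    # change it, and write the whole value back.
    user = json.loads(table_b[customer][user_id])
    user["email"] = email
    table_b[customer][user_id] = json.dumps(user)


def users_for_customer_a(customer):
    # Equivalent of reading the whole partition for one customer.
    return {uid: row for (cust, uid), row in table_a.items() if cust == customer}


set_email_a(1, 10, "ann@new.example")
set_email_b(1, 10, "ann@new.example")
```

In B, adding a field to every user would mean rewriting every JSON value in the same way, whereas in A it is a schema change on the table.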

typed data set; parent/child select and update with ONE trip to the database (for each op)?

Is it possible, using an ADO.NET typed DataSet containing two tables in a parent/child relationship, to populate the DataSet with ONE trip to the d/b (query could return one or two tables; if one, then result set has columns from both tables, right?), and to update the d/b with ONE trip to the d/b (call to generated stored proc, I guess).
By "is it possible", I mean is it possible to have Visual Studio (2012) automagically generate the classes and SQL code to make this happen?
Or am I kind of on my own? It's looking an awful lot like VS really wants to generate one d/b server round trip for each table involved.
*I guess the update stored proc would have to take table-typed parameters from both parent and child, and perform inserts/updates/deletes appropriately.
Yes, one round trip per table is the way to go.
(It's certainly possible to use a join query to populate a DataTable, but VS will then be reluctant to generate the UPDATE etc. SQL. This may or may not be a problem, depending on what you intend to do with the dataset.)
But if you have two tables in a dataset, let's say customers and orders, then you would typically use two queries, and two trips to the db:
SELECT * FROM customers WHERE customers.customerid=#customerid
and
SELECT * FROM orders WHERE orders.customerid=#customerid
Somewhat more counter-intuitive is the situation where you want all customers and orders for one country:
SELECT * FROM customers WHERE customers.countryid=#countryid
and
SELECT orders.* FROM orders INNER JOIN customers ON customers.customerid=orders.customerid WHERE customers.countryid=#countryid
Note how the join query returns data from only one table, but uses the join to identify which rows to return.
Then, once you have the data in your dataset, you can navigate it using the GetParentRow and GetChildRows methods. This is how ADO.NET manages hierarchical data.
You do need this one-table-at-a-time approach because, assuming you have foreign key constraints in your db, inserts and updates must be applied in the reverse order from deletes: parent rows before child rows on insert, child rows before parent rows on delete.
EDIT Yes, this does mean that in some circumstances, depending on the data you want and the structure of your primary keys, you could end up with a humongous set of JOINs that still only pull the data from the table at the end of the hierarchy. This might seem wrong in terms of traditional SQL, but actually it's fine. The time you lose on the multiple, more complex queries is recovered through the reduced amount of data you have to pull back across the wire, compared with one big join query that would return multiple copies of the parent data.
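The two-query pattern for "all customers and orders for one country" can be tried out with a small Python sketch. SQLite (from the standard library) stands in for the real database, the schema and sample rows are invented for the example, and "?" placeholders take the place of the #countryid parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customerid INTEGER PRIMARY KEY,
        countryid  INTEGER,
        name       TEXT
    );
    CREATE TABLE orders (
        orderid    INTEGER PRIMARY KEY,
        customerid INTEGER REFERENCES customers(customerid),
        item       TEXT
    );
    INSERT INTO customers VALUES (1, 44, 'Ann'), (2, 44, 'Bob'), (3, 33, 'Cid');
    INSERT INTO orders VALUES (100, 1, 'pen'), (101, 1, 'ink'),
                              (102, 2, 'pad'), (103, 3, 'cap');
""")


def load_country(countryid):
    """Fill the parent and child sets with one query each."""
    customers = conn.execute(
        "SELECT * FROM customers WHERE countryid = ? ORDER BY customerid",
        (countryid,),
    ).fetchall()
    # The join is only used to decide which order rows qualify;
    # the result contains columns from orders alone.
    orders = conn.execute(
        "SELECT orders.* FROM orders "
        "INNER JOIN customers ON customers.customerid = orders.customerid "
        "WHERE customers.countryid = ? ORDER BY orders.orderid",
        (countryid,),
    ).fetchall()
    return customers, orders


customers, orders = load_country(44)
```

Each parent row comes back once, and each order row carries only its own three columns, which is the saving over a single wide join described in the EDIT above.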

SQL Azure and Membership Provider Tenant ID

What might be a good way to introduce a BIGINT into the ASP.NET Membership functionality to reference users uniquely, and to use that BIGINT field as a tenant_id? It would be perfect to keep the existing functionality that generates UserIds in the form of GUIDs and not to implement a membership provider from scratch. Since the application will be running on multiple servers, the BIGINT tenant_id must be unique, and it should not depend on some central authority generating these IDs. It will be easy to use these tenant_ids with a SPLIT AT command down the road, which will allow bucketing users into new federation members. Any thoughts on this?
Thanks
You can use bigint, but you may have to modify all stored procedures that rely on the user ID. Making the ID globally unique is usually not a problem. As long as the ID is the primary key, the database will force it to be unique; otherwise you will get errors when inserting new data (in that case, you can change the ID and retry).
So the most important difference is that you may need to modify stored procedures. You have a choice here. If you use a GUID, you don't need to do anything. But it may be difficult to predict how to split the federation to balance queries. As pointed out in another thread (http://stackoverflow.com/questions/10885768/sql-azure-split-on-uniqueidentifier-guid/10890552#comment14211028_10890552), you can split existing data at the midpoint. But you don't know which federation future data will be inserted into. There's a potential risk that federations will become unbalanced, and you may need to merge and split them at regular intervals to keep them in shape.
By using bigint, you have better control over the key. For example, say you have two federations. The first has IDs from 1 to 10000, and the second has IDs from 10001 to 20000. When creating a new user, you first check how many records are in each federation. Suppose federation 1 has 500 records and federation 2 has 1000 records: to balance the load, you choose to insert into federation 1, so you choose an ID between 1 and 10000. But with bigint, you may need to do more work to modify stored procedures.
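The balancing rule in the last paragraph could be sketched as follows. This is only an illustration in Python: the federation list, the row counts and the function name pick_tenant_id are assumptions, and in a real system the counts and the uniqueness check would come from the database (with a retry on a primary key violation, as mentioned above).

```python
import random

# Hypothetical federation layout: inclusive ID range and current row count.
federations = [
    {"name": "fed1", "low": 1, "high": 10000, "rows": 500},
    {"name": "fed2", "low": 10001, "high": 20000, "rows": 1000},
]


def pick_tenant_id(federations, used_ids, rng=random, max_attempts=100):
    """Pick an unused ID inside the range of the least-loaded federation."""
    target = min(federations, key=lambda fed: fed["rows"])
    for _ in range(max_attempts):
        candidate = rng.randint(target["low"], target["high"])
        if candidate not in used_ids:  # stands in for the primary key check
            return target["name"], candidate
    raise RuntimeError("no free ID found in " + target["name"])


name, tenant_id = pick_tenant_id(federations, used_ids={1, 2, 3})
```

Here the new user lands in fed1, the federation with fewer records, and gets an ID between 1 and 10000.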
