DataMapper, Ruby: maintaining ID consistency?

I'm writing a simple application using DataMapper. It is somewhat crucial that I maintain consistent IDs (serial property) in my database (which may change freely), so I wrote this simple script that goes through every record and fixes the IDs so that they stay consistent (e.g. 1, 2, 3...).
The problem is, every time I add a new record, it's added with a new ID that breaks the consistency - as if the previous records weren't fixed.
How can I prevent this behavior? Or rather, is there an easier way to maintain a logical progression of IDs? I have a distinct feeling I'm not supposed to alter the IDs by hand.

DataMapper usually creates sequential IDs, but this sequence can differ from your "logical order".
Examples:
you create the strip-objects in a different order than the one you want them sorted in
you create provisional strip-objects (prototypes) and delete them again
..
I don't think you should try to force DataMapper to use your IDs. Instead, I recommend an extra field such as "nekkoru_number", which you can calculate however you like. In your case, using a unique name instead of a number may be a good idea too.
Also think of use cases like
inserting an object later
reordering the objects
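The idea of a separate ordering field can be sketched roughly as follows. This is a language-neutral illustration in Python rather than DataMapper code; the Record class, the position field and both helper functions are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Record:
    id: int        # database-assigned key; never rewritten
    name: str
    position: int  # hypothetical "logical order" field, recomputed freely


def renumber(records):
    """Reassign positions 1, 2, 3... in current logical order, leaving ids alone."""
    ordered = sorted(records, key=lambda r: r.position)
    for new_position, record in enumerate(ordered, start=1):
        record.position = new_position
    return records


def insert_at(records, record, position):
    """Insert a record at a logical position, shifting later records down."""
    for other in records:
        if other.position >= position:
            other.position += 1
    record.position = position
    records.append(record)
    return renumber(records)
```

Deleting a record and calling renumber closes the gap in the positions, while the serial IDs keep whatever values the database handed out.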

Related

What would be the most appropriate data structure given these requirements?

We are building a Search API in our company for some of our entities (events, leagues and sports), each of which has a name property, and we are having difficulty implementing the business requirements.
TL;DR: Which data structure would address these business requirements better than a basic red-black tree does?
What are the business requirements?
(1) The data structure needs to be sorted, so that the following requirements are easier to implement; insertion should therefore not break this property.
(2) The data structure needs to hold information about its entities: the node key (the entity's name property) will be used for searching, but the node needs to hold all the entities whose name property starts with the node key value.
(3) The data structure needs to support deletion by id. Id is also a property of all entities.
(4) It needs to support index search (up to 3 characters) so if someone searches for "aaa" every node with key between "aaaa.." and "aaaz" should appear. (ex. query = "aaa", index = "aaa", "aaab", "aaaab", "aaaz", result should be "aaa", "aaab", "aaaab").
(5) We need to search by localized node key.
What have we done so far?
We started our first iteration using the built-in red-black tree (SortedSet in C#), with nodes that hold the name property of the entity and all events related to that name. With one helper method we satisfied business requirements (1), (2) and (4).
In our second iteration we had to support deletion, so we created a map (Dictionary) from entity ids to references to the entity objects stored in the SortedSet. We do that because a deletion request carries only the id and we cannot recreate the entity from the id, so we need to build this map at insertion time. (Maybe augmentation can help?) With this we covered requirement (3).
Now we need to support (5). However, with every iteration (every business requirement we receive) it is getting harder and harder to implement, and I feel we may need to change our data structure to address the business criteria better.
What's the problem with localization?
We can create a new SortedSet and reuse the implementation, but this comes with a huge trade-off. Let me elaborate.
We have hundreds of clients, each of which supports around 7-8 languages. Languages in our system are unique per client, so translations for one customer do not interfere with another's (if someone wants to call it Soccer rather than Football, fine, let it be). Besides that, we have base languages (global for every client), which are basically the default settings for newly created languages, so we can safely say that a very large portion of a client-specific language (let's say English) is the same as the base one. Having said all that, if we want accurate search for each client and locale individually, we need an index for each client and locale individually, which introduces massive amounts of duplication.
What have I thought of so far?
I am not an expert in data structures myself, but I really want to get this right. Of course everything is possible with enough coding and hardware, but that's not the point.
I thought about implementing some binary tree (could be AVL, red-black, 2-3-4, etc.) and augmenting it to meet the requirements better than the built-in SortedSet does. This would hopefully remove a lot of the issues and workarounds we have had so far and, as I said, address future requirements better, so that implementation is faster and more accurate. However, I am unable to map these business requirements to a data structure in the time frame I have. So, without further ado, do you have any suggestions?
My suggestion here would be for your primary data structure to be a dictionary, keyed by product id, and the value is the product data. That gives you very quick insertion, and removal by product id.
For searching, provide a separate data structure that contains the product names and associated product ids.
class IndexEntry
{
    public string ProductName;
    public string ProductId; // or int, if ProductId is an integer
}
Since you allow customer-specific names, you'll have to add all those customer names to this index. Not a problem, but when you remove something by ID, you'll also have to remove the associated items from the other data structure. This will require a sequential search of the name index data structure to ensure that you get all the names associated with a particular product. That could be expensive, even if you use a tree structure.
To speed things up, you could have a "deleted" flag for those index entries, and then rebuild the structure periodically to remove the deleted items. That way, a deletion requires only a sequential scan to flag the entries, not a restructuring of the index. That's less than ideal, but if insertions and deletions are infrequent, quite acceptable.
The key, though, is to make your primary data structure that holds the product information indexed by product id. You can then build secondary indexes any way you want.
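A rough sketch of that layout, in Python rather than C# and with hypothetical names throughout (SearchIndex, by_id, names). It is a variation on the above, not the exact scheme: the primary dictionary also remembers each entity's names, so removal can locate the index entries by binary search instead of scanning.

```python
import bisect


class SearchIndex:
    def __init__(self):
        self.by_id = {}   # primary structure: entity id -> list of names
        self.names = []   # secondary index: sorted (name, entity id) pairs

    def add(self, entity_id, names):
        self.by_id[entity_id] = list(names)
        for name in names:
            bisect.insort(self.names, (name, entity_id))

    def remove(self, entity_id):
        # The stored names tell us exactly which index entries to delete.
        for name in self.by_id.pop(entity_id):
            position = bisect.bisect_left(self.names, (name, entity_id))
            del self.names[position]

    def search(self, prefix):
        # (prefix,) sorts before every (name, id) pair whose name starts with prefix.
        start = bisect.bisect_left(self.names, (prefix,))
        matches = []
        for name, entity_id in self.names[start:]:
            if not name.startswith(prefix):
                break
            matches.append((name, entity_id))
        return matches
```

Customer-specific or localized names would simply be extra entries in the names list for the same id; whether one shared index or one index per locale is the better fit depends on how much the translations overlap.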

Is Laravel's 'pluck' method cheaper than a general 'get'?

I'm trying to dramatically cut down on pricey DB queries for an app I'm building, and thought I should perhaps just return IDs of a child collection (then find the related object from my React state), rather than returning the children themselves.
I suppose I'm asking, if I use 'pluck' to just return child IDs, is that more efficient than a general 'get', or would I be wasting my time with that?
Yes, the pluck method is just fine if you are trying to retrieve a single column from a table.
If you use the get() method, it will retrieve all the information about the child model, which can make querying and fetching the results a little slower.
So in my opinion, you are using a good method for retrieving the result.
Laravel also has other methods for select queries; have a look at Selects.
Good practice when performing a DB select query in an application is to select only the columns that are necessary. If only the id column is needed, select the id column instead of all columns; otherwise, memory is wasted holding unused data. To be clear, pluck and get give the same result here:
Model::pluck('id');
// which is the same as
Model::select('id')->get()->pluck('id');
// which is the same as
Model::get(['id'])->pluck('id');
I know I'm a little late to the party, but I was wondering this myself and decided to research it. It turns out that one method is faster than the other.
Using Model::select('id')->get() is faster than Model::get()->pluck('id').
This is because Illuminate\Support\Collection::pluck iterates over every returned Model and extracts the selected column(s) in a PHP foreach loop, after all columns have already been fetched, whereas the first method restricts the columns in the database query itself, which is generally cheaper.
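The same difference can be shown outside Laravel. The sketch below uses Python's built-in sqlite3 module with a made-up children table; it only illustrates the narrow-versus-wide query idea and says nothing about Eloquent internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE children (id INTEGER PRIMARY KEY, name TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO children (name, payload) VALUES (?, ?)",
    [(f"child-{i}", "x" * 1000) for i in range(100)],
)

# Narrow query: the database returns only the id column.
ids_narrow = [row[0] for row in conn.execute("SELECT id FROM children ORDER BY id")]

# Wide query: every column is transferred, then the ids are picked out in application code.
rows = conn.execute("SELECT * FROM children ORDER BY id").fetchall()
ids_wide = [row[0] for row in rows]
```

Both lists hold the same ids, but the wide query also hauled the name and payload columns across for every row just to throw them away.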

Hbase Schema Nested Entity

Does anyone have an example on how to create an Hbase table with a nested entity?
Example
UserName (string)
SSN (string)
+ Books (collection)
The books collection would look like this for example
Books
isbn
title
etc...
I cannot find a single example of how to create a table like this. I see many people talk about it, and how it is a best practice in certain scenarios, but I cannot find an example of how to do it anywhere.
Thanks...
Nested entities aren't an official feature of HBase; the term just describes one usage pattern. In this pattern, you use the fact that "columns" in HBase are really just a big map (a bunch of key/value pairs) to model a dimension of cardinality inside the row, by adding one column per "row" of the nested entity.
Schema-wise, you don't need to do much on the table itself; when you create a table in HBase, you just specify the name & column family (and associated properties), like so (in hbase shell):
hbase:001:0> create 'UsersWithBooks', 'cf1'
Then, it's up to you what you put in it, column wise. You could insert values like:
hbase:002:0> put 'UsersWithBooks', 'userid1234', 'cf1:username', 'my username'
hbase:003:0> put 'UsersWithBooks', 'userid1234', 'cf1:ssn', 'my ssn'
hbase:004:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_12345', '<isbn>12345</isbn><title>mary had a little lamb</title>'
hbase:005:0> put 'UsersWithBooks', 'userid1234', 'cf1:book_id_67890', '<isbn>67890</isbn><title>the importance of being earnest</title>'
The column names are totally up to you, and there's no limit to how many you can have (within reason: see the HBase Reference Guide for more on this). Of course, doing this, you have to do your own legwork re: putting in and getting out values (and you'd probably do it with the java client in a more sophisticated way than I'm doing with these shell commands, they're just for explanatory purposes). And while you can efficiently scan just a portion of the columns in a table by key (using a column pagination filter), you can't do much with the contents of the cells other than pull them and parse them elsewhere.
Why would you do this? Probably only if you wanted atomicity around all the nested rows for one parent row. It's not very common; your best bet is probably to start by modeling them as separate tables, and only move to this approach if you really understand the tradeoffs.
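The "legwork" of turning a nested entity into columns and back might look something like this. It is a plain-Python sketch with no HBase client involved; to_columns and books_from_columns are hypothetical helpers, and the values are serialized as JSON here (an assumption) instead of the XML-like strings in the shell commands.

```python
import json


def to_columns(user, family="cf1"):
    """Flatten a user and its books into HBase-style column -> value pairs."""
    columns = {
        f"{family}:username": user["username"],
        f"{family}:ssn": user["ssn"],
    }
    for book in user["books"]:
        # One column per nested "row"; the qualifier carries the book's identity.
        columns[f"{family}:book_id_{book['isbn']}"] = json.dumps(book, sort_keys=True)
    return columns


def books_from_columns(columns, family="cf1"):
    """Rebuild the nested book list from the columns of a single row."""
    prefix = f"{family}:book_id_"
    return [json.loads(value) for key, value in sorted(columns.items()) if key.startswith(prefix)]
```

Each key/value pair produced by to_columns corresponds to one put against the same row key, which is what gives the whole set of books row-level atomicity.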
There are some limitations to this. First, this technique only works to one level deep: your nested entities can’t themselves have nested entities. You can still have multiple different nested child entities in a single parent, and the column qualifier is their identifying attributes.
Second, it’s not as efficient to access an individual value stored as a nested column qualifier inside a row, as compared to accessing a row in another table, as you learned earlier in the chapter.
Still, there are compelling cases where this kind of schema design is appropriate. If the only way you get at the child entities is via the parent entity, and you’d like to have transactional protection around all children of a parent, this can be the right way to go.

Random ID generation on Sign Up - Database Performance

I am making a site in which each account will have an ID.
But I don't want the IDs to be incremental, meaning:
id=1
id=2
...
id=1000
What I want is to have random IDs:
id=2355
id=5647734
id=23532
...
(The reason is to prevent robots from checking all account profiles by just incrementing an ID in the URL, and maybe other reasons, but that is not the question.)
But I am worried about performance on registration.
It will be something like this:
while (RANDOM_ID is taken): generate new RANDOM_ID
When generating a new ID for a new account, I will query the database (MySQL) on each attempt to check whether the ID exists.
Is there any better solution for this?
Is there any disadvantage of using random IDs?
Thanks in advance.
There are many, many reasons not to do this:
Your solution, as written, is not transactionally-safe; two transactions at the same time could both generate the same "random" ID.
If you serialize the transaction in order to make it safe, you will slaughter performance because the query will keep every single collision row locked until it finds a spare ID.
Using a random ID as the primary key will fragment the hell out of your clustered index. This is bad enough with uuids - the whole point of an auto-generated identity column is so you can generate a safe sequence out of it.
Why not use a regular primary key, but just don't use that in any of your URLs? Generate a secondary non-sequential ID along with it - such as a uuid - index it, and use this column in any public-facing segments of your application instead of the primary key if you are really worried about security.
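A minimal sketch of that arrangement, using Python's built-in sqlite3 and uuid modules as a stand-in for MySQL. The accounts table, the public_id column and both helper functions are illustrative names only.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- internal key, never shown in URLs
        public_id TEXT NOT NULL UNIQUE,        -- random identifier used in URLs
        email TEXT NOT NULL
    )
    """
)


def create_account(email):
    """Insert an account; the database picks the sequential key, we pick the public one."""
    public_id = uuid.uuid4().hex
    conn.execute("INSERT INTO accounts (public_id, email) VALUES (?, ?)", (public_id, email))
    return public_id


def find_account(public_id):
    """Look an account up by the identifier that appears in URLs."""
    query = "SELECT id, email FROM accounts WHERE public_id = ?"
    return conn.execute(query, (public_id,)).fetchone()
```

The UNIQUE constraint doubles as the index for lookups and as the safety net: in the very unlikely event of a collision the insert fails and can be retried, with no check-then-insert loop.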
You can use UUIDs. A UUID is a unique identifier; some versions are generated partly from a timestamp. Collisions are so unlikely that you don't have to query the database to check.
I do not know what language you're using, but most languages have a library or sample code for this.
Yes, you can use a UUID, but keep your auto_increment field. Just add a new field, set it to something like md5(microtime(true).rand()) or whatever other method you like, and use that unique key across the site to build links instead of exposing the primary key in URLs.

Implementing User Defined Fields

I am creating a laboratory database which analyzes a variety of samples from a variety of locations. Some locations want their own reference number (or other attributes) kept with the sample.
How should I represent the columns which only apply to a subset of my samples?
Option 1:
Create a separate table for each unique set of attributes?
SAMPLE_BOILER: sample_id (FK), tank_number, boiler_temp, lot_number
SAMPLE_ACID: sample_id (FK), vial_number
This option seems too tedious, especially as the system grows.
Option 1a: Class table inheritance: tree with common fields in internal node/table
Option 1b: Concrete table inheritance: tree with common fields in leaf node/table
Option 2: Put every attribute which applies to any sample into the SAMPLE table.
Most columns of each entry would most likely be NULL, however all of the fields are stored together.
Option 3: Create _VALUE_ tables for each Oracle data type used.
This option is far more complex. Getting all of the attributes for a sample requires accessing all of the tables below. However, the system can expand dynamically without separate tables for each new sample type.
SAMPLE:
  sample_id*
  sample_template_id (FK)
SAMPLE_TEMPLATE:
  sample_template_id*
  version*
  status
  date_created
  name
SAMPLE_ATTR_OF:
  sample_template_id* (FK)
  sample_attribute_id* (FK)
SAMPLE_ATTRIBUTE:
  sample_attribute_id*
  name
  description
SAMPLE_NUMBER:
  sample_id* (FK)
  sample_attribute_id (FK)
  value
SAMPLE_DATE:
  sample_id* (FK)
  sample_attribute_id (FK)
  value
Option 4: (Add your own option)
To help with Googling, your third option looks a little like the Entity-Attribute-Value pattern, which has been discussed on StackOverflow before although often critically.
As others have suggested, if at all possible (eg: once the system is up and running, few new attributes will appear), you should use your relational database in a conventional manner with tables as types and columns as attributes - your option 1. The initial setup pain will be worth it later as your database gets to work the way it was designed to.
Another thing to consider: are you tied to Oracle? If not, there are non-relational databases out there like CouchDB that aren't constrained by up-front schemas in the same way as relational databases are.
Edit: you've asked about handling new attributes under option 1 (now 1a and 1b in the question)...
If option 1 is a suitable solution, there are sufficiently few new attributes that the overhead of altering the database schema to accommodate them is acceptable, so you'll be writing database scripts to alter tables and add columns, and the provision of a default value can be handled easily in these scripts.
Of the two 1 options (1a, 1b), my personal preference would be concrete table inheritance (1b):
It's the simplest thing that works;
It requires fewer joins for any given query;
Updates are simpler as you only write to one table (no FK relationship to maintain).
That said, either of these first options is a better solution than the others, and there's nothing wrong with the class table inheritance method if that's what you'd prefer.
It all comes down to how often genuinely new attributes will appear.
If the answer is "rarely" then the occasional schema update can cope.
If the answer is "a lot" then the relational DB model (which has fixed schemas baked-in) isn't the best tool for the job, so solutions that incorporate it (entity-attribute-value, XML columns and so on) will always seem a little laboured.
Good luck, and let us know how you solve this problem - it's a common issue that people run into.
Option 1, except that it's not a separate table for each set of attributes: create a separate table for each sample source.
i.e. from your examples: samples from a boiler will have tank number, boiler temp, lot number; acid samples have vial number.
You say this is tedious, but I suggest that the more work you put into gathering and encoding the meaning of the data now, the bigger the dividends later: you'll save in the long term because your reports will be easier to write, understand and maintain. Those guys from the boiler room will ask "we need to know the total of X for tank grouped by this set of boiler temperature ranges" and you'll say "no prob, give me half an hour" because you've done the hard yards already.
Option 2 would be my fall-back option if Option 1 turns out to be overkill. You'll still want to analyse what fields are needed, what their datatypes and constraints are.
Option 4 is to use a combination of options 1 and 2. You may find some attributes are shared among a lot of sample types, and it might make sense for these attributes to live in the main sample table; whereas other attributes will be very specific to certain sample types.
You should really go with Option 1. Although it is more tedious to create, Options 2 and 3 will come back to bite you when you try to query your data. The queries will become more complex.
In fact, the most important part of storing the data, is querying it. You haven't mentioned how you are planning to use the data, and this is a big factor in the database design.
As far as I can see, the first option will be most easy to query. If you plan on using reporting tools or an ORM, they will prefer it as well, so you are keeping your options open.
In fact, if you find building the tables tedious, try using an ORM from the start. Good ORMs will help you with creating the tables from the get-go.
I would base your decision on how you usually see the data. For instance, if you get 5-6 new attributes per day, you're never going to be able to keep up with adding new columns. In this case you should create columns for 'standard' attributes and add a key/value layout similar to your 'Option 3'.
If you don't expect to see this, I'd go with Option 1 for now, and modify your design to 'Option 3' only if you get to the point that it is turning into too much work. It could end up that you have 25 attributes added in the first few weeks and then nothing for several months. In which case you'll be glad you didn't do the extra work.
As for Option 2, I generally advise against this as Null in a relational database means the value is 'Unknown', not that it 'doesn't apply' to a specific record. Though I have disagreed on this in the past with people I generally respect, so I wouldn't start any wars over it.
Whatever you do, option 3 is horrible: every query will have to join the data to create a SAMPLE.
It sounds like you have some generic SAMPLE fields which need to be joined with more specific data for the type of sample. Have you considered some user-defined fields?
Example:
SAMPLE_BASE: sample_id(PK), version, status, date_create, name, userdata1, userdata2, userdata3
SAMPLE_BOILER: sample_id (FK), tank_number, boiler_temp, lot_number
This might be a dumb question but what do you need to do with the attribute values? If you only need to display the data then just store them in one field, perhaps in XML or some serialised format.
You could always use a template table to define a sample 'type' and the available fields you display for the purposes of a data entry form.
If you need to filter on them, the only efficient model is option 2. As everyone else is saying the entity-attribute-value style of option 3 is somewhat mental and no real fun to work with. I've tried it myself in the past and once implemented I wished I hadn't bothered.
Try to design your database around how your users need to interact with it (and thus how you need to query it), rather than just modelling the data.
If the set of sample attributes was relatively static then the pragmatic solution that would make your life easier in the long run would be option #2 - these are all attributes of a SAMPLE so they should all be in the same table.
Ok - you could put together a nice object hierarchy of base attributes with various extensions but it would be more trouble than it's worth. Keep it simple. You could always put together a few views of subsets of sample attributes.
I would only go for a variant of your option #3 if the list of sample attributes was very dynamic and you needed your users to be able to create their own fields.
In terms of implementing dynamic user-defined fields then you might first like to read through Tom Kyte's comments to this question. Now, Tom can be pretty insistent in his views but I take from his comments that you have to be very sure that you really need the flexibility for your users to add fields on the fly before you go about doing it. If you really need to do it, then don't create a table for each data type - that's going too far - just store everything in a varchar2 in a standard way and flag each attribute with an appropriate data type.
create table sample (
sample_id integer,
name varchar2(120 char),
constraint pk_sample primary key (sample_id)
);
create table attribute (
attribute_id integer,
name varchar2(120 char) not null,
data_type varchar2(30 char) not null,
constraint pk_attribute primary key (attribute_id)
);
create table sample_attribute (
sample_id integer,
attribute_id integer,
value varchar2(4000 char),
constraint pk_sample_attribute primary key (sample_id, attribute_id)
);
Now... that just looks evil doesn't it? Do you really want to go there?
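To make the cost concrete, here is roughly what reading one sample back looks like when every value is a string flagged with a data type. This is a Python sketch over in-memory rows; the attribute names, the type flags NUMBER/DATE/TEXT and the helper functions are all invented for illustration.

```python
from datetime import date
from decimal import Decimal

# Hypothetical converters keyed by the attribute's data_type flag.
PARSERS = {
    "NUMBER": Decimal,
    "DATE": date.fromisoformat,
    "TEXT": str,
}

# attribute_id -> (name, data_type), standing in for the attribute table.
attributes = {
    1: ("boiler_temp", "NUMBER"),
    2: ("lot_number", "TEXT"),
    3: ("sampled_on", "DATE"),
}

# (sample_id, attribute_id, value) rows, standing in for sample_attribute.
sample_attribute = [
    (100, 1, "98.6"),
    (100, 2, "T-12"),
    (100, 3, "2020-01-31"),
    (101, 2, "T-13"),
]


def decode(raw, data_type):
    """Convert a stored string back to the type its attribute is flagged with."""
    return PARSERS[data_type](raw)


def load_sample(sample_id):
    """Assemble one sample's attributes into a name -> typed value mapping."""
    return {
        attributes[attribute_id][0]: decode(raw, attributes[attribute_id][1])
        for sid, attribute_id, raw in sample_attribute
        if sid == sample_id
    }
```

Every read has to go through this reassembly and conversion step, and the database itself can no longer enforce that a NUMBER attribute actually holds a number.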
I work on both a commercial and a home-made system where users have the ability to create their own fields/controls dynamically. This is a simplified version of how it works.
Tables:
Pages
Controls
Values
A page is just a container for one or more controls. It can be given a name.
Controls are linked to pages and represent user input controls.
A control contains what datatype it is (int, string etc) and how it should be represented to the user (textbox, dropdown, checkboxes etc).
Values are the actual data that the users have typed into the controls. A value contains one column for every datatype that it can represent (int, string, etc) and, depending on the control type, the relevant column is set with the user input.
There is an additional column in Values which specifies which group the value belongs to.
Each time a user fills in a form of controls and clicks save, the values typed into the controls are saved into the same group so that we know that they belong together (incremental counter).
CodeSpeaker,
I like your answer; it's pointing me in the right direction for a similar problem.
But how would you handle drop-down list values?
I am thinking of a Lookup table of values so that many lookups link to one UserDefinedField.
But I also have another problem to add to the mix. Each field must have multiple linked languages, so each value must link to the equivalent value for multiple languages.
Maybe I'm thinking too hard about this as I've got about 6 tables so far.
