How to Increment the ID value in Cassandra Table Automatically? - cassandra-2.0

I have a challenge when inserting values into a Cassandra table. I have a column named "ID", and I need its values to increase automatically, like a MySQL auto_increment column. I think the counter data type is not suitable in this scenario. Can anyone help me design the schema? I also don't want to replace the ID column with UUIDs.

In short, I don't believe it is possible. By its nature, Cassandra does not do a read before write. The only exception is lightweight transactions, but all they provide is a "compare and swap" operation; there is no way to implement auto-increment on the server side.
Even with counters you won't be able to achieve the desired result: if you increment the counter every time you add a record to the table, you will not know whether the current value (even if it is totally consistent) is the result of an increment from your process or from a concurrent process.
The only way is to implement this mechanism on the application side.
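For illustration, here is a minimal sketch of what application-side allocation could look like with the Python driver, using a lightweight transaction as the compare-and-swap primitive. The keyspace, the id_allocator table, and the retry loop are assumptions made for this sketch, not a recommended pattern for high write rates (LWTs are comparatively expensive).

# Minimal sketch: application-side ID allocation via compare-and-swap (LWT).
# The "demo" keyspace and "id_allocator" table are assumptions for this example.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")
session.execute(
    "CREATE TABLE IF NOT EXISTS id_allocator (name text PRIMARY KEY, next_id bigint)"
)
session.execute(
    "INSERT INTO id_allocator (name, next_id) VALUES (%s, 0) IF NOT EXISTS", ("my_table",)
)

def allocate_id(name="my_table"):
    """Reserve the next ID by CAS-ing the allocator row; retry on contention."""
    while True:
        current = session.execute(
            "SELECT next_id FROM id_allocator WHERE name = %s", (name,)
        ).one().next_id
        result = session.execute(
            "UPDATE id_allocator SET next_id = %s WHERE name = %s IF next_id = %s",
            (current + 1, name, current),
        )
        if result.was_applied:  # the CAS succeeded, so this ID is ours alone
            return current + 1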

Related

Stop HBase update operation if it has the same value

I have a table in HBase named 'xyz'. When I do an update operation on this table, it adds a new version even when it is the same record. How can I prevent the second record from being added?
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
The put statements above produce two records with different timestamps when I check all versions. I expect the second record not to be added, because it has the same value.
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
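As a rough illustration, here is what a get-before-put check might look like with the happybase Python client; the host, the ns:xyz table, and the cf1:name column follow the example above, and the extra GET per write is exactly the cost being warned about.

# Minimal sketch: read the current value first, and only write when it differs.
# Connection details are assumptions; this doubles the round-trips per write.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("ns:xyz")

def put_if_changed(row_key, column, value):
    """Skip the put when the stored value already equals the new value."""
    current = table.row(row_key, columns=[column])
    if current.get(column) == value:
        return False                      # same value already stored, nothing written
    table.put(row_key, {column: value})
    return True

put_if_changed(b"1", b"cf1:name", b"NewYork")   # writes a cell
put_if_changed(b"1", b"cf1:name", b"NewYork")   # skipped, value unchanged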
As mentioned by @Ben Watson, HBase is best known for its write performance precisely because it doesn't need to check for the existence of a value; multiple versions are maintained by default.
One hack you can use is custom versioning. For example, suppose a row key already has two versions. If you then insert the same record with the same timestamp, HBase will simply overwrite the existing cell with the new value rather than adding another version.
NOTE: It is left to your application to supply the same timestamp for a particular value.
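A minimal sketch of that custom-timestamp idea with the happybase client is below; the fixed timestamp value is an arbitrary assumption chosen by the application.

# Minimal sketch: writing with an explicit, application-chosen timestamp so that
# repeated writes overwrite the same cell version instead of adding new ones.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("ns:xyz")

FIXED_TS = 1                               # application-chosen version/timestamp
table.put(b"1", {b"cf1:name": b"NewYork"}, timestamp=FIXED_TS)
table.put(b"1", {b"cf1:name": b"NewYork"}, timestamp=FIXED_TS)  # overwrites, no extra version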

how to generate id for message when using Kafka stream?

I want to try implementing a normal chat system after having read many articles on Confluent Kafka, but I have run into some problems with the structural design.
When using MySQL as my database, I can give an ID to every meaningful record, like user_id in the user table and message_id in the message table. Having an ID in each model table makes communication between client and server very convenient.
But in Kafka Streams, how can I give every meaningful model a unique ID in a KTable? Or is it really necessary for me to do this?
Maybe I can answer the question myself.
In MySQL we can directly use a sequence ID because all data goes to one place and is automatically allocated a new ID. But when the table grows too large, we need to split it into several smaller tables. In that case we also have to regenerate a unique ID for each record, because the auto-generated IDs in those tables all start from 0.
It may be the same in Kafka. When we have only one partition, we can use the ID Kafka generates, because all messages go to one place and will never be duplicated. But when we want more partitions, we have to be careful: the IDs generated by different partitions are not globally unique.
So what we should do is generate the ID ourselves. UUIDs are a fast way to do this, but if we want a number we can use a small algorithm. In a distributed environment we might use a structure like this:
[nodeid+threadId+current_time+auto_increased_number]
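For illustration, a minimal sketch of that composite-ID idea, packing the parts into a single integer; the bit widths, field order, and use of the current time in milliseconds are arbitrary assumptions rather than a standard layout.

# Minimal sketch: timestamp | node | thread | sequence packed into one integer.
# Bit widths here are arbitrary assumptions, not a standard (e.g. Snowflake) layout.
import itertools
import threading
import time

NODE_BITS, THREAD_BITS, SEQ_BITS = 6, 6, 12
_counter = itertools.count()

def next_id(node_id, thread_id=None):
    """Build a roughly time-ordered, cluster-unique ID from the four components."""
    if thread_id is None:
        thread_id = threading.get_ident() % (1 << THREAD_BITS)
    millis = int(time.time() * 1000)
    seq = next(_counter) % (1 << SEQ_BITS)
    return (
        (millis << (NODE_BITS + THREAD_BITS + SEQ_BITS))
        | (node_id << (THREAD_BITS + SEQ_BITS))
        | (thread_id << SEQ_BITS)
        | seq
    )

print(next_id(node_id=3))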

Why no primary key

I have inherited a database with tables that lack primary keys. It's an OLTP database. One of the tables in question has ~300k records and has no primary key, even though examining the rest of the schema tells me one column is used as a primary key, i.e. it is replicated in another table with an identical name, etc. In other words, this is not an 'end of line' table.
This database also does not implement FKs.
My question is - is there ANY valid reason for a table (in Oracle for that matter) NOT to have a primary key?
I think a PK is mandatory in almost all cases. There are lots of reasons, but I'll cover some of them:
to prevent inserting duplicate rows
rows will be referenced, so the table needs a key for that purpose
I have seen very few cases where tables are created without a PK (e.g. tables for logs).
Not specific to Oracle, but I recall reading about one such use case where MySQL was heavily customized for a dam (electricity generation) project, I think. The input data from sensors arrived at something like 100-1000 records per second. They used timestamps for each record, so they didn't need a primary key (like the logs/logging case mentioned in another answer here).
So good reasons would be:
Overhead, in the case of high-frequency transactions
Lack of actual necessity in that case
"Uniqueness" maintained or inferred by the application, not by the DB
In a normalized table, if every record needs to be unique and every field is referenced from other tables, then a PK adds extra index overhead, and the PK might never actually be used in any SQL query (imho I disagree with this, but it's possible). The table should still have a unique index encompassing all the fields, though.
Bad reasons are infinite :-)
The most frequent bad reason, which is actually responsible for most missing primary keys, is that the DB was designed by application developers with little or no DB experience who want to (or think they should) handle all data constraints in the application.
Any valid reason? I'd say "No"--I'm a database guy--but there are places that insist on using the database as a dumb data store. They usually implement all integrity "constraints" in application code.
Putting integrity constraints into application code isn't usually done to improve performance. In fact, if you built one database that enforces all the known constraints, and you built another with functionally identical constraints only in application code, the first one would almost certainly run rings around the second one.
Instead, application-level constraints usually hope to increase flexibility. (And, in the process, some of the known constraints are usually dropped, which appears to improve performance.) If it becomes inconvenient to enforce certain constraints in order to bulk load some scruffy data, an application programmer can just side-step the application-level constraints for a little while, then clean up the data when it's more convenient.
I'm not a DB expert, but I remember a conversation with a friend who worked in the Oracle apps department and told me that this was done to handle emergencies. If there was a problem in some report being generated that you could fix by putting in a row, DB-level constraints often stand in your way. They generally implemented things like unique primary keys in the application rather than in the database. It was inefficient, but it was enough for them and much more manageable in a disaster recovery scenario.
You need a primary key to enforce uniqueness for a subset of its columns (useful if you need to refer to individual rows). It also speeds up certain queries because of the index associated to it.
If you do not need that index, or that uniqueness constraint, then you may not need a primary key (the index does not come free).
An example that comes to mind are logging tables, that just record some data (that is never updated or queried for individual records).
There is a small overhead when inserting into a table with an index, and you need an index if you have a primary key. The downside of not having one, of course, is that finding a specific row becomes very costly.

database for enterprise level using oracle - normalization and duplication

I am developing an enterprise application with an Oracle backend. I am designing a core part of the DB architecture now and I have some questions about it.
The first and most important thing is that most of my tables need to preserve old data. For example:
Consider a table with the fields
Contract No, Contract Name, Contract Person, Contract Email
I have a records like
12, xxx, yyy, xxx@zzz.ccc
and some one modifies it to
12, xxx, zzz, xxx@zzz.ccc
At any point in time I need to display the new record while still having a copy of the old record.
So what I thought was to insert a duplicate record with the old data, update the fields that were changed, and use a flag such as "is_active" = 1 to keep track of the active record.
The downside is that this creates redundancy in the table and seems like a bad design, but any other model seems unnecessarily complex, and this one seems cleaner to me. Also, I don't see any performance issues with having duplicate records. So please let me know if this is OK, or whether I am missing something here.
Sometimes, where there is a one-to-many relationship, my assumption is to have a mapping table where I map the multiple entities as individual records by repeating the master ID and changing the child ID in each record. Is this the right way to do it, or is there a better way?
Is there a book on database best practices?
Thanks.
The database I'm dealing with is Oracle 11g on a two-node RAC cluster.
Also, I don't see any performance issues with having duplicate records.
Assume you have a row that, over time, has 15 updates to it. If you don't store any temporal data (if you don't store different versions of the row), you end up storing one row. If you do store temporal data, you end up storing 15 rows.
You also need more indexes, because the id number is no longer sufficient to identify a single row.
If you have only relatively small tables, you probably won't see any performance difference. (There will be one, but it probably won't be noticeable to users.) But a table that has 10 million rows will perform differently than a table that has 150 million rows. (15 versions per row, times 10 million rows.)
Sometimes, where there is a one-to-many relationship, my assumption is to have a mapping table where I map the multiple entities as individual records by repeating the master ID and changing the child ID in each record. Is this the right way to do it, or is there a better way?
You probably need to know which child rows belong to which parent rows. So you need more than a single master id for the key. The master id alone doesn't tell you which version of that row in the parent table applies to a given child row.
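For illustration, a minimal sketch of what such keys could look like once the parent rows are versioned; the master/child table and column names are assumptions, and the DDL is simply run through the python-oracledb driver to keep the example executable.

# Minimal sketch: versioned parent rows plus a mapping table whose key includes
# the version, so each child row points at one specific version of the parent.
# All names are assumptions; python-oracledb just runs the DDL.
import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE master_versions (
        master_id  NUMBER,
        version_no NUMBER,
        is_active  NUMBER(1),
        PRIMARY KEY (master_id, version_no)
    )""")

cur.execute("""
    CREATE TABLE master_child_map (
        master_id  NUMBER,
        version_no NUMBER,
        child_id   NUMBER,
        PRIMARY KEY (master_id, version_no, child_id),
        FOREIGN KEY (master_id, version_no)
            REFERENCES master_versions (master_id, version_no)
    )""")
conn.commit()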
Is there a book on database best practices?
There are books on temporal databases. The first one that I know of is Snodgrass's Developing Time-Oriented Database Applications in SQL. It's available in several formats, and it's free. It's also kind of old, but the information in it is important to understand if you're going to be building a temporal database. Also, think about reading Date's book Temporal Data and the Relational Model.
Wikipedia has an article that summarizes the ideas behind temporal databases.
Is normalization completely mandatory?
That's a meaningless question. You will have different issues with tables normalized to 2NF than you'll have with tables normalized to 5NF or 6NF.
I would keep the old/history records in a separate table. Create an upd/del trigger to populate your audit/history table for you, and keep only the most current data in your main table.
See here for an example. Many other similar examples exist on SO.
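As a rough sketch of that approach, the trigger below copies the old values of the contract example into a hypothetical CONTRACTS_HIST table before an update or delete; the table, column, and trigger names are assumptions, and the DDL is run through the python-oracledb driver only to keep the example executable.

# Minimal sketch: a BEFORE UPDATE OR DELETE trigger that copies the old row into a
# history table, so the main table keeps only current data. Names are assumptions
# based on the contract example in the question; the CONTRACTS table must already exist.
import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="localhost/XEPDB1")
cur = conn.cursor()

# History table: same columns as the existing CONTRACTS table plus a change timestamp.
cur.execute("""
    CREATE TABLE contracts_hist AS
    SELECT c.*, SYSTIMESTAMP AS changed_at FROM contracts c WHERE 1 = 0""")

cur.execute("""
    CREATE OR REPLACE TRIGGER contracts_audit
    BEFORE UPDATE OR DELETE ON contracts
    FOR EACH ROW
    BEGIN
      INSERT INTO contracts_hist (contract_no, contract_name, contract_person,
                                  contract_email, changed_at)
      VALUES (:OLD.contract_no, :OLD.contract_name, :OLD.contract_person,
              :OLD.contract_email, SYSTIMESTAMP);
    END;""")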

Primary Key Effect on Performance in SQLite

I have an SQLite database used to store information about backup jobs. Each run it grows by approximately 25 MB as a result of adding around 32,000 entries to a particular table.
This table is a "map table" used to link certain info to records in another table... and it has a primary key (autoincrement int) that I don't use.
SQLite stores an INTEGER column in 1, 2, 3, 4, 6, or 8 bytes depending on its value. This table has only 3 additional columns, also of INTEGER type.
I've added indexes to the database on the columns that I use as filters (WHERE) in my queries.
In the presence of indexes, etc. and in the situation described, do primary keys have any useful benefit in terms of performance?
Note: Performance is very, very important to this project - but not if 10ms saved on a 32,000 entry job means an additional 10MB of data!
A primary key index is used to look up a row for a given primary key. It is also used to ensure that the primary key values are unique.
If you search your data using other columns, the primary key index will not be used, and as such will yield no performance benefit. Its mere existence should not have a negative performance impact either, though.
An unnecessary index wastes disk space, and makes INSERT and UPDATE statements execute slower. It should have no negative impact on query performance.
If you really don't use this ID, why don't you drop the column and the primary key? The only reason to keep an unused primary key ID column alive is to make it possible to create a master-detail relation with another table.
Another possibility is to keep the column but drop the primary key. That means the application has to take care of providing a unique ID with every insert statement, and before and after each batch operation you have to check whether the column is still unique. This doesn't work in, for instance, MySQL and Oracle because of concurrency issues, but it does work in SQLite.
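A minimal sketch of that approach with Python's built-in sqlite3 module is below; the map_table name and its columns are assumptions, and the uniqueness check is done once per batch rather than per row.

# Minimal sketch: the application assigns the ids itself (no PRIMARY KEY on the
# table) and verifies uniqueness after each batch. Table/column names are assumptions.
import sqlite3

conn = sqlite3.connect("backup_jobs.db")
conn.execute("CREATE TABLE IF NOT EXISTS map_table (id INTEGER, job_id INTEGER, file_id INTEGER)")

def insert_batch(rows, start_id):
    """Insert rows with application-assigned ids, then check the id column is still unique."""
    conn.executemany(
        "INSERT INTO map_table (id, job_id, file_id) VALUES (?, ?, ?)",
        [(start_id + i, job_id, file_id) for i, (job_id, file_id) in enumerate(rows)],
    )
    dupe = conn.execute(
        "SELECT id FROM map_table GROUP BY id HAVING COUNT(*) > 1"
    ).fetchone()
    if dupe is not None:
        conn.rollback()
        raise ValueError(f"duplicate id detected: {dupe[0]}")
    conn.commit()

insert_batch([(1, 101), (1, 102)], start_id=1)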
