In a real-time messaging application, I want to check whether an incoming message is unique. For this purpose, I am planning to insert a hash of the incoming message as a unique key in the db and check whether I get a unique key exception (ORA-00001 in Oracle).
Is this an efficient approach, or is there a better way to handle this case?
For those who want to know: the program is written in Java and the database is Oracle.
If you're trying to get around the performance problem of uniqueness tests on very large strings, then this is a decent way of achieving it, yes.
You might need a way to deal with hash collisions, though, as the presence of a unique key would prevent different messages that happen to have the same hash from loading. One way would be to check for an existing matching hash and, if found, do a comparison test against the full text of the message. It would keep your index size down, as you'd index on the hash rather than the message text, but it would not be completely foolproof: two identical messages could still be loaded by different sessions if the timing was exactly right (or wrong, depending on your perspective).
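For illustration, here is a minimal Java/JDBC sketch of the approach under discussion; the table and column names (messages, msg_hash, msg_text) are assumptions, not from the question, and older Oracle drivers may surface ORA-00001 as a plain SQLException with getErrorCode() == 1 rather than the subclass caught here.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.HexFormat;

public final class MessageDeduplicator {

    // Returns true if the message was new and inserted, false if a message with the
    // same hash already exists (unique constraint violation, ORA-00001).
    public static boolean insertIfUnique(Connection conn, String message) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        String hash = HexFormat.of().formatHex(                      // HexFormat is Java 17+
                sha256.digest(message.getBytes(StandardCharsets.UTF_8)));

        String sql = "INSERT INTO messages (msg_hash, msg_text) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, hash);
            ps.setString(2, message);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException e) {
            // Duplicate hash: before treating this as a duplicate message, you could
            // compare the stored msg_text here to guard against a genuine hash collision.
            return false;
        }
    }
}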
I've been using Room for a while. I come from a MySQL background, where you have to check the return values of queries. In Room I find this a bit complicated, because so far I can either declare the DAO insert method as void or as long, returning the rowId.
If I return a long, I have to write a listener to notify the UI of success/failure.
My question is, is this necessary? Do I need the return value of inserts/updates/deletes or are these queries guaranteed to succeed?
My question is, is this necessary?
This depends, and is hopefully better explained (at least a little, anyway) below.
Do I need the return value of inserts/updates/deletes or are these queries guaranteed to succeed?
There is no guarantee that they will succeed. However, you may be able to assume they have succeeded, or use conflict handling.
Much depends upon how the entities are coded. For example, say you had a simple table (entity) with an id and a name, and for simplicity that you have autoGenerate = true and you never allow the id to be specified when inserting. Unless the database is massive (beyond storage device capacity) or has been tweaked, a unique id will always result, so the insert will succeed.
If the name needs to be UNIQUE, then you are introducing a facet that makes it more likely that the insert will not succeed. If the insert's onConflict strategy is IGNORE, then a duplicate wouldn't fail, but you may still want to know whether nothing was inserted (-1 is returned).
This is just one facet. The answer is really that you need to consider the design of the database and of the app itself. Personally I'd always go with informing the user at least of the abnormal/unexpected, which probably means that yes, it is necessary (typically it is easier to suppress code than to add new code).
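As an illustration of the IGNORE case described above, here is a minimal Room sketch; the entity, DAO, and table names are made up for the example.

import androidx.room.Dao;
import androidx.room.Entity;
import androidx.room.Index;
import androidx.room.Insert;
import androidx.room.OnConflictStrategy;
import androidx.room.PrimaryKey;

// Entity with a UNIQUE index on name.
@Entity(tableName = "user", indices = {@Index(value = "name", unique = true)})
class User {
    @PrimaryKey(autoGenerate = true)
    public long id;
    public String name;
}

@Dao
interface UserDao {
    // With IGNORE, inserting a duplicate name does not throw; it returns -1 instead of a rowId.
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    long insert(User user);
}

In the calling code you could then do something like: if (userDao.insert(user) == -1) { /* nothing was inserted, e.g. the name was already taken */ }.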
I have inherited a database with tables that lack primary keys. It's an OLTP database. One of the tables in question has ~300k records and has no primary key implemented, even though examining the rest of the schema tells me one column is used as a primary key, i.e. it is replicated in another table with an identical name, etc. In other words, this is not an 'end of line' table.
This database also does not implement FKs.
My question is - is there ANY valid reason for a table (in Oracle for that matter) NOT to have a primary key?
I think a PK is mandatory in almost all cases. There are lots of reasons, but I'll cover a few of them:
It prevents duplicate rows from being inserted.
Rows will be referenced elsewhere, so the table must have a key for that.
I have seen very few cases where tables are created without a PK (e.g. tables for logs).
Not specific to Oracle, but I recall reading about one such use case where MySQL was highly customized for a dam (electricity generation) project, I think. The input data from sensors arrived on the order of 100-1000 records per second or so. They were using timestamps for each record, so they didn't need a primary key (as with the logs/logging mentioned in another answer here).
So good reasons would be:
Overhead, in the case of high-frequency transactions
No real necessity for a key in that particular case
"Uniqueness" maintained or inferred by application, not by db
In a normalized table, if every record needs to be unique and every field is referenced in other tables, then a PK additionally adds index overhead, and the PK might never actually be used in any SQL query (imho I disagree with this, but it's possible). The table should still have a unique index encompassing all the fields, though.
Bad reasons are infinite :-)
The most frequent bad reason which is actually responsible for the lack of a primary key is when DBs are designed by application/code-developers with little or no DB experience, who want to (or think they should) handle all data constraints in the application.
Any valid reason? I'd say "No"--I'm a database guy--but there are places that insist on using the database as a dumb data store. They usually implement all integrity "constraints" in application code.
Putting integrity constraints into application code isn't usually done to improve performance. In fact, if you built one database that enforces all the known constraints, and you built another with functionally identical constraints only in application code, the first one would almost certainly run rings around the second one.
Instead, application-level constraints usually hope to increase flexibility. (And, in the process, some of the known constraints are usually dropped, which appears to improve performance.) If it becomes inconvenient to enforce certain constraints in order to bulk load some scruffy data, an application programmer can just side-step the application-level constraints for a little while, then clean up the data when it's more convenient.
I'm not a db expert, but I remember a conversation with a friend who worked in the Oracle apps dept. who told me that this was done to handle emergencies. If there was a problem in some report being generated which you could fix by putting in a row, db-level constraints often stand in your way. They generally implemented things like unique primary keys in the application rather than the database. It was inefficient, but enough for them, and much more manageable in a disaster recovery scenario.
You need a primary key to enforce uniqueness for a subset of the table's columns (useful if you need to refer to individual rows). It also speeds up certain queries because of the index associated with it.
If you do not need that index, or that uniqueness constraint, then you may not need a primary key (the index does not come free).
An example that comes to mind are logging tables, that just record some data (that is never updated or queried for individual records).
There is a small overhead when inserting into a table with an index, and you need an index if you have a primary key. The downside of skipping the index, of course, is that finding an individual row becomes very costly.
This is a general design problem - I want to validate a username field for uniqueness when the user enters the value and tabs out. I do an Ajax validation and get a response from the server. This is all very standard. Now, what if I have a HUGE user database? How do I handle this situation? How do I find whether a username "foozbarz" is present among 150 million usernames?
Database queries are out of the question. [EDIT] Read the username database once and populate the cache/hash for faster lookups (to clarify Emil Vikström's point).
In-memory databases won't help either.
Keep an in-memory hash (or cache/memcache) to store all usernames - usernames can be easily hashed and lookups will be very fast. But there are some problems with this:
a. Size of the hash - can we optimize it so that we can reduce the hash size?
b. Hash/cache refresh frequency (users might get added while we are validating)
Shard the username table based on some criteria (e.g.: A-B in table username_1 and so on) - thanks piotrek for this suggestion
Or is there any other, better approach?
Why don't you simply partition the data? If you have (or plan to have) 150M+ users, I assume you have (or will have) the budget for this. If you are just starting out (with 2k users), do it the traditional way with a simple indexed search on the database. When you have so many users that you observe performance issues, and you have measured that the database is the cause (and not e.g. the web server), then you simply add another database. On the first one you will have users with names from a to m, and the rest on the other one. You may choose another criterion, such as a hash, to keep the data balanced. When you need more, you add more databases. But if you don't have that many users right now, I advise you not to do any premature optimization; there are many other things that may become a bottleneck with this amount of data.
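To make the hash-based routing above concrete, here is a small hedged Java sketch; the class name and the idea of wiring one DataSource per shard are assumptions for illustration. Note that adding databases later changes which bucket a name falls into, so a real setup would need consistent hashing or a migration step.

import java.util.List;
import javax.sql.DataSource;

// Routes a username to one of N databases by hashing the name, so the data stays balanced.
public class UserShardRouter {

    private final List<DataSource> shards;   // one DataSource per shard database

    public UserShardRouter(List<DataSource> shards) {
        this.shards = shards;
    }

    // Pick the shard responsible for this username.
    public DataSource shardFor(String username) {
        int bucket = Math.floorMod(username.toLowerCase().hashCode(), shards.size());
        return shards.get(bucket);
    }
}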
You are most likely right about doing some kind of hashing where you store the taken names; obviously, a name that isn't in the hash is free.
What you shouldn't do is rely on that validation alone. There can be a lot of time between the user checking whether a name is free and the user pressing Register.
To be fair, you only have one issue here, and that's whether you REALLY need to worry about getting 150 million users. Scalability is often an issue, but unless this happens overnight, you can probably swap in a better solution before it does.
Secondly, there's your worry about two users both getting a THIS NAME IS FREE and then one taking it. First of all, the chances of that happening are pretty low. Secondly, the only ways I can think of ‘solving’ this, so that a user will never click OK with a validated name and get a USERNAME TAKEN, are to either
a) Remember which name the user validated last, store that, and if someone else registers it in the meantime, use AJAX to change the name field to taken and notify the user. Don't do this; it wastes a lot of cycles and is really too much effort to implement.
b) Lock usernames for a short period of time as a user validates one. This results in a lot of free usernames showing up as taken when they actually aren't. You probably don't want this either.
The easiest solution is to simply put the (hashed) names into the table when the user actually clicks OK, but before doing that, check whether the name exists one more time. If it does, just send the user back with USERNAME TAKEN. The chances of someone racing someone else for a name are really, really slim, and I doubt anyone will make a big fuss over how your validator (which did its job; the name was free at the point of checking) ‘lied’ to the user.
Basically your only issue is how you want to store the nicknames.
Your #1 criterion is flawed, because this is exactly what you have a database system for: to store and manage data. Why do you even have a table with usernames if you're not going to read it?
The first thing to do is improving the database system by adding an index, preferably a HASH index if your database system supports it. You will have a hard time writing anything near the performance of this yourself.
If this is not enough, you must start scaling your database, for example by building a clustered database or by partitioning the table into multiple sub-tables.
What I think is a fair thing to do is implement caching in front of the database, but for single names. Not all usernames will have a collision attempt, so you may cache the small subset where the collisions typically happen. A simple algorithm for checking the collision status of USER:
Check if USER exists in your cache. If it does:
    Set a "last checked" timestamp for USER inside the cache.
    You are done, and USER is a collision.
Otherwise, check the database for USER. If it does exist:
    Add USER to the cache.
    If the cache is full (all X slots are used), remove the least recently used username from the cache (or the Y least recently used usernames, if you want to minimize cache pruning).
    You are done, and USER is a collision.
If USER matched neither the cache nor the db, you are done and USER is NOT a collision.
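Here is a minimal Java sketch of that algorithm, assuming a hypothetical userExistsInDb() database lookup; an access-ordered LinkedHashMap stands in for the "last checked" bookkeeping, since it evicts the least recently used entry automatically.

import java.util.LinkedHashMap;
import java.util.Map;

public class UsernameCollisionChecker {

    private static final int MAX_ENTRIES = 100_000;   // the "X slots" of the cache

    // Access-ordered LinkedHashMap: evicts the least recently used entry when the cache is full.
    private final Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    public synchronized boolean isCollision(String username) {
        if (cache.get(username) != null) {      // cache hit also refreshes the entry's position
            return true;                         // USER is a collision
        }
        if (userExistsInDb(username)) {
            cache.put(username, Boolean.TRUE);   // add to cache; the LRU entry is pruned if full
            return true;                         // USER is a collision
        }
        return false;                            // not in cache or db: USER is NOT a collision
    }

    // Placeholder for the real database lookup (e.g. SELECT 1 FROM users WHERE username = ?).
    private boolean userExistsInDb(String username) {
        return false;
    }
}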
You will of course still need a UNIQUE constraint in your database to avoid race conditions.
If you're going the traditional route you could use an appropriate index to improve the database lookup.
You could also try using something like ElasticSearch which has very low latency lookups on large data sets.
If you have 150M+ users, you will have to have in place some function that:
Checks that the user exists, and signals if not found
Verifies the password is correct, and signals if it is not
Retrieves the user's data
You will have this problem and will have to solve it, in all likelihood with something akin to a user lookup query. Even if you rely heavily on sessions, you still have the problem of "finding session X among many in a 150M+ pool", which is structurally identical to "finding user X among many in a 150M+ pool".
Once you solve the bigger problem, the problem you now have is just its step #1.
So I'd check out a scalable database solution (possibly a NoSQL one), and implement the "availability check" using that.
You might end up with a
retrieveUserData(user, password = None)
which returns the user info if the user and password are valid and correct. For the availability check, you would send no password and expect a UserNotFound exception if the username is available.
I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data on a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?
That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
The indexes support the "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.
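For illustration, a hedged Java/JDBC sketch against a hypothetical table created as create table product_attrs (product_id bigint primary key, attrs hstore); the hstore exist() function is used instead of the ? operator so that it doesn't collide with the JDBC parameter placeholder.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class HstoreExample {

    // Insert a row whose attrs column is an hstore literal, e.g. "\"color\"=>\"red\", \"size\"=>\"XL\"".
    public static void insertAttrs(Connection conn, long productId, String hstoreLiteral) throws SQLException {
        String sql = "INSERT INTO product_attrs (product_id, attrs) VALUES (?, ?::hstore)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, productId);
            ps.setString(2, hstoreLiteral);
            ps.executeUpdate();
        }
    }

    // Fetch one attribute; exist(attrs, 'color') is the function form of the hstore ? operator.
    public static String findColor(Connection conn, long productId) throws SQLException {
        String sql = "SELECT attrs -> 'color' FROM product_attrs "
                   + "WHERE product_id = ? AND exist(attrs, 'color')";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, productId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}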
hstore is generally a good solution for this, but personally I prefer to use plain key:value tables: one table with definitions, another table with values, a relation to bind values to definitions, and a relation to bind values to the particular record in the other table.
Why am I against hstore? Because it's like the registry pattern, often mentioned as an example of an anti-pattern. You can put anything in there, it's hard to easily validate whether it's still needed, and when loading a whole row (especially through an ORM) the whole hstore is loaded, which can contain a lot of junk and very little of use. Not to mention that the hstore data type has to be converted into your language's types and converted back again when saving, so you get some type-conversion overhead.
So I'm actually trying to convert all the hstores at the company I'm working for into simple key:value tables. It's not that hard a task, though. The structures kept in hstore here are huge (or at least big), and reading/writing such an object creates a huge overhead of function calls, so even a simple query like "select * from base_product where id = 1;" makes the server sweat and hurts performance badly. I want to point out that the performance issue is not because of the db, but because Python has to convert the results received from Postgres several times, whereas key:value tables do not require such conversion.
As you do not control the data, do not try to overcomplicate this.
create table sometable_attributes (
  sometable_id int not null references sometable(sometable_id),
  attribute_key varchar(50) not null check (length(attribute_key) > 0),
  attribute_value varchar(5000) not null,
  primary key (sometable_id, attribute_key)
);
This is like EAV, but without an attribute_keys table, which would add no value if you do not control what will be stored there.
For speed you should periodically do "cluster sometable_attributes using sometable_attributes_idx", so all attributes for one row will be physically close.
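As a usage example, here is a small Java/JDBC sketch that reads all attributes of one row from the table above into a Map; connection handling and error handling are left out for brevity.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

public final class AttributeLoader {

    // Load all key/value attributes of one sometable row into a Map.
    public static Map<String, String> loadAttributes(Connection conn, int sometableId) throws SQLException {
        String sql = "SELECT attribute_key, attribute_value FROM sometable_attributes WHERE sometable_id = ?";
        Map<String, String> attrs = new LinkedHashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, sometableId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    attrs.put(rs.getString("attribute_key"), rs.getString("attribute_value"));
                }
            }
        }
        return attrs;
    }
}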
I have a Message entity that has a messageID property. I'd like to ensure that there's only ever one instance of a Message entity with a given messageID. In SQL, I'd just add a unique constraint to the messageID column, but I don't know how to do this with Core Data. I don't believe it can be done in the data model itself, so how do you go about it?
My initial thought is to use a validation method to do a fetch on the NSManagedObject's context for the ID, see if it finds anything but itself, and if so, fail the validation. I suspect this will work - but I'm worried about the performance of something like that. I went through a lot of effort to minimize the fetch requests needed for the entire import routine, and having it validate by performing a fetch for every single new Message entity seems a bit excessive. I can get all pre-existing objects I need and identify all the new objects I need to insert into the store using just two fetch requests before I do the actual work of importing and connecting everything together. This would add a fetch to every single update or insert in addition to those two - which would seem to eliminate any performance advantage I gained by pre-processing the import data in the first place!
The main reason this is an issue is that the importer can (potentially) run several batches concurrently on several threads and may include some overlapping/duplicate data that needs to ultimately result in just one object in the store and not duplicate entries. Is there a reasonable way to do this and does what I'm asking for make sense for Core Data?
The only way to guarantee uniqueness is to do a fetch. Fortunately you can just do a -countForFetchRequest:error: and check to see if it is zero or not. That is the least expensive way to guarantee uniqueness at this time.
You can probably accomplish this in the validation or run it in the loop that is processing the data. Personally I would do it above the creation of the NSManagedObject so that you do not have the unnecessary allocs when a record already exists.
I don't think there is a way to easily guarantee an attribute is unique without doing a lot of work on your own. You can, of course use CFUUIDCreate to create a globally unique UUID, which should be unique, even in a multithreaded environment. But...
The objectID (of type NSManagedObjectID) of every managed object is guaranteed to be unique within the persistent store coordinator. Since you can add arbitrarily many persistent stores to the coordinator, this effectively makes objectIDs globally unique. Why don't you use the objectID as your messageID? You can't, of course, change the objectID once it's assigned (and it won't get assigned until the context containing the inserted object is saved; until then it will be a temporary, but still unique, ID).
So you have an NSManagedObjectContext for each thread, backed by the same persistent store, is that correct? And before you save each context, you'd like to make sure the messageID is unique, that is, that you are not updating an existing row and that it is not in one of the other contexts, correct?
Given that model (correct me if I misunderstand), I think you'd be better served having one object that manages access to the persistent store. That way, all threads would update one context and you can do your validation in there, using Marcus's -countForFetchRequest:error: suggestion. Granted, that places a bottleneck on this operation.
Just to add my 2 cents: I think inconsistencies will occur sooner or later anyway, and the only way to mitigate them seems to be to do it at the application level with rather complex code.
So in my case I decided to allow duplicate values for what are supposed to be "unique" fields.
I added code, however, that detects these problems later (e.g. when a fetch that should return 1 object returns more than 1) and fixes them when they occur (usually by deleting).
It's a "go ahead, make a mistake, I'll fix it later for you" strategy.
This is not ideal, of course, but a valid way to attack this problem, imho.