Username uniqueness validation - Design Approach - algorithm

This is a general design problem - I want to validate a username field for uniqueness when the user enters the value and tabs out. I do a Ajax validation and get a response from the server. This is all very standard. Now, what if I have a HUGE user database ? How to handle this situation ? I want to find if a username "foozbarz" is present among 150Million usernames ?
Database queries are out of question [EDIT] - Read the username database once and populate the cache/hash for faster lookup (to clarify Emil Vikström's point)
In memory databases wont help either
Keep an in-memory hash (or cache/memcache) to store all usernames - usernames can be easily hashed and lookup will be very fast. But there are some problems with this:
a. Size of the hash - can we optimize so that we can reduce the hash size ?
b. Hash/cache refresh frequencies (users might get added while we are validating)
Shard the username table based on some criteria (e.g.: A-B in table username_1 and so on) - thanks piotrek for this suggestion
Or, any other better approach ?

why don't you simply partition the data? if you have/plan to have 150M+ users i assume you have/will have budget for this. if you are just starting (with 2k users) do it traditional way with simple indexed search on database. when you have so many users that you observe performance issues and measure that this is because of your database (and not e.g. www server) then you simply put another database. on the first one you will have users with name from a to m and rest on the other one. you may choose other criterion, like hash, to make data be balanced. when you need more you will add more databases. but if you don't have so many users right now, i advise you not to do any premature optimizations. there are many things that may become a bottleneck with this amount of data

You are most likely right about doing some kind of hashing where you store the taken names and, obviously, not hashed means it's free.
What you shouldn't do is rely on that validation. There can be a lot of time between user pressing Register and user checking if name is free.
To be fair, you only have one issue here and that's consideration for whether you REALLY need to worry whether you will get 150 million users. Scalability is often an issue, but unless this happens over night, you can probably swap in a better solution before this happens.
Secondly, your worry about both users getting a THIS NAME IS FREE and then one taking it. First of all, the chances of that happening are pretty damn low. Secondly, the only ways I can think of ‘solving’ this in a way where user will never click OK with validated name and get a USERNAME TAKEN is to either
a) Remember what user validated last, store that, and if someone else registers that in a mean time, use AJAX to change the name field to taken and notify the user. Don't do this. A lot of wasted cycles and really too much effort to implement.
b) Lock usernames as user validates one, for a short period of time. This results in a lot of free usernames coming up as taken when they actually aren't. You probably don't want this either.
The easiest solution for this is to simply put hash things into the table as users actually click OK, but before doing that, check if the name exists again. If it does, just send the user back with USERNAME TAKEN. The chances of someone racing someone else for a name are really, really slim and I doubt anyone will make a big fuss over how your validator (which did its job, the name was free at the point of checking) ‘lied’ to the user.
Basically your only issue is how you want to store the nicknames.

Your #1 criteria is flawed because this is exactly what you have a database system for: to store and manage data. Why do you even have a table with usernames if you're not going to read it?
The first thing to do is improving the database system by adding an index, preferably a HASH index if your database system supports it. You will have a hard time writing anything near the performance of this yourself.
If this is not enough, you must start scaling your database, for example by building a clustered database or by partitioning the table into multiple sub-tables.
What I think is a fair thing to do is implement caching in front of the database, but for single names. Not all usernames will have a collision attempt, so you may cache the small subset where the collisions typically happen. A simple algorithm for checking the collision status of USER:
Check if USER exist in your cache. If it does:
Set a "last checked" timestamp for USER inside the cache
You are done and USER is a collision
Check the database for USER. If it does exist:
Add USER to the cache
If the cache is full (all X slots is used), remove the least recently used username from the cache (or the Y least recently used usernames, if you want to minimize cache pruning).
You are done and USER is a collision
If it didn't match the cache or the db, you are done and USER is NOT a collision.
You will of course still need a UNIQUE contraint in your database to avoid race conditions.

If you're going the traditional route you could use an appropriate index to improve the database lookup.
You could also try using something like ElasticSearch which has very low latency lookups on large data sets.

If you have 150M+ users, you will have to have in place some function that:
Checks that the user exists, and signals if not found
Verifies the password is correct, and signals if it is not
Retrieves the user's data
This problem you will have, and will have to solve it. In all likelihood with something akin to a user's query. Even if you heavily rely on sessions, still you will have the problem of "finding session X among many from a 150M+ pool", which is structurally identical to "finding user X among many from a 150M+ pool".
Once you solve the bigger problem, the problem you now have is just its step #1.
So I'd check out a scalable database solution (possibly a NoSQL one), and implement the "availability check" using that.
You might end with a
retrieveUserData(user, password = None)
which returns the user info if user and password are valid and correct. For the availability check, you would send no password, and expect an UserNotFound exception if the username is available.

Related

Tables being affected by current transactions in Oracle

We are designing a Web-based application with Oracle backend for Conference Room reservation. Also - we have decided to use optimistic locking because we expect the number of collisions to be on the lower side.
Now - in case of optimistic locking, there is always a possibility of "Data Already Modified by Somebody Else" scenario.
And our UI involves quite a few fields to be entered. And displaying a message such as "Data has already been modified" is not a pleasant experience for the end-user; especially after he/she has entered say 15+ field.
What I am contemplating is displaying "Collision Probability" when the end user starts his/her UI session based upon first few entries he/she has entered on the UI.
This CP (collision probability) will be dynamically calculated by taking into account the database sessions which are in progress against given table/columns.
For example - if both Person A and Person B are viewing information for Conference Room X - then both will be shown a higher CP (collision probability).
In such case - either of them can decide to wait few seconds. This will be better than re-entering all the data.
Now my question : In Oracle, is there a way to determine which sessions are going against which tables and ROWID ?
"our UI involves quite a few fields to be entered."
This seems like the sort of problem which could be solved by better flow. Minimise the number of fields a user needs to enter before they can get a list of available suitable rooms. When they choose a room, use pessimistic locking to ensure nobody else can snatch the room while they are completing the booking application. Stash a copy of the initial fields so they can re-run the original query if they change their mind.
Of course, this means maintaining a session and handling state, and we all know web applications suck at that. Which is a way of saying that we often use web technologies when they aren't suited to the application we're writing.

Database efficiency: references/pointers from table to table

I am working on learning databases and am unsure about something that doesn't seem to make any sense to me. In the relational model you are able to combine through references but always require a global sort of key in each table to be able to combine this information. That is obviously required in most cases, but I feel like in a perfect tree hierarchy set up of a database this is inefficient.
To explain this better I shall use the example of storing products in a database. Products have main categories and sub categories and these are very clear. (ie. Milk is a subcategory of Dairy which is a subcategory of Food, etc.)
I thought in cases like this the ability to store single or a list of references/pointers to tables in fields would take away a lot of search querying and storage requirements.
Here is a link to a simple pain layout I made to illustrate this:
Image (the table entry could have some command character like '|' after which it knows the following entry is a file directory so when the database initiates it knows to make a pointer there)
Since I am only learning to work with databases now I understand that I may just be missing some knowledge on the subject, but I don't seem to find anything when I try googling this problem. Any help explaining where to start or any confirmation that this may improve efficiency and where I could learn how to write this myself would be great.
The concept of "pointer" is useful only if the object you want to point to has a well-defined address that is at least as permanent as the pointer itself. If the address is less permanent, you could end up with a "dangling" pointer.
A row in the database does not necessarily have a permanent address.1 By referencing the row through a logical value (instead of the physical address), the reference stays valid even when the row physically moves.2 And to ensure that the value identifies exactly one row, it must be unique.3
As for storing the list of values (be it "pointers" or anything else) inside a single field, this violates the principle of atomicity and therefore the 1NF. There are very good reasons to avoid violating the 1NF, including the ability to maintain the referential integrity and utilize indexing. That being said, there are DBMSes that support arrays or even sub-tables within a single field, which may be useful on rare occasions.
1 For example, Oracle ROWID is constant as long as the row is not physically moved on disk, but that can happen in many situations that are part of the normal database operation. So aside from putting severe restrictions on how your database is used, you couldn't rely on the ROWID staying constant over the lifetime of the rows that reference it (which could be as long as the lifetime of the database itself).
2 I suppose it would be theoretically possible for a DBMS to keep track of all the pointers and update them when the row physically moves. However, I'm not aware of any DBMS that actually supports such "updatable" pointers in practice, probably because the underlying mechanism needed for that wouldn't be any more efficient than the standard "value-based" referencing.
3 And must obviously be non-NULL. Saying that the attribute (or combination thereof) is "non-NULL and unique", is synonymous to saying it's a "key". Ideally, the key should also be immutable (so there is no need for a cascading referential action such as ON UPDATE CASCADE).

Database design: Same table structure but different table

My latest project deals with a lot of "staging" data.
Like when a customer registers, the data is stored in "customer_temp" table, and when he is verified, the data is moved to "customer" table.
Before I start shooting e-mails, go on a rampage on how I think this is wrong and you should just put a flag on the row, there is always a chance that I'm the idiot.
Can anybody explain to me why this is desirable?
Creating 2 tables with the same structure, populating a table (table 1), then moving the whole row to a different table (table 2) when certain events occur.
I can understand if table 2 will store archival, non seldom used data.
But I can't understand if table 2 stores live data that can changes constantly.
To recap:
Can anyone explain how wrong (or right) this seemingly counter-productive approach is?
If there is a significant difference between a "customer" and a "potential customer" in the business logic, separating them out in the database can make sense (you don't need to always remember to query by the flag, for example). In particular if the data stored for the two may diverge in the future.
It makes reporting somewhat easier and reduces the chances of treating both types of entities as the same one.
As you say, however, this does look redundant and would probably not be the way most people design the database.
There seems to be several explanations about why would you want "customer_temp".
As you noted would be for archival purposes. To allow analyzing data but in that case the historical data should be aggregated according to some interesting query. However it using live data does not sound plausible
As oded noted, there could be a certain business logic that differentiates between customer and potential customer.
Or it could be a security feature which requires logging all attempts to register a customer in addition to storing approved customers.
Any time I see a permenant table names "customer_temp" I see a red flag. This typically means that someone was working through a problem as they were going along and didn't think ahead about it.
As for the structure you describe there are some advantages. For example the tables could be indexed differently or placed on different File locations for performance.
But typically these advantages aren't worth the cost cost of keeping the structures in synch for changes (adding a column to different tables searching for two sets of dependencies etc. )
If you really need them to be treated differently then its better to handle that by adding a layer of abstraction with a view rather than creating two separate models.
I would have used a single table design, as you suggest. But I only know what you posted about the case. Before deciding that the designer was an idiot, I would want to know what other consequences, intended or unintended, may have followed from the two table design.
For, example, it may reduce contention between processes that are storing new potential customers and processes accessing the existing customer base. Or it may permit certain columns to be constrained to be not null in the customer table that are permitted to be null in the potential customer table. Or it may permit write access to the customer table to be tightly controlled, and unavailable to operations that originate from the web.
Or the original designer may simply not have seen the benefits you and I see in a single table design.

How does Facebook do it?

Have you ever noticed how facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to do this is. I don’t think going through the friends list, and the list of users who “liked this” and comparing them is efficient at all! Do they keep a track of this in the database? That will make the database size very huge.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
Facebook uses Cassandra, a NoSQL database for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes they definitely keep it in their database as they definitely have more than 1 server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an index can perform O(logn) = 20 operations (in the worst case) only to find what you need.
For 2 million, you only need 21 operations (in the worst case) to find what you need.
Every time you double the amount of users to go through you simply need only 1 more operation (in the worst case) with a O(logn) index.
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger(which automatically gets executed as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status, then it will get executed every time when someone likes your status and that too implicitly (automatically).
This makes the operation way too easy and Facebook doesn't have to manually update the database or store a huge database for this. Also,this approach is a faster one.
In designing social networking software (mothsorchid.com) I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at the time of page load to count how many friends and others 'liked this', when someone 'likes' something that is recorded on the object, and when retrieving the object one can compare with the current user's friend list. If someone updates their profile/makes a comment/etc it sends notification objects to friends which are pre-cached in their feeds. Cuts down tremendously on database work at expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is of course in a database). When you view that entry, they match it against your friends list to see which of them is your friend. Voila.
A lot of this are explained by the Director of Engineering of Facebook in this QCon presentation :
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.....

Should I store reference data in my application memory, or in the database?

I am faced with the choice where to store some reference data (essentially drop down values) for my application. This data will not change (or if it does, I am fine with needing to restart the application), and will be frequently accessed as part of an AJAX autocomplete widget (so there may be several queries against this data by one user filling out one field).
Suppose each record looks something like this:
category
effective_date
expiration_date
field_A
field_B
field_C
field_D
The autocomplete query will need to check the input string against 4 fields in each record and discrete parameters against the category and effective/expiration dates, so if this were a SQL query, it would have a where clause that looks something like:
... WHERE category = ?
AND effective_date < ?
AND expiration_date > ?
AND (colA LIKE ? OR colB LIKE ? OR colC LIKE ?)
I feel like this might be a rather inefficient query, but I suppose I don't know enough about how databases optimize their indexes, etc. I do know that a lot of really smart people work really hard to make database engines really fast at this exact type of thing.
The alternative I see is to store it in my application memory. I could have a list of these records for each category, and then iterate over each record in the category to see if the filter criteria is met. This is definitely O(n), since I need to examine every record in the category.
Has anyone faced a similar choice? Do you have any insight to offer?
EDIT: Thanks for the insight, folks. Sending the entire data set down to the client is not really an option, since the data set is so large (several MB).
Definitely cache it in memory if it's not changing during the lifetime of the application. You're right, you don't want to be going back to the database for each call, because it's completely unnecessary.
There's can be debate about exactly how much to cache on the server (I tend to cache as little as possible until I really need to), but for information that will not change and will be accessed repeatedly, you should almost always cache that in the Application object.
Given the number of directions you're coming at this data (filtering on 6 or more columns), I'm not sure how much more you'll be able to optimize the information in memory. The first thing I would try is to store it in a list in the Application object, and query it using LINQ-to-objects. Or, if there is one field that is used significantly more than the others, or try using a Dictionary instead of a list. If the performance continues to be a problem, try using storing it in a DataSet and setting indexes on it (but of course you loose some code-simplicity and maintainability this way).
I do not think there is a one size fits all answer to your question. Depending on the data size and usage patterns the answer will vary. More than that the answer may change over time.
This is why in my development I built some intermediate layer which allows me to change how the caching is done by changing configuration (with no code changes). Every while we analyze various stats (cache hit ratio, etc.) and decide if we want to change cache behavior.
BTW there is also a third layer - you can push your static data to the browser and cache it there too
Can you just hard-wire it into the program (as long as you stick to DRY)? Changing it only requires a rebuild.

Resources