NoSQL: Enums vs Strings

Just curious how others deal with enums & nosql? Is it better to store an attribute as an enum value or a string? Does this affect the size or performance of the database in some cases? For example, just think of, let's say, a pro sports player... his sport type could be Football, Hockey, Baseball, Basketball, etc... string vs enum, what do you all think?

You should be using enums in your code - strong typing helps avoid a lot of mistakes - and converting them to strings or numbers for storage.
Strings do require significantly more storage space - "Basketball" is 10-20 bytes depending on encoding, whereas if you store it as the number 4 it needs only 1 byte. However, there are very few cases where this will actually matter - with a million records, the difference is still less than 20MB in total database size. Strings are easier to work with and less likely to fail silently if the enumeration changes, so use strings.
Strings are also slower than numbers for most operations, including conversion to enum on load. However, the difference is orders of magnitude less than the time taken to retrieve anything at all from the database, so doesn't matter.
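A minimal sketch of that advice in Python, assuming a hypothetical Sport enum and a plain dict standing in for the stored document (neither comes from the question): the code works with the enum, and only the string value is written to storage.

```python
from enum import Enum
from typing import Tuple


class Sport(Enum):
    """Hypothetical enum for a player's sport."""
    FOOTBALL = "Football"
    HOCKEY = "Hockey"
    BASEBALL = "Baseball"
    BASKETBALL = "Basketball"


def to_document(name: str, sport: Sport) -> dict:
    # Store the enum's string value rather than a numeric position.
    return {"name": name, "sport": sport.value}


def from_document(doc: dict) -> Tuple[str, Sport]:
    # Sport(...) raises ValueError for an unknown string, so a stale or
    # misspelled value fails loudly instead of silently.
    return doc["name"], Sport(doc["sport"])


doc = to_document("Jane Doe", Sport.HOCKEY)
name, sport = from_document(doc)
```

Because the stored value is the string, reordering or extending the enum later does not change the meaning of existing records.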

Strings are better from a portability perspective, and enum types are not supported by popular DBMSs such as MS SQL Server and many others.
You can have application-level logic that validates input against an array of allowed values and just store it as a string.
EDIT:
My preference changed to strings, as CakePHP (which I use for web apps) no longer supports enums because of portability concerns.

Related

How can I generate a unique identifier that is apparently not progressive [duplicate]

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question: they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
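The scheme described above can be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: the 54-character alphabet (letters without I, L, O in either case, plus 2-9) is a guess at the lookup table, and the 16/32/16 bit split is one way to put the sequential id in the "center bits".

```python
import secrets

# Assumed lookup table: 23 uppercase + 23 lowercase + 8 digits = 54 characters.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"


def make_code(seq_id: int) -> int:
    """Place a 32-bit sequential id in bits 16..47 of a random 64-bit value."""
    if not 0 <= seq_id < 2**32:
        raise ValueError("seq_id must fit in 32 bits")
    high = secrets.randbits(16)  # random top 16 bits
    low = secrets.randbits(16)   # random bottom 16 bits
    return (high << 48) | (seq_id << 16) | low


def extract_id(value: int) -> int:
    """Recover the sequential id from the center bits."""
    return (value >> 16) & 0xFFFFFFFF


def encode(value: int) -> str:
    """Encode a non-negative integer using the lookup table."""
    base = len(ALPHABET)
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value:
        value, rem = divmod(value, base)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode(text: str) -> int:
    """Inverse of encode()."""
    base = len(ALPHABET)
    value = 0
    for ch in text:
        value = value * base + ALPHABET.index(ch)
    return value


code = encode(make_code(12345))
```

Uniqueness comes entirely from the sequential id; the random bits only disguise it, which is why the result looks random without being random.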
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so that obscure words and words that are too long are removed.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I can also randomize the word selection pattern between verbs and nouns, like
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel casing; you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to check how often my algorithm produces collisions. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.).
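A rough sketch of the word-based generator and the collision test, with tiny made-up word lists (a real list would come from a filtered dictionary). The names VERBS, NOUNS, make_code and collision_rate are hypothetical.

```python
import secrets

# Tiny hypothetical word lists, for illustration only.
VERBS = ["eat", "pick", "ride", "bake", "find"]
NOUNS = ["Cake", "Basket", "Flyer", "River", "Lamp"]


def make_code() -> str:
    """Return verb + noun + a random three-digit number, e.g. eatCake778."""
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(900) + 100  # 100..999
    return f"{verb}{noun}{number}"


def collision_rate(samples: int) -> float:
    """Generate `samples` codes and return the fraction that were repeats."""
    seen = set()
    collisions = 0
    for _ in range(samples):
        code = make_code()
        if code in seen:
            collisions += 1
        seen.add(code)
    return collisions / samples


# Number of distinct codes this configuration can produce.
keyspace = len(VERBS) * len(NOUNS) * 900
example = make_code()
rate = collision_rate(1000)
```

With independent random picks, collisions can only be made rare, not impossible, so a test like this measures the rate for a given keyspace; guaranteed uniqueness still needs a check against the codes already issued.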
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);
You can increase the length of the byte array and pluck out the character values you want to allow.
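The same idea sketched in Python rather than .NET, using the standard secrets module as the cryptographic source. The allowed alphabet is an assumption; bytes that would skew the distribution are discarded instead of being wrapped around with a modulo.

```python
import secrets

# Assumed set of allowed characters (31 of them, no I, L, O, 0 or 1).
ALLOWED = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def random_code(length: int) -> str:
    """Build a code by keeping only random bytes that map evenly onto ALLOWED."""
    # Largest multiple of len(ALLOWED) that fits in a byte; bytes at or above
    # this limit are skipped so every character is equally likely.
    limit = 256 - (256 % len(ALLOWED))
    chars = []
    while len(chars) < length:
        for b in secrets.token_bytes(length):
            if b < limit and len(chars) < length:
                chars.append(ALLOWED[b % len(ALLOWED)])
    return "".join(chars)


code = random_code(5)
```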
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.

Generalized Suffix Tree Java Implementation For Large Datasets

I have a collection of around 50 million strings, each around 100 characters long. I am looking for a very efficient (in running time and memory usage) generalized suffix tree implementation.
I have tried https://github.com/npgall/concurrent-trees, but it uses a huge amount of memory even though the running time is efficient. With 2.5 million strings of length 100, it already took about 50GB of memory.
Not an ideal solution, but you could use CritBit.
It has a CritBit1D version, where you can store arbitrary-length keys.
Disadvantage #1:
You would have to convert your strings to long[] first, ie. 4-8 characters per long.
Disadvantage #2:
If you need a concurrent version, you would have to look at the Critbit64COW, which uses copy-on-write concurrency. However, this is not implemented for the Critbit1D yet, so you would need to do that yourself, using Critbit64COW as a template.
However, you could simply store only a 64bit hashcode as key, then you could use the CritBit64 (single-threaded) or CritBit64COW (multithreaded).
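The two key preparations mentioned above (packing a string into 64-bit words, or reducing it to a single 64-bit hash) could look roughly like this. The sketch is in Python and does not use the CritBit API; hash64 and pack_longs are hypothetical helpers, and BLAKE2b is just one possible hash.

```python
import hashlib
from typing import List


def hash64(text: str) -> int:
    """Return an unsigned 64-bit key for a string (BLAKE2b, 8-byte digest)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def pack_longs(text: str) -> List[int]:
    """Pack a string into 64-bit integers, 8 bytes per integer, zero-padded."""
    data = text.encode("utf-8")
    data += b"\x00" * (-len(data) % 8)
    return [int.from_bytes(data[i:i + 8], "big") for i in range(0, len(data), 8)]


key = hash64("GATTACA")
words = pack_longs("A" * 100)  # a 100-character ASCII string needs 13 words
```

A hashed key only supports exact-match lookups and can collide, so it trades away the substring queries a suffix tree offers. The values here are unsigned, whereas a Java long is signed, so a port would need to account for that.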
Btw, reading concurrently is not a problem, even with CritBit64.
Disclaimer: I'm the author of CritBit.

Best way to store 1 trillion lines of information

I'm doing calculations and the resultant text file right now has 288012413 lines, with 4 columns. Sample line:
288012413; 4855 18668 5.5677643628300215
The file is nearly 12 GB.
That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?
Go ahead and use a MySQL database.
MSSQL Express has a limit of 4GB.
MS Access has a limit of 4 GB.
So these options are out. I think using a simple database like MySQL or SQLite without indexing will be your best bet. Accessing the data will probably be faster with a database anyway, and on top of that the file size may be smaller.
Well,
The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.
I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you can approximately halve your file size.
If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.
You might also want to look into compressing the file if it is just used for storing the results.
Reducing the 4th field to 3 decimal places should reduce the file to around 8GB.
If it's just array data, I would look into something like HDF5:
http://www.hdfgroup.org/HDF5/
The format is supported by most languages, has built-in compression and is well supported and widely used.
If you are going to use the result as a lookup table, why use ASCII for numeric data? why not define a struct like so:
struct x {
    long lineno;
    short thing1;
    short thing2;
    double value;
};
and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
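As a sketch of the same idea with Python's struct module: the layout below mirrors the C struct but is an assumption (little-endian, no padding, 64-bit line number), and pack_line is a hypothetical helper that parses one text line from the question.

```python
import struct

# Assumed layout: 64-bit line number, two 16-bit integers, one 64-bit float.
# "<" selects little-endian standard sizes with no padding.
RECORD = struct.Struct("<qhhd")


def pack_line(line: str) -> bytes:
    """Convert a text line such as '288012413; 4855 18668 5.56...' to binary."""
    lineno, rest = line.split(";")
    thing1, thing2, value = rest.split()
    return RECORD.pack(int(lineno), int(thing1), int(thing2), float(value))


record = pack_line("288012413; 4855 18668 5.5677643628300215")
fields = RECORD.unpack(record)
```

Each record is RECORD.size bytes (20 with this layout), against roughly 41 bytes for the text line including its newline. A C compiler may pad the struct differently, so the on-disk layout should be fixed explicitly if both sides need to read the file.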
Well, if the files are that big, and you are doing calculations that require any sort of precision with the numbers, you are not going to want a limiter. That might do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use a compression utility such as GZIP, ZIP, BlakHole or 7ZIP to compress it.
Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. If you are using Unicode encodings, that will double to quadruple the size of the file vs. ASCII.
Like AShelly, but smaller.
Assuming line #'s are continuous...
struct x {
    short thing1;
    short thing2;
    short value; // you said only 3dp, so store as fixed point n*1000; a signed short then covers about -32.768 to 32.767
};
Save it in a binary file.
lseek(), read() and write() are your friends.
The file will be large(ish) at around 1.7 GB.
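A small Python sketch of this fixed-point layout, where the file seek plays the role of lseek(). The temporary file and the helper names are made up for the example, and the values must fit a signed 16-bit integer after scaling by 1000.

```python
import os
import struct
import tempfile

# thing1, thing2, and value*1000, each as a signed 16-bit integer (6 bytes total).
RECORD = struct.Struct("<hhh")


def write_records(path, rows):
    """Write (thing1, thing2, value) rows as fixed-size binary records."""
    with open(path, "wb") as f:
        for thing1, thing2, value in rows:
            f.write(RECORD.pack(thing1, thing2, round(value * 1000)))


def read_record(path, index):
    """Read record number `index`; the line number is implied by the position."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        thing1, thing2, fixed = RECORD.unpack(f.read(RECORD.size))
    return thing1, thing2, fixed / 1000


rows = [(4855, 18668, 5.5677643628300215), (12, 34, -1.25)]
path = os.path.join(tempfile.mkdtemp(), "results.bin")
write_records(path, rows)
first = read_record(path, 0)
second = read_record(path, 1)
```

At 6 bytes per record, 288012413 records come to about 1.7 GB, which matches the estimate above.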
The most obvious answer is just "split the data". Put them to different files, eg. 1 mln lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.
Then you've got a number of answers regarding reducing data size.
Next, why keep the data as text if you have a fixed-size structure? Store the numbers in binary - this will reduce the space even more (text format is very redundant).
Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one will hold a trillion records.
If I were you, I would go with the fixed-size binary format, where each record occupies a fixed number of bytes (16-20?). Then even if I keep the data in one file, I can easily determine the position at which I need to start reading. If you need to do lookups (say by column 1) and the data is not regenerated all the time, then you could sort once by the lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.

Advice on DB design Best Practices/Standard - Oracle

I'm designing the DB for a new app, which is something I've done a thousand times, but on this occasion I suddenly started wondering about some aspects that I had never stopped to consider before. Is there some standard/recommendation for the following things?
What's the recommended data type for storing currencies (no financial operations, just displaying)?
Recommended size for storing phone numbers (internationals)
Recommended minimum size for storing first names / last names (minimum meaning smallest maximum recommended size)
Recommended minimum size for storing comment blocks (minimum also meaning smallest maximum recommended size)
I'm aware that every application has its own particular requirements to consider, but I feel that there must be something more specific than gut feeling and common sense.
Help, as always, will be deeply appreciated.
What's the recommended data type for storing currencies
This depends on what kind of currency, and to what degree of accuracy.
If it's dollars and cents, rounded to the nearest cent, it's NUMBER(12,2), which allows you to store amounts between -9,999,999,999.99 and 9,999,999,999.99 (12 significant digits in total, 2 of them after the decimal point) - which for most currencies should be enough.
If you need to store intermediate results from, say, interest rate calculations, you may need more precision, e.g. NUMBER(15,5).
If you're talking Zimbabwean dollars, perhaps you should choose the maximum NUMBER instead :)
Recommended size for storing phone numbers (internationals)
VARCHAR2(30) should be sufficient. If it's too long your users will enter all sorts of rubbish data in there.
Recommended minimum size for storing first names / last names /
Recommended minimum size for storing comment blocks
These don't apply since you're in Oracle - use VARCHAR2, so you don't have to worry about minimum size. All you need to specify is the maximum size.
Currencies:
NUMBER(15,2), really depends on how big the numbers are that you expect to run into.
Phone numbers:
VARCHAR2(30), please don't hurt me if it should be larger - can't remember the length per se just that VARCHAR allows flexibility for formatting.
I don't see the point of looking at the minimum size if using VARCHAR2. The concerns for the physical model revolve around how much space the database will consume over time, assuming fields are maxed out.
Comment blocks:
Maximum of VARCHAR2(4000)
EDIFACT generally uses 35 as the size of a Name field and I'd copy that (and document that as a basis). Newer stuff tends to be defined in XML and doesn't normally go into field length definitions.
Alternatively the Canadian post office recommends no more than 40 characters per address line.
Note, that is characters and not bytes. Sizing should take into account multi-byte characters, but obviously not all names will be the maximum length. I've used ten characters per name as a broad approximation for sizing estimates but that could vary a lot between countries, ethnicities etc.
I know you were asking minimum size for comment blocks, but for large free-text areas you ought to consider using a CLOB value. Oracle is pretty smart about how these things are handled, how the data is stored, etc. You NEVER have to worry about size. In addition, you can usually pretend that they are VARCHAR2 columns for easy manipulation.

YouTube URL algorithm?

How would you go about generating the unique video URLs that YouTube uses?
Example:
http://www.youtube.com/watch?v=CvUN8qg9lsk
YouTube uses Base64 encoding to generate IDs for each video. The characters used in the IDs are
(A-Z) + (a-z) + (0-9) + (-) + (_) (64 characters).
Using Base64 encoding and only 11 characters, they can generate more than 73 quintillion unique IDs. How large a pool of IDs is that?
Well, it's enough for everyone on earth to produce a video every single minute for about 18,000 years.
They reach such a huge number with only 11 characters (64*64*64*64*64*64*64*64*64*64*64 = 64^11); if they need more IDs, they just have to add one more character.
So when a video is uploaded to YouTube, they basically pick randomly from those 73+ quintillion possibilities and check whether the ID is already taken. If not, they use it; otherwise they pick another one.
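A minimal sketch of the pick-and-check approach this answer describes. It illustrates the description only and is not YouTube's actual implementation; the `taken` set stands in for a database lookup.

```python
import secrets
import string

# 26 + 26 + 10 + 2 = 64 characters, as listed in the answer.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
ID_LENGTH = 11

taken = set()  # stand-in for a database of ids already in use


def new_video_id() -> str:
    """Pick random 11-character ids until one is not already taken."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))
        if candidate not in taken:
            taken.add(candidate)
            return candidate


pool_size = len(ALPHABET) ** ID_LENGTH  # 64**11 == 2**66
video_id = new_video_id()
```

Because the pool is so much larger than the number of ids in use, the retry loop almost always succeeds on the first attempt.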
Refer to this video for detailed explanation.
Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).
This post by Jeff Atwood is a nice overview of the topic.
And here is an online hash calculator you can play with.
There is no need to use a hash. It is probably just a quasi-random 64 bit value passed through base64 or some equivalent.
By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.
For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result. If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.
Normal base64 would add an equals at the end, but in this case it is implied because the size is known. The character mapping could easily be something besides the standard.
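The multiply-and-encode idea can be sketched like this. The multiplier is an arbitrary odd constant chosen for illustration (not claimed to be prime); what makes the mapping one-to-one modulo 2^64 is that it is odd, so it has a modular inverse.

```python
import base64

MODULUS = 2**64
# Hypothetical odd multiplier; any odd number is invertible modulo 2**64.
MULTIPLIER = 0x9E3779B97F4A7C15
INVERSE = pow(MULTIPLIER, -1, MODULUS)  # modular inverse, Python 3.8+


def encode_id(n: int) -> str:
    """Shuffle a sequential id and encode it as 11 URL-safe base64 characters."""
    if not 0 <= n < MODULUS:
        raise ValueError("id must fit in 64 bits")
    shuffled = (n * MULTIPLIER) % MODULUS
    raw = shuffled.to_bytes(8, "big")
    # 8 bytes encode to 12 characters ending in one "=", which is dropped.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_id(text: str) -> int:
    """Inverse of encode_id(): restore the padding, decode, and unshuffle."""
    raw = base64.urlsafe_b64decode(text + "=")
    shuffled = int.from_bytes(raw, "big")
    return (shuffled * INVERSE) % MODULUS


token = encode_id(42)
```

Anyone who learns the multiplier can reverse this mapping, so it only hides the sequence from casual inspection, as the answer notes.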
Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.
An example in PHP:
$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase
Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.
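For comparison, base 62 takes only a few lines when written by hand. This sketch is in Python rather than PHP, and the digit order (digits, then lowercase, then uppercase) is an arbitrary choice.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 symbols.
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase


def to_base62(n: int) -> str:
    """Convert a non-negative integer to a base-62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return BASE62[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))


def from_base62(text: str) -> int:
    """Convert a base-62 string back to an integer."""
    n = 0
    for ch in text:
        n = n * 62 + BASE62.index(ch)
    return n


url_id = to_base62(9999)  # "2Bh" with this digit order
```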
People are talking about these other systems:
Random number/letters - Why? If you want people to not see the next video (id+1), then just make it private. On a website like youtube, where it actively shows any video it has, why bother with random ids?
Hashing an ID - This design concept really stinks. Think about it; so you have an ID guaranteed by your DBM software to be unique, and you hash it (introducing a collision factor)? Give me one reason why to even consider this idea.
Using the ID in URL - To be honest, I don't see any problems with this either, though it will grow to be large when in fact you can express the same number with fewer letters (hence my solution).
Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use this function when your data consists of a number (ie, a mix of 10 different characters, instead of 256)?
You can use any library for this; some languages, such as Python, provide it in the standard library.
Example:
import secrets
id_length = 12  # number of random bytes, not characters; 12 bytes yield a 16-character string
random_video_id = secrets.token_urlsafe(id_length)
You could generate a GUID and have that as the ID for the video.
GUIDs are very unlikely to collide.
Your best bet is probably to simply generate random strings, and keep track (in a DB for example) of which strings you've already used so you don't duplicate. This is very easy to implement and it cannot fail if properly implemented (no duplicates, etc).
I don't think that the URL v parameter has anything to do with the content (video properties, title, description etc).
It's a randomly generated string of fixed length and contains a very specific set of characters. No duplicates are allowed.
I suggest using a perfect hash function:
Perfect Hash Function for Human Readable Order Codes
As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.
The input numbers should be in sequence: 0, 1, 2, 3, and so on.
Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.
Just pick random values until you have one never seen before.
Randomly picking and exhausting all values from a set runs in expected time O(n log n): What is O value for naive random selection from finite set?
In your case you wouldn't exhaust the set, so you should get constant time picks. Just use a fast data structure to do the duplication lookups.

Resources