Performance impact of index datatype in MongoDB?

I need a new Mongo collection that associates data with an IP address, the address being the collection key. I'm wondering if there's any performance advantage to using the decimal notation of the IP address (e.g. 3299551096 as an integer) instead of the dotted notation (e.g. "198.252.206.16" as a string).
I haven't found any evidence for or against, nor any performance comparison between integer and string indexes. Is there any reason to prefer one over the other?

An integer's storage requirement is smaller, though of course not very significantly. Sorting/indexing a number would normally be slightly faster than sorting a string, but the difference would be extremely small since the string is also very short.
I wouldn't expect a compelling performance difference between the two. If you're planning on storing IPv6 addresses, the issue will be that BSON (http://bsonspec.org/#/specification) doesn't have a simple data type for storing a 16-byte number, so it's not necessarily a natural fit to store as a number only.
In the end, I'd likely just use strings if you want to avoid translating from storage to screen, or if you want queries to be more natural for most of us to write :) :
db.ips.find({addr: "192.168.1.1"})
If using strings, I'd also suggest you consider storing them in a fixed-format string such as 192.168.001.001 if you want to do more complex searches, such as range searches. Since a string stored in a consistent fixed format will sort naturally, you can use it in more ways than you'd otherwise be able to. If ranges aren't important, it's not necessary to store it this way.
With a fixed format, you could do a query like:
db.ips.find({ addr: { $gte: "192.168.000.000",
                      $lte: "192.168.000.255" } })
That would find all IP addresses between (inclusive) 192.168.0.0 and 192.168.0.255.
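For illustration, here is one way to produce that fixed format before inserting or querying (a small Python sketch; the helper name pad_ip is just an example, not part of any MongoDB API):

def pad_ip(addr):
    # Zero-pad each octet of a dotted IPv4 address, e.g. '192.168.1.1' -> '192.168.001.001'
    return ".".join(octet.zfill(3) for octet in addr.split("."))

print(pad_ip("192.168.1.1"))   # 192.168.001.001
print(pad_ip("10.0.42.7"))     # 010.000.042.007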
Ideally, you'll have an index on the field either way:
db.ips.ensureIndex({ addr: 1 })

Related

Does adding an explicit column of SHA-256 hash for a CLOB field improve the search (exact match) performance on that CLOB field?

We have a requirement to implement a table (probably an Oracle db table or an MS SQL db table) as follows:
One column stores a string value; the length of this string value is highly variable, typically from several bytes to 500 megabytes (occasionally beyond 1 gigabyte).
Based on the above, we decided to use the CLOB type in the db (using the file system is not an option for us).
The table is very large, up to several million records.
One of the most frequent and important operations against this table is searching records by this CLOB column, and the search string needs to EXACTLY match the CLOB column value.
My question is: besides adding an index on the CLOB column, do we need any particular optimisation to improve the search performance?
One of my team members suggested adding an extra column holding the SHA-256 hash of the CLOB column and searching by this hash value instead of the CLOB column. In his opinion, the grounds for doing so are that hash values have a fixed length rather than a variable one, so indexing on them makes the search faster.
However, I don't think this makes a big difference, because if adding an explicit hash improved search performance, the database should be intelligent enough to do it on its own, likely storing such a hash value somewhere internally. Why should we developers bother to do it explicitly? On the other hand, a hash value can theoretically produce collisions, although they are rare.
The only benefit I can imagine is when the database client searches with a very large keyword: by hashing this large value down to a short one you reduce the data sent over the network, so the round trip is faster.
So, any database gurus, please shed some light on this question. Many thanks!
Regular indexes don't work on CLOB columns. Instead you would need to create an Oracle Text index, which is primarily for full text searching of key words/phrases, rather than full text matching.
In contrast, by computing a hash of the column data, you can create an index on the hash value, since it's short enough to fit in a standard VARCHAR2 or RAW column. Such a hash can significantly reduce your search space when trying to find exact matches.
Further, your concern over hash collisions, while not unfounded, can be mitigated. First off, hash collisions are relatively rare, and when they do occur, the documents are unlikely to be very similar, so a direct text comparison can be used when a collision is detected. Alternatively, because of the way hash functions work, where small changes to the original document result in significant changes in the hash value, and where the same change to different documents affects the hash value differently, you could compute a secondary hash of a subset (or superset) of the original text to act as a collision avoidance mechanism.
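To make that concrete, here is a minimal Python sketch of the hash-column idea (the in-memory dict stands in for the table; in Oracle the hash would live in an indexed RAW or VARCHAR2 column, and the names used here are purely illustrative):

import hashlib

def clob_hash(text):
    # Fixed-length digest of the (potentially huge) CLOB value
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Stand-in for the table: hash -> list of rows (a list because collisions,
# while rare, are possible and must be resolved by direct comparison).
table = {}

def insert(row_id, clob_value):
    table.setdefault(clob_hash(clob_value), []).append((row_id, clob_value))

def find_exact(needle):
    # 1. Narrow the search space via the indexed hash column.
    candidates = table.get(clob_hash(needle), [])
    # 2. Confirm with a direct comparison to rule out collisions.
    return [row_id for row_id, value in candidates if value == needle]

insert(1, "some very large document ...")
print(find_exact("some very large document ..."))   # [1]
print(find_exact("something else"))                 # []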

How to efficiently hash an IP address

This is an interview question. I thought about solutions like multiway hashing but could not find something elegant. Please suggest a good method.
Question:
You have 10 million IP addresses. (IPv4 4 byte addresses). Create a hash function for these IP addresses.
Hint: Using the IPs themselves as keys is a bad idea because there will be a lot of wasted space.
Interesting that such an interesting question did not get any interesting answers (sorry for the tautology).
If you see it as a theoretical matter, then this link is what you need (there is even a superfast hash function written for you and ready to go):
http://www.kfki.hu/~kadlec/sw/netfilter/ct3/
The practical matter may be different. If your hash table is of reasonable size, you will have to handle collisions anyway (with linked lists), so ask yourself what use case will actually occur. If your code will run within some secluded ecosystem, and the IP address is a.b.c.d, then c and d are the most volatile octets and d won't be zero (assuming you don't handle networks), so a hash table of 64K buckets, with cd as the hash, may well be satisfactory.
Another use case is TCP connection tracking, where a client uses an ephemeral port assigned randomly by the kernel (isn't that ideal for hashing?). The problem is the limited range, something like 32768-61000, which makes the least significant byte more random than the most significant byte. So you can XOR the most significant byte with the octet of the IP address that can be zero (c) and use it as a hash into your 64K table.
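A minimal sketch of the 64K-bucket idea above, assuming the a.b.c.d notation and using the two most volatile octets (c and d) as the 16-bit bucket index (the function name is just illustrative):

def bucket_16bit(ip):
    # Hash a dotted IPv4 address into a table of 65536 buckets using octets c and d
    a, b, c, d = (int(x) for x in ip.split("."))
    return (c << 8) | d          # 16-bit index: c is the high byte, d the low byte

print(bucket_16bit("192.168.3.17"))   # 3*256 + 17 = 785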
Because your input is random and the size of the table is smaller than the address space, any hash function you design will have its own pathological data set that makes it look bad. I think the interviewer wants to know about existing hash functions that are used as standards.
A few such hash functions are:
MD5
SHA-1, SHA-2
These functions work better than other hash functions because their pathological data sets are difficult to find without brute-force search. So if you have something as good as these, don't tell your interviewer (you can get a patent on it and a job at Google).
For hashing IP addresses, use MD5 or SHA on the address, truncate the result to the size of the table, and you are done.
Note: the size of the table must be prime to prevent bad hashing.
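A sketch of that recipe (MD5 the address, then reduce the digest to the table size; 65521 is just an example prime close to 64K):

import hashlib

TABLE_SIZE = 65521   # example prime table size

def ip_hash(ip):
    # MD5 the dotted address, then truncate/reduce the digest to a bucket index
    digest = hashlib.md5(ip.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

print(ip_hash("10.0.0.1"))
print(ip_hash("10.0.0.2"))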
I have had the same question before.
To solve this, you should divide your data.
We know IP addresses are sequential.
table1 from 0.0.0.0 to 0.0.0.127 (they are all in New York town1)
table2 from 0.0.0.128 to 0.0.0.255 (they are all in New York town2)
....
Then, create a map like this.
0.0.0.0~0.0.0.127 -> address1
0.0.0.128~0.0.0.255 -> address2
......
Then, to get the address for an IP, just look up the value in the map.
Note: all the data is in the database, and I don't think it costs a lot of space. To get the address within a second, you spend some extra space to optimize for speed.
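A small Python sketch of that range map, assuming the range boundaries are kept as sorted integers (here in memory; in practice they would sit in the database tables described above):

import bisect
from ipaddress import ip_address

# Sorted start of each range and the value (address/table) it maps to.
range_starts = [int(ip_address("0.0.0.0")), int(ip_address("0.0.0.128"))]
range_values = ["address1 (New York town1)", "address2 (New York town2)"]

def lookup(ip):
    # Find the range the IP falls into and return its mapped address
    i = bisect.bisect_right(range_starts, int(ip_address(ip))) - 1
    return range_values[i]

print(lookup("0.0.0.5"))     # address1 (New York town1)
print(lookup("0.0.0.200"))   # address2 (New York town2)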

Does field type matter in a MongoDB index?

Here are two examples of a document structure in MongoDB.
{
    UserId: "123",
    UserName: "Usain Bolt"
}
{
    UserId: NumberLong(123),
    UserName: "Usain Bolt"
}
If I were to create an index on UserID, would there be a difference in find query performance between the above two examples?
I don't have a specifically accurate answer on the matter, but here is my understanding:
Indexes
Indexed fields may be of any type, including (embedded) documents
Indexes are most likely hashed, regardless of the type, so they can be indexed the same way. If there were specific considerations to be observed for different types, the standard MongoDB docs would make that distinction.
The answer given to a similar question here refers to code samples from MongoDB, which suggest the index comparisons are pretty much the same regardless of type, with performance probably geared towards the size of the index.
One thing to note about strings versus numbers when it comes to indexes is that the sort order for numbers is generally well understood. However, when you use a string, lexicographical order is used instead, so you need to be aware of the differences there.
Also, to follow up on the size angle mentioned by jdi, a string can be arbitrarily long (though that can of course be controlled), whereas a long, stored as the BSON 64-bit integer type, will always be 8 bytes.
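To illustrate the lexicographic point, here is a quick Python comparison (not MongoDB-specific; range queries and sorts on string fields follow the second ordering):

ids_as_numbers = [2, 10, 9, 123]
ids_as_strings = ["2", "10", "9", "123"]

print(sorted(ids_as_numbers))   # [2, 9, 10, 123]          -- numeric order
print(sorted(ids_as_strings))   # ['10', '123', '2', '9']  -- lexicographic order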

What is the fastest way to store a huge amount of unique strings?

I wonder what the best way is to store a huge number of strings and check for duplicates.
We have to think about our priorities:
duplicate check speed
inserting new string time
storage space on hard disk
random access time
What is the best solution when our targets are fast duplicate checking and fast insertion of new strings (random access and storage space don't matter)?
I'm thinking about an SQL database, but which DB is best for this solution?
If we use an SQL DB, like MySQL, which storage engine will be best? (Of course, we have to exclude the MEMORY engine because of the amount of data.)
Use a hash function on the input string; the output hash would be the primary key/id of the record.
Then you can check if the DB has this hash/id/primary key:
If it doesn't: this is a new string; you add a new record including the string, with the hash as the id.
If it does: check that the string from the loaded record is the same as the input string.
if the string is the same: it is a duplicate
if the string is different: this is a collision. Use a collision resolution scheme to resolve. (A couple of examples below)
You will have to consider which hash function/scheme/strength to use based on speed and expected number of strings and hash collision requirements/guarantees.
A couple of ways to resolve collisions:
Use a 2nd hash function to come up with a new hash in the same table.
Mark the record (e.g. with NULL) and repeat with a stronger 2nd hash function (with wider domain) on a secondary "collision" table. On query, if the string is marked as collided (e.g. NULL) then do the lookup again in the collision table. You might also want to use dynamic perfect hashing to ensure that this second table does not have further collisions.
Of course, depending on how persistent this needs to be, how much memory you expect to take up, and how many strings there are, you could actually do this without a database, directly in memory, which would be a lot faster.
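A minimal in-memory Python sketch of that scheme (a dict stands in for the database table; the hash is the key, and a direct comparison guards against collisions):

import hashlib

db = {}   # hash (primary key) -> stored string

def insert_if_new(s):
    # Return True if s was new and has been inserted, False if it is a duplicate
    key = hashlib.sha256(s.encode("utf-8")).hexdigest()
    existing = db.get(key)
    if existing is None:
        db[key] = s            # new string
        return True
    if existing == s:
        return False           # exact duplicate
    # Different string, same hash: a collision; resolve e.g. with a secondary
    # hash / collision table as described above (omitted in this sketch).
    raise RuntimeError("hash collision - apply a collision resolution scheme")

print(insert_if_new("hello"))  # True
print(insert_if_new("hello"))  # False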
You may want to consider a NoSQL solution:
Redis. Some of the use cases solved using Redis:
http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html
http://dr-josiah.blogspot.com/2011/02/some-redis-use-cases.html
(Josiah L. Carlson is the author of Redis in Action)
http://www.paperplanes.de/2010/2/16/a_collection_of_redis_use_cases.html
memcached. Some comparisons between memcached and Redis:
http://www.quora.com/What-are-the-differences-between-memcached-and-redis
Is memcached a dinosaur in comparison to Redis?
http://coder.cl/2011/06/concurrency-in-redis-and-memcache/
Membase/Couchbase, which counts OMGPOP's Draw Something as one of its success stories. Comparison between Redis and Membase:
What is the major difference between Redis and Membase?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Some questions:
how large is the set of strings?
will the application be read heavy or write heavy? or both?
how often would you like data to be persisted to disk?
is there a N most recent strings requirement?
Hope this helps.
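For example, with Redis a set gives you the duplicate check almost for free. A sketch using the redis-py client (the connection details, the key name "strings", and hashing the strings first to keep set members small are all assumptions for illustration):

import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)

def is_new(s):
    # SADD returns 1 when the member was added (new), 0 when it was already present
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return r.sadd("strings", digest) == 1

print(is_new("hello"))   # True on first call
print(is_new("hello"))   # False afterwards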
Generate suffix trees to store the strings. Ukkonen's algorithm, as in http://www.daimi.au.dk/~mailund/slides/Ukkonen-2005.pdf, gives some insight into how to create a suffix tree. There are a number of ways to store this suffix tree, but once generated, the lookup time is very low.

How to match against a large collection efficiently

I have a large collection of objects of type foo. Each object of type foo has, say, 100 properties (all strings) plus an id. An object of type bar also has these 100 properties.
I want to find the matching object of type foo from the collection where all these properties match those of the bar.
Aside from the brute-force method, is there an elegant algorithm where we can calculate a signature for the foo objects once, do the same for the bar object, and match more efficiently?
The foos are in the thousands and the bars are in the millions.
Darth Vader has a point there... and I never thought that I'd be siding with the dark side!
I'll go over what I think are the best tools for the trade:
Embedded database: Google's LevelDB - it's faster than most database solutions out there.
Hashing function: Google's CityHash - it's fast and it offers excellent hashing!
JSON Serialization
The Embedded Database
The goal of using an embedded database is that you will get performance that will beat most database solutions that you're likely to encounter. We can talk about just how fast LevelDB is, but plenty of other people have already talked about it quite a bit so I won't waste time. The embedded database allows you to store key/value pairs and quickly find them in your database.
The Hashing Function
A good hashing function will be fast and it will provide a good distribution of non-repeatable hashes. CityHash is very fast and it has very good distribution, but again: I won't waste time since a lot of other people have already talked about the performance of CityHash. You would use the hashing function to hash your objects and then use the unique key to look them up in the database.
JSON Serialization
JSON Serialization is the antithesis of what I've shown above: it's very slow and it will diminish any performance gain you achieved with CityHash, but it gives you a very simple way to hash an entire object. You serialize the object to a JSON string, then you hash the string using CityHash. Despite the fact that you've lost the performance gains of CityHash because you spent so much time serializing the object to JSON, you will still reap the benefits of having a really good hashing function.
The Conclusion
You can store billions of records in LevelDB and you will be able to quickly retrieve the exact value you're looking for just by providing the hash for it.
In order to generate a key, you can use JSON serialization and CityHash to hash the JSON string.
Use the key to find the matching object!
Enjoy!
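A rough Python sketch of that pipeline (hashlib's SHA-256 and an in-memory dict stand in for CityHash and LevelDB, which you would swap in via their respective bindings; note that the JSON serialization has to be canonical, e.g. sorted keys, so that equal objects hash equally):

import hashlib
import json

store = {}   # stand-in for LevelDB: hash key -> foo id

def object_key(props):
    # Serialize the property dict deterministically, then hash the JSON string
    payload = json.dumps(props, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def index_foo(foo_id, props):
    store[object_key(props)] = foo_id

def match_bar(props):
    # Return the id of the foo whose properties all match, or None
    return store.get(object_key(props))

index_foo("foo-42", {"p1": "a", "p2": "b"})   # imagine 100 properties here
print(match_bar({"p2": "b", "p1": "a"}))      # foo-42 (key order does not matter)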
If ALL the properties match, that means they are actually the same objects. Is that correct?
In any case, you want to use a Map/Dictionary/Table with a good hashing algorithm to find matching objects.
Whichever language you are using, you should override the gethashcode and equals methods to implement it.
If you have a good hashing algorithm, your access time will be O(1); otherwise it can be up to O(n).
Given your memory limitation, you want to store the foos in the map; storing the bars might require lots of space, which you might not have.
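In Python, the analogue of overriding gethashcode/equals is implementing __hash__ and __eq__, which lets you drop the foos into a dict keyed by their properties (a sketch with two properties standing in for the 100):

class Foo:
    def __init__(self, obj_id, props):
        self.id = obj_id
        self.props = tuple(sorted(props.items()))   # the string properties

    def __hash__(self):                              # "gethashcode"
        return hash(self.props)

    def __eq__(self, other):                         # "equals"
        return self.props == other.props

foos = {Foo(1, {"a": "x", "b": "y"}): 1,
        Foo(2, {"a": "x", "b": "z"}): 2}

bar = Foo(None, {"b": "y", "a": "x"})                # a bar with the same properties
print(foos.get(bar))                                 # 1 -> O(1) lookup on average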
Hashing is very nice and simple to implement, but I want to suggest this algorithm:
Map your 100 string properties to one big string (for example, concatenate them with a fixed length for each property); that should be a unique id for the object. So we have 1000 strings in the first set and 1 million strings in the second.
The problem reduces to finding, for each string in the second set, whether the first set contains it.
Build a trie data structure on the first set.
The complexity of checking whether a string S is in the trie is O(|S|), where |S| is the length of S.
So the complexity of the algorithm is O(Sum(|Ai|) + Sum(|Bi|)) = O(max(Sum(|Ai|), Sum(|Bi|))) = O(Sum(|Bi|)) for your problem, where Ai is the unique id string for the first set and Bi for the second set.
UPDATE:
The trie takes O(Sum(|Ai|) * |Alphabet|) space in the worst case.
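A compact Python sketch of that trie approach (a separator character is used when concatenating the properties so different property splits cannot collide; the class and function names are just illustrative):

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$end"] = True                 # mark a complete key

    def contains(self, s):                  # O(|S|)
        node = self.root
        for ch in s:
            node = node.get(ch)
            if node is None:
                return False
        return "$end" in node

def key_of(props):
    # Concatenate the properties into one unique id string
    return "\x1f".join(props)               # unit separator keeps fields unambiguous

trie = Trie()
trie.insert(key_of(["a", "bb", "c"]))            # index the foos
print(trie.contains(key_of(["a", "bb", "c"])))   # True  -> a matching foo exists
print(trie.contains(key_of(["a", "b", "bc"])))   # False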
