I'm wondering whether it would be a good idea to use hashes (CityHash, Murmur and the like) as keys in a key-value store like Hazelcast. I'm expecting to have about 2,000,000,000 records (URLs) in the database, so collisions could happen. It wouldn't be super critical to lose some data through hash collisions, but of course it would be best to avoid them.
A record contains the URL, a time stamp, and a status code. The main operations are inserting and looking up whether a URL already exists.
So, what would you suggest, given that speed is relevant:
using an ID generator, or
using a hash algorithm like CityHash or Murmur, or
using the relevant string itself (a URL in this case)?
Hazelcast does not rely on the hashCode/equals methods of the key object; instead, it uses the Murmur hash of the binary representation of the key.
In short, you should not really worry about hash collisions.
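In practice that means you can use the URL string itself as the key and let Hazelcast hash it internally. A minimal sketch, assuming a Java member/client and a map named "urls" (the map name and the UrlRecord value class are placeholders, not part of Hazelcast's API; imports shown are for Hazelcast 3.x, where IMap lives in com.hazelcast.core):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    import java.io.Serializable;

    public class UrlStore {

        // Value object with the per-URL metadata; must be serializable to be stored in Hazelcast.
        static class UrlRecord implements Serializable {
            final long timestamp;
            final int statusCode;
            UrlRecord(long timestamp, int statusCode) {
                this.timestamp = timestamp;
                this.statusCode = statusCode;
            }
        }

        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, UrlRecord> urls = hz.getMap("urls");

            String url = "http://example.com/some/page";

            // Insert only if absent; Hazelcast hashes the serialized key itself.
            urls.putIfAbsent(url, new UrlRecord(System.currentTimeMillis(), 200));

            // Existence check.
            boolean seen = urls.containsKey(url);
            System.out.println(url + " seen before: " + seen);
        }
    }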
I need to store intermediary calculation results in memory. The elements must be quick to add and remove by their ID field, and the container must support a quick lookup of the element with the biggest value in its quality field.
Internally I would implement it using bindings to boost::multi_index, but I would prefer to use a native solution.
An implementation would most likely require sorted/hash indices that translate the specified external fields into some internal object ID, plus one extra hash map holding the actual objects, accessible by that internal object ID.
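A rough sketch of that layout in Java, assuming each element exposes an id and a quality field (both names are assumptions for illustration): a hash map gives constant-time add/remove by ID, and a tree set ordered by quality (with the ID as a tie-breaker) gives logarithmic access to the element with the biggest quality.

    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeSet;

    class ResultStore {

        static class Result {
            final long id;
            final double quality;
            Result(long id, double quality) { this.id = id; this.quality = quality; }
        }

        // Primary index: ID -> object.
        private final Map<Long, Result> byId = new HashMap<>();

        // Secondary index: ordered by quality; the ID breaks ties so distinct objects never collide.
        private final TreeSet<Result> byQuality = new TreeSet<>(
                Comparator.comparingDouble((Result r) -> r.quality)
                          .thenComparingLong(r -> r.id));

        void add(Result r) {
            Result old = byId.put(r.id, r);
            if (old != null) byQuality.remove(old);   // keep both indices consistent
            byQuality.add(r);
        }

        void removeById(long id) {
            Result old = byId.remove(id);
            if (old != null) byQuality.remove(old);
        }

        Result best() {                               // element with the biggest quality
            return byQuality.isEmpty() ? null : byQuality.last();
        }
    }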
In a real-time messaging application, I want to check whether an incoming message is unique. For this purpose, I am planning to insert a hash of the incoming message as a unique key in the database and check whether I get a unique key exception (ORA-00001 in Oracle).
Is this an efficient approach, or is there a better way to handle this case?
For those who want to know: the program is written in Java, and we use Oracle as the database.
If you're trying to get around the performance problem of uniqueness tests on very large strings, then this is a decent way of achieving it, yes.
You might need a way to deal with hash collisions, though, as the presence of a unique key would prevent different messages having the same hash from loading. One way would be to check for existing matching hashes and do a comparison test against the full text of the message. It would keep your index size down, as you'd index on the hash rather than the message text, but it would not be completely foolproof, as two identical messages could be loaded by different sessions if the timing was exactly right (or wrong, depending on your perspective).
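A sketch of the insert-and-catch variant in Java over JDBC; the message_hashes table, its columns, and SHA-256 as the hash function are assumptions for illustration, not the poster's actual schema:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class MessageDeduplicator {

        // Returns true if the message was new, false if a message with the same hash already exists.
        static boolean insertIfUnique(Connection con, String message) throws Exception {
            // Hash the message text (SHA-256 here; any stable hash works).
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hash = new StringBuilder();
            for (byte b : md.digest(message.getBytes(StandardCharsets.UTF_8))) {
                hash.append(String.format("%02x", b));
            }

            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO message_hashes (msg_hash, msg_text) VALUES (?, ?)")) {
                ps.setString(1, hash.toString());
                ps.setString(2, message);
                ps.executeUpdate();
                return true;                               // unique so far
            } catch (SQLException e) {
                if (e.getErrorCode() == 1) {               // ORA-00001: unique constraint violated
                    // A row with the same hash exists; to rule out a hash collision,
                    // compare its stored msg_text against the incoming message here.
                    return false;
                }
                throw e;
            }
        }
    }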
We have a huge Redis database containing about 100 million keys, which map phone numbers to hashes of data.
Once in a while all this data needs to be aggregated and saved to an SQL database. During aggregation we need to iterate over all the stored keys and take a look at those hashes.
Using Redis.keys is not a good option because it will retrieve and store the whole list of keys in memory, and it takes a loooong time to complete. We need something that will give back an enumerator that can be used to iterate over all the keys, like so:
redis.keys_each { |k| agg(k, redis.hgetall(k)) }
Is this even possible with Redis?
This would prevent Ruby from constructing an array of 100 million elements in memory, and would probably be way faster. Profiling shows us that using the Redis.keys command makes Ruby hog the CPU at 100%, but the Redis process seems to be idle.
I know that using keys is discouraged in favor of building a set of the keys, but even if we construct a set of the keys and retrieve it using smembers, we'll have the same problem.
Incremental enumeration of all the keys is not possible with the current Redis version.
Instead of trying to extract all the keys of a live Redis instance, you could just dump the database (bgsave) and convert the resulting dump to a JSON file, to be processed with any Ruby tool you want.
See https://github.com/sripathikrishnan/redis-rdb-tools
Alternatively, you can use the redis-rdb-tools API to write a parser in Python directly and extract the required data (without generating a JSON file).
I have an otherwise perfectly relational data schema in place for my Postgres 8.4 DB, but I need the ability to associate arbitrary key/value pairs with several of my tables, with the assigned keys varying by row. Key/value pairs are user-generated, so I have no way of predicting them ahead of time or wrangling orderly schema changes.
I have the following requirements:
Key/value pairs will be read often, written occasionally. Reads must be reasonably fast.
No (present) need to query off of the keys or values. (But it might come in handy some day.)
I see the following possible solutions:
The Entity-Attribute-Value pattern/antipattern. Annoying, but the annoyance would be generally offset by my ORM.
Storing key/value pairs as serialized JSON data in a text column. A simple solution, and again the ORM comes in handy, but I can kiss my future self's need for queries good-bye.
Storing key/value pairs in some other NoSQL db--probably a key/value or document store. ORM is no help here. I'll have to manage the separate queries (and looming data integrity issues?) myself.
I'm concerned about query performance, as I hope to have a lot of these some day. I'm also concerned about programmer performance, as I have to build, maintain, and use the darned thing. Is there an obvious best approach here? Or something I've missed?
That's precisely what the hstore datatype is for in PostgreSQL.
http://www.postgresql.org/docs/current/static/hstore.html
It's really fast (you can index it) and quite easy to handle. The only drawback is that you can only store character data, but you'd have that problem with the other solutions as well.
Indexes support the "exists" operator, so you can query quite quickly for rows where a certain key is present, or for rows where a specific attribute has a specific value.
And with 9.0 it got even better because some size restrictions were lifted.
hstore is generally a good solution for that, but personally I prefer plain key:value tables: one table with definitions, another table with values, a relation binding each value to its definition, and a relation binding each value to a particular record in the other table.
Why am I against hstore? Because it's like the registry pattern, which is often cited as an anti-pattern. You can put anything in there, it's hard to validate whether it's still needed, and when loading a whole row (especially through an ORM), the whole hstore is loaded, which can contain a lot of junk and very little of use. Not to mention that the hstore data type has to be converted into your language's type and converted back again when saved, so you get some type-conversion overhead.
So I'm actually trying to convert all the hstores at the company I work for into simple key:value tables. It's not that hard a task, though. The structures kept in hstore here are huge (or at least big), and reading/writing such an object creates a huge overhead of function calls, so even a simple query like "select * from base_product where id = 1;" makes the server sweat and hurts performance badly. I want to point out that the performance issue is not caused by the database, but by Python having to convert the results received from Postgres several times, whereas key:value tables require no such conversion.
As you do not control the data, do not try to overcomplicate this.
create table sometable_attributes (
  sometable_id int not null references sometable(sometable_id),
  attribute_key varchar(50) not null check (length(attribute_key) > 0),
  attribute_value varchar(5000) not null,
  primary key (sometable_id, attribute_key)
);
This is like EAV, but without an attribute_keys table, which adds no value if you do not control what will be stored there.
For speed, you should periodically run "cluster sometable_attributes using sometable_attributes_idx", so that all attributes for one row are physically close together.
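For completeness, a sketch of reading one row's attributes back into a map over JDBC; any language with a Postgres driver would look much the same, and the helper class below is illustrative only:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    class AttributeLoader {

        // Load all key/value attributes of one sometable row into a plain map.
        static Map<String, String> loadAttributes(Connection con, int sometableId) throws SQLException {
            Map<String, String> attrs = new HashMap<>();
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT attribute_key, attribute_value FROM sometable_attributes WHERE sometable_id = ?")) {
                ps.setInt(1, sometableId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        attrs.put(rs.getString("attribute_key"), rs.getString("attribute_value"));
                    }
                }
            }
            return attrs;
        }
    }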
I'm using NSValueTransformers to encrypt attributes (strings, dates, etc.) in my Core Data model, but I'm pretty sure it's interfering with the sorting in my NSFetchedResultsController.
Does anyone know if there's a way to get around this? I suppose it depends on how the sort is performed; if it's always only performed directly on the database, then I'm probably out of luck. If it sorts on the objects themselves, then perhaps there's a way to activate the transformation before the sort occurs.
I'm guessing it's directly on the database, though, since the sort would be key in grabbing subsets of the collection, which is the main benefit of NSFetchedResultsController anyway.
Note: I should add that there's some strange behavior here... the collection doesn't sort in the first session (the session where the objects are created), but it does sort in subsequent sessions (where the objects already exist and are just being retrieved). So perhaps sorting does work with transformables, but maybe there is a caveat in that they have to be saved first, or something like that (?)
If you are sorting within the NSFetchedResultsController, then the sort is performed against the store (i.e., the database). However, you can perform a "secondary" sort against the results once they are in memory, and therefore decrypted, by calling -sortedArrayUsingDescriptors:
Update:
I believe your inconsistent behavior is probably based on what is already in memory vs. what is being read directly from disk.