Redis - Seeking Data Modeling Suggestions - sorting

I'm using Redis to store data logs from many analog sensors. My goal is to sort the data by log timestamp and extract data from a specific datetime range. My original data model used the sensor name as the key, with a hash field for each timestamp and the reading as the field's value.
So, if I have SensorA, SensorB and SensorC, KEYS * would return 1. SensorA, 2. SensorB and 3. SensorC, and HGET SensorB 20110111172900 would return, let's say, 25.
The problem with the current model is that it doesn't allow sorting on the timestamp, or so I think, since everything I've tried has failed.
Could someone suggest a data model that would allow sorting and extracting ranges of data, or suggest the proper SORT arguments that would allow this with the data model above?

A sorted set is probably a better fit than a hash in this case.
The member would be a combination of timestamp and sensor value, and the score would be the timestamp. Use ZRANGEBYSCORE to retrieve the values. Both read and write go from O(1) to O(log N), but you gain the ability to return a range of values.
You could also use a list to get O(1) insertion. Reading would be O(N) for retrieving a specific entry, but getting the most recent entries would be O(1).
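The sorted-set approach can be sketched in memory to make the range semantics concrete. This is a minimal illustration, not real Redis client code; the comments show the corresponding Redis commands, and the key name `sensor:B` is just an example:

```python
import bisect

# In-memory sketch of the Redis sorted-set model: one sorted set per
# sensor, score = timestamp, member = "timestamp:value" so members
# stay unique even if two readings share a value.
class SensorLog:
    def __init__(self):
        self.entries = []  # sorted list of (score, member), like a ZSET

    def zadd(self, timestamp, value):
        # ZADD sensor:B 20110111172900 "20110111172900:25"
        member = f"{timestamp}:{value}"
        bisect.insort(self.entries, (timestamp, member))

    def zrangebyscore(self, lo, hi):
        # ZRANGEBYSCORE sensor:B <lo> <hi>
        left = bisect.bisect_left(self.entries, (lo,))
        right = bisect.bisect_right(self.entries, (hi, "\xff"))
        return [member for _, member in self.entries[left:right]]

log = SensorLog()
log.zadd(20110111172900, 25)
log.zadd(20110111173000, 26)
log.zadd(20110111173100, 24)

# All readings between 17:29:00 and 17:30:00 on 2011-01-11:
print(log.zrangebyscore(20110111172900, 20110111173000))
# -> ['20110111172900:25', '20110111173000:26']
```

Because members embed the timestamp, parsing the value back out is a simple split on `:`.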

Related

Number of segments that cover a point

I'm designing a web service that calculates the number of online users of an arbitrary system.
The input data is the array of tuples (user_id, log_in_time, log_out_time). The service should index this data somehow and prepare data structures in order to efficiently answer the requests of the form: "How many users were online at every time point in (start_time, end_time)?". The response of the service is an array -- number of online users for each time point in the requested interval.
Complication: each user has a set of characteristics (i.e. age, gender, city). Is it possible to efficiently answer the request of the form: "How many users with age=x, city=y, gender=z were online at every time point in (start_time, end_time)?"
The time is an integer (timestamp).
I'm not going to answer this question fully because it is clearly a homework assignment, though you didn't declare it as such.
Assuming the time windows are small or the number of simultaneous online users within that window is small, simply solve the first problem, then filter by your demographic criteria.
If the number of simultaneous online users is large and filtering after the fact is too time consuming, then use something similar to a boost::multi_index to filter on the most sparse dimension first, then do your time range query.
Additionally, most relational databases will do these types of queries out of the box, so the simplest solution would be to store your data in a database with proper indexes and then create the very straightforward query.
Since your comment said that you didn't understand how to use a B-tree to do a range query, I'll explain it in my answer. You use a B-tree to look up the minimum of your time range query. The way a B-tree is structured is that successive leaves are adjacent to one another. You first do a logarithmic lookup on the minimum range query bound. This finds you the first point within that time range. Then, you do a linear scan from the starting point to the point where you exceed your maximum bound for your range query.
This means using a B-tree makes your query O(log(number_of_online_users) + length_of_time_interval).
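The first problem (before any demographic filtering) can be sketched with two sorted arrays and binary search instead of a full B-tree; the idea is the same, logarithmic lookup into sorted time data. This example assumes a logout instant still counts as online; flip `bisect_left`/`bisect_right` to change that convention:

```python
import bisect

# (user_id, log_in_time, log_out_time) tuples from the question
sessions = [(1, 10, 20), (2, 12, 18), (3, 15, 25)]

# Index step: two sorted arrays of event times.
logins = sorted(t_in for _, t_in, _ in sessions)
logouts = sorted(t_out for _, _, t_out in sessions)

def online_at(t):
    # users logged in at or before t, minus users logged out strictly before t
    return bisect.bisect_right(logins, t) - bisect.bisect_left(logouts, t)

def online_range(start, end):
    # one count per time point in the requested interval
    return [online_at(t) for t in range(start, end + 1)]

print(online_range(14, 19))  # -> [2, 3, 3, 3, 3, 2]
```

Each point costs two O(log n) searches, so a query over an interval of length L is O(L log n) after an O(n log n) indexing step.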

What data structure should i use for caching ordered entities

I need to store some items in a cache, like chat messages. I also need to slice these items by a key range. For example (back to chat messages), the most common operation on the cache will be getting the chat messages between a begin date and an end date.
What data structure should I be considering? I was thinking about a simple array, but lookups would be O(n). Is there any data structure that will work faster?
You can use a self-balancing binary search tree like a Red-Black tree, which stores the entries in order and provides insert, delete and search in O(log n) in both the average and worst case.
So when you need the chat messages in a date interval, you can search the RB-tree for that range, since the entries are already ordered.
Use an associative set: store your data in an array<data>, but use a hash table to store pair<data, arrayIndex>. This way you can search, insert and delete in O(1).
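Python's standard library has no Red-Black tree, but the ordered range query can be sketched with a sorted list plus binary search (O(log n) lookup; inserts here are O(n), which a balanced tree would bring down to O(log n)). The timestamps and messages are made up for illustration:

```python
import bisect

# Ordered cache of (timestamp, message), kept sorted by timestamp.
messages = []

def add_message(ts, text):
    bisect.insort(messages, (ts, text))  # O(n) here; O(log n) in a balanced BST

def messages_between(start, end):
    # binary-search both ends of the date range, then slice
    lo = bisect.bisect_left(messages, (start, ""))
    hi = bisect.bisect_right(messages, (end, "\uffff"))
    return [text for _, text in messages[lo:hi]]

add_message(100, "hi")
add_message(300, "later")
add_message(200, "hello")

print(messages_between(100, 200))  # -> ['hi', 'hello']
```

Note the trade-off against the hash-table answer above: a hash gives O(1) point lookups but cannot answer range queries, because it does not keep keys in order.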

Data Structure, independent of volume of data in it

Is there any data structure in which locating an item is independent of the volume of data in it?
"Locating an item is independent of the volume of data in it" - I assume this means O(1) for get operations. That would be a hash map.
This presumes that you fetch the object based on the hash.
If you have to check each element to see if an attribute matches a particular value, like your rson or ern or any other parts of it, then you have to make that value the key up front.
If you have several values that you need to search on - all of them must be unique and immutable - you can create several maps, one for each value. That lets you search on more than one. But they all have to be unique, immutable, and known up front.
If you don't establish the key up front it's O(N), which means you have to check every element in turn until you find what you want. On average, this time will increase as the size of the collection grows. That's what O(N) means.
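The several-maps idea can be sketched like this. The record fields (`id`, `email`) are hypothetical stand-ins for whatever unique attributes the real data has:

```python
# Two records with two unique, immutable attributes each.
records = [
    {"id": 1, "email": "a@example.com", "name": "Ann"},
    {"id": 2, "email": "b@example.com", "name": "Bob"},
]

# One map per searchable attribute; both maps reference the same records,
# so the extra cost is one dict entry per record per key, not a full copy.
by_id = {r["id"]: r for r in records}
by_email = {r["email"]: r for r in records}

print(by_id[2]["name"])                    # O(1) lookup by id -> Bob
print(by_email["a@example.com"]["name"])   # O(1) lookup by email -> Ann
```

Looking up by any attribute that was *not* made a key still degenerates to an O(N) scan over `records`.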

What is the fastest way to store huge amount of unique strings?

I wonder what the best way is to store a huge number of strings and check for duplicates.
We have to think about our priority:
duplicate check speed
inserting new string time
storage space on hard disk
random access time
What is the best solution when our targets are fast duplicate checking and fast insertion of new strings (random access and storage space don't matter)?
I'm thinking about an SQL database, but which DB is best for this?
If we use an SQL DB like MySQL, which storage engine would be best? (Of course, we have to exclude MEMORY because of the data volume.)
Use a hash function on the input string. The output hash would be the primary key/id of the record.
Then you can check whether the DB already has this hash/id/primary key:
If it doesn't: this is a new string; add a new record containing the string, with the hash as the id.
If it does: check whether the string in the stored record is the same as the input string.
If the string is the same: it is a duplicate.
If the string is different: this is a collision. Use a collision resolution scheme to resolve it. (A couple of examples below.)
You will have to consider which hash function/scheme/strength to use based on speed and expected number of strings and hash collision requirements/guarantees.
A couple of ways to resolve collisions:
Use a 2nd hash function to come up with a new hash in the same table.
Mark the record (e.g. with NULL) and repeat with a stronger 2nd hash function (with wider domain) on a secondary "collision" table. On query, if the string is marked as collided (e.g. NULL) then do the lookup again in the collision table. You might also want to use dynamic perfect hashing to ensure that this second table does not have further collisions.
Of course, depending on how persistent this needs to be and how much memory you are expecting to take up/number of strings, you could actually do this without a database, directly in memory which would be a lot faster.
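The in-memory variant of this scheme can be sketched in a few lines. SHA-256 is an assumption here, chosen so that accidental collisions are astronomically unlikely; a faster, weaker hash would need the fallback collision table described above:

```python
import hashlib

# hash digest -> original string; keeping the string lets us verify
# matches and detect genuine collisions.
store = {}

def insert(s):
    """Return True if s is new, False if it is a duplicate."""
    key = hashlib.sha256(s.encode("utf-8")).hexdigest()
    existing = store.get(key)
    if existing is None:
        store[key] = s      # new string: add the record
        return True
    if existing == s:
        return False        # same string: duplicate
    # Different string, same hash: a collision. A real system would
    # fall back to a second hash function or a secondary table here.
    raise RuntimeError("hash collision")

print(insert("hello"))  # True  (new)
print(insert("hello"))  # False (duplicate)
print(insert("world"))  # True
```

In a database, the same logic becomes: make the hash the indexed primary key, look it up, and compare the stored string on a hit.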
You may want to consider a NoSQL solution:
Redis. Some of the use cases solved using Redis:
http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html
http://dr-josiah.blogspot.com/2011/02/some-redis-use-cases.html
(Josiah L. Carlson is the author of Redis in Action)
http://www.paperplanes.de/2010/2/16/a_collection_of_redis_use_cases.html
memcached. Some comparisons between memcached and Redis:
http://www.quora.com/What-are-the-differences-between-memcached-and-redis
Is memcached a dinosaur in comparison to Redis?
http://coder.cl/2011/06/concurrency-in-redis-and-memcache/
Membase/Couchbase, which counts OMGPOP's Draw Something as one of its success stories. Comparison between Redis and Membase:
What is the major difference between Redis and Membase?
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
Some questions:
how large is the set of strings?
will the application be read heavy or write heavy? or both?
how often would you like data to be persisted to disk?
is there a N most recent strings requirement?
Hope this helps.
Generate suffix trees to store the strings. Ukkonen's algorithm, as in http://www.daimi.au.dk/~mailund/slides/Ukkonen-2005.pdf, gives some insight into how to create a suffix tree. There are a number of ways to store the suffix tree, but once it is generated, the lookup time is very low.

Best data structure for a given set of operations - Add, Retrieve Min/Max and Retrieve a specific object

I am looking for the optimal (time and space) data structure for supporting the following operations:
Add Persons (name, age) to a global data store of persons
Fetch Person with minimum and maximum age
Search for Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects minPerson and maxPerson for Person with min and max age. Update this if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), I think it may not be the best way if there are many collisions in the hash. Also, addition of a Person would mean an overhead of adding to the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings that's provided by your programming language of choice is going to do pretty well out of the box.
It looks like that you need a data structure that needs fast inserts and that also supports fast queries on 2 different keys (name and age).
I would suggest keeping two data structures, one a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, the other a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree would provide O(log n) inserts and max/min queries, while the hashtable would give us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
It sounds like you're expecting the name to be the unique identifier; otherwise your operation 3 is ambiguous (what is the correct result if you have two entries for John Smith?).
Assuming that the uniqueness of a name is guaranteed, I would go with a plain hashtable keyed by names. Operation 1 and 3 are trivial to execute. Operation 2 could be done in O(N) time if you want to search through the data structure manually, or you can do like you suggest and keep track of the min/max and update it as you add/delete entries in the hash table.
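The hashtable-plus-tracked-extremes approach can be sketched like this (names and ages are made up). Note it only supports adds; deleting the current min or max would force an O(N) rescan, which is where the BST answer above earns its O(log n):

```python
# Dict keyed by unique name, with min/max age maintained on insert.
class PersonStore:
    def __init__(self):
        self.ages = {}        # name -> age
        self.youngest = None  # (age, name)
        self.oldest = None    # (age, name)

    def add(self, name, age):
        self.ages[name] = age                       # operations 1 and 3: O(1)
        if self.youngest is None or age < self.youngest[0]:
            self.youngest = (age, name)             # operation 2: O(1) per insert
        if self.oldest is None or age > self.oldest[0]:
            self.oldest = (age, name)

    def age_of(self, name):
        return self.ages[name]

store = PersonStore()
store.add("Ann", 34)
store.add("Bob", 27)
store.add("Cid", 41)

print(store.age_of("Bob"), store.youngest, store.oldest)
# -> 27 (27, 'Bob') (41, 'Cid')
```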
