Sort by key in Cassandra - sorting

Let's assume I have a keyspace with a column family that stores user objects and the key of these objects is the username.
How can I use Hector to get a list of users sorted by username?
I tried to use a RangeSlicesQuery, paging works fine with this query, but the results are not sorted in any way.
I'm an absolute Cassandra beginner, can anyone point me to a simple example that shows how to sort a column family by key? Please ask if you need more details on my efforts.
Edit:
The result was not sorted because I used the default RandomPartitioner instead of the OrderPreservingPartitioner in cassandra.yaml.
Probably it's better not to rely on the sorting by key but to use a secondary index.

Quoting Cassandra - The Definitive Guide
Column names are stored in sorted order according to the value of compare_with. Rows,
on the other hand, are stored in an order defined by the partitioner (for example,
with RandomPartitioner, they are in random order, etc.)
I guess you are using RandomPartitioner which
... return data in an essentially random order.
You should probably use OrderPreservingPartitioner (OPP) where
Rows are therefore stored
by key order, aligning the physical structure of the data with your sort order.
Be aware of the inefficiency of OPP.
(edit on Mar 07, 2014)
Important:
This answer is very old now.
The partitioner is a system-wide setting that you set in cassandra.yaml. See this doc. Again, OPP is highly discouraged. This document is for version 1.1, where you can see that OPP is already deprecated, and it has likely been removed from the latest version. If you do want to use OPP, you may want to revisit the architecture.

Or create a row called "meta:userNames" in the same column family and store all user names in it as a lookup hash. Something like this:
Users {
key: "meta:userNames" {david:david, paolo:paolo, victor:victor},
key: "paolo" {password:"*****", locale:"it_it"},
key: "david" {password:"*****", locale:"en_us"},
key: "victor" {password:"*****", locale:"en_uk"}
}
First query the columns of the meta:userNames row (which are sorted) and use them to get the user rows. Don't try to get everything with a single query as you would in an SQL database. Use Cassandra as a huge hash map that provides rapid random access to its data.
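The two-step lookup described above can be sketched with a small in-memory model. This is not Hector or Cassandra code: `Row`, `add_user` and `users_sorted` are hypothetical names, and the only property borrowed from Cassandra is that column names within a row are kept in sorted order while row order is left undefined.

```python
import bisect


class Row:
    """Toy stand-in for a Cassandra row: column names are kept sorted."""

    def __init__(self):
        self._names = []
        self._values = {}

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)  # keep column names ordered
        self._values[name] = value

    def slice(self, start="", count=100):
        """Return up to `count` (name, value) pairs starting at `start`."""
        i = bisect.bisect_left(self._names, start)
        return [(n, self._values[n]) for n in self._names[i:i + count]]


users = {}  # row key -> Row; row order is irrelevant, as with RandomPartitioner


def add_user(username, **columns):
    row = users.setdefault(username, Row())
    for name, value in columns.items():
        row.put(name, value)
    # Maintain the lookup row: one column per username.
    users.setdefault("meta:userNames", Row()).put(username, username)


def users_sorted(start="", count=100):
    # Step 1: slice the sorted column names of the meta row.
    names = [n for n, _ in users["meta:userNames"].slice(start, count)]
    # Step 2: fetch each user row by key.
    return [(n, dict(users[n].slice())) for n in names]


add_user("paolo", password="*****", locale="it_it")
add_user("david", password="*****", locale="en_us")
add_user("victor", password="*****", locale="en_uk")
print([name for name, _ in users_sorted()])
```

Paging falls out of the same slice: pass the last username of the previous page as `start` for the next one.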

Related

Redis : Get all keys by providing one of the value in the values list

In redis I'm planning to store key as a unique string and value will be a list.
I have a use case where I need to do 2 things.
First, I need to get all the values associated with a key by providing the key as input.
Second, I want to get all the keys associated with a value by providing one of the values in the values list.
The second part is where I need advice: how can we achieve this?
I cannot get all the keys or key value pair and loop through because I will have millions of entries in Redis.
As mentioned in the comment above, retrieving all keys associated with a value will probably create a performance issue, because it means running through a large number of entries. As the official documentation on retrieving data also suggests, you can try the following Redis commands to get the values and see whether they serve your purpose.
GET
MGET

In redis,how to get the keys descending order based on insertion order?

I am implementing a feed for a social network in which newly uploaded posts should be served first, and so on. I use hashes of posts as keys and posts as values. I need the posts in "newest first" order. How do I do it?
My idea is
Store the post and timestamp with hash as a key
Get all keys and timestamps
Sort the timestamps in descending order
Then use the respective keys to get the latest images
Question 1: But this approach is not good. How should I do it?
EDIT: Question 2: Please tell me what algorithm you use to serve a feed. If the feed is common to all users and based on "newest first", how do I implement it?
This is my first time working on a backend. Please excuse me if the question is dumb.
Thanks.
Here are three options for you:
Use a sorted set, using the timestamp as score, and the post-hash as value. The post-hash is also the key in a hash where the actual posts are stored. Commands involved: ZADD, HSET, ZREVRANGEBYSCORE, HGET.
Use a sorted set, using the timestamp as score, and the post with metadata as value. Make sure "post with metadata" is unique, you can include the timestamp and user to achieve this. This will have better performance, but makes it a bit harder if you have to find a specific post. Commands involved: ZADD, ZREVRANGEBYSCORE, ZRANGEBYSCORE.
Use Redis Streams. If you want a uniform insert order independent of client time, Redis can set the timestamp for you. However, stream entries cannot be modified, so either users cannot edit posts, or whenever they edit the post is brought up as new. Commands involved: XADD, XREVRANGE, XDEL.
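A minimal sketch of the first option, assuming `r` is a redis-py style client created with `decode_responses=True` so that strings come back as `str`. The key names `feed:global` and `posts` and the helpers `publish` and `newest` are hypothetical; only the commands listed above (ZADD, HSET, ZREVRANGEBYSCORE, HGET) are used.

```python
import time

FEED_KEY = "feed:global"  # hypothetical key of the sorted set (score = timestamp)
POSTS_KEY = "posts"       # hypothetical key of the hash holding post bodies


def publish(r, post_hash, body, ts=None):
    """Store the post body and index its hash by timestamp."""
    ts = time.time() if ts is None else ts
    r.hset(POSTS_KEY, post_hash, body)   # HSET posts <hash> <body>
    r.zadd(FEED_KEY, {post_hash: ts})    # ZADD feed:global <ts> <hash>


def newest(r, count=10):
    """Return up to `count` (hash, body) pairs, newest first."""
    # ZREVRANGEBYSCORE feed:global +inf -inf LIMIT 0 <count>
    hashes = r.zrevrangebyscore(FEED_KEY, "+inf", "-inf", start=0, num=count)
    # One HGET per post keeps the sketch simple.
    return [(h, r.hget(POSTS_KEY, h)) for h in hashes]
```

To page further back, the score of the last post returned could be passed as the upper bound of the next call instead of `+inf`.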
See:
Redis Commands
Introduction to Redis Streams

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on; how many "index" items (or filters if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CF just to get basic data, I am curious to know if it's a good idea to include these above "filters" as columns (compound column segment)?
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
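The counter idea can be sketched as a plain in-memory model. This is not Cassandra code: `record` and `total` are hypothetical helpers, and the `(day, metric, value)` tuple only stands in for a counter keyed by date and parameter, with the day written as YYYYMMDD as suggested above.

```python
from collections import Counter
from datetime import datetime, timezone

counters = Counter()  # (day, metric, value) -> number of visits


def record(event):
    """Increment one counter per metric as a log entry comes in."""
    day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y%m%d")
    for metric in ("section", "country", "source"):
        counters[(day, metric, event[metric])] += 1


def total(metric, value, days):
    """Sum a metric over a list of YYYYMMDD days (the date-range query)."""
    return sum(counters[(day, metric, value)] for day in days)


record({"ts": 0, "section": "news", "country": "USA", "source": "example.com"})
record({"ts": 86400, "section": "news", "country": "USA", "source": "example.com"})
print(total("section", "news", ["19700101", "19700102"]))
```

A date-range query then only reads one counter per day instead of scanning the stored JSON objects.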

Comment System using Redis Database System

I am trying to build a comment system using Redis. I am currently using hashes to store the comment data, but the problem I am facing is that after 10 or 12 comments, the comments lose their order and start appearing randomly. Does anyone know which data type should be used for building a commenting system with Redis? Currently my hashes are of the form:
postid:comments commentid:userid "Testcomment"
Thanks, any help will be appreciated.
Hashes are set up for quick access by key rather than retrieval in order. If you need items in a particular order, try a list or sorted set.
The reason it appears to work at first is an optimization for small sets - when you only have a small number of items a list is the most efficient structure, so that is what redis uses internally. When you get more items, an actual hashmap is needed for efficient querying and redis rearranges the data so that it is ordered by hash rather than by insertion order.
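A minimal sketch of the list approach, assuming `r` is a redis-py style client created with `decode_responses=True`. The key layout `post:<id>:comments`, the `user_id:text` encoding and the two helper names are hypothetical; the only commands involved are RPUSH and LRANGE.

```python
def add_comment(r, post_id, user_id, text):
    """Append a comment; RPUSH adds to the tail, so insertion order is kept."""
    key = f"post:{post_id}:comments"          # hypothetical key layout
    return r.rpush(key, f"{user_id}:{text}")  # returns the new list length


def get_comments(r, post_id, start=0, stop=-1):
    """Return comments oldest first; LRANGE bounds are inclusive, -1 is the last item."""
    return r.lrange(f"post:{post_id}:comments", start, stop)
```

A sorted set with the timestamp as score would work similarly if comments need to be ordered by something other than arrival.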
With my web app, I am using a format like this.
(appname):(postid):(comment id) - The hash of the posts
(appname):(postid):count - The latest comment id
I then query the (appname):(postid):count key to get the number of times I should run a loop that gets the contents of each (appname):(postid):(comment id) hash.
Example Code
// Read the comment count, then fetch each comment hash in id order.
$c = $redis->get('(appname):(postid):count');
for ($i = 0; $i < $c; $i++) {
    var_dump($redis->hgetall('(appname):(postid):' . $i));
}

Enumerate indexes on a Extensible Storage Engine (ESENT) table

Background
I'm writing an adapter for ESE to .NET and LINQ in a Google Code project called eselinq. One important function I can't seem to figure out is how to get a list of indexes defined for a table. I need to be able to list available indexes so the LINQ part can automatically determine when indexes can be used. This will allow much more efficient plans for user queries if appropriate indexes can be found.
There are two related functions for querying index information:
JetGetTableIndexInfo - get index information by tableID
JetGetIndexInfo - get index information by tableName
These only differ in how the related table is specified (name or tableid). It sounds like these would support the function I want but all the info levels seem to require that I already have a certain index to query information for. The only exception is JET_IdxInfoCount, but that only counts how many indexes are present.
JET_IdxInfo with its JET_INDEXLIST sounds plausible but it only lists the columns on a specific index.
Alternatives
I am aware that I could get the index information another way, such as annotations on .NET types corresponding to database tables, or by requiring that an index mapping be provided ahead of time. I think there's enough introspection implemented to make everything else work out of the box without the user supplying extra information, except for this one function.
Another option may be to examine the system tables to find related index objects, but this would mean depending on an undocumented interface.
To satisfy this question, I want a supported method of enumerating the indexes (just the name would be sufficient) on a table.
You are correct about JetGetTableIndexInfo, JetGetIndexInfo and JET_IdxInfo. The twist is that the data is returned in a somewhat complex form: a temporary table is returned containing a row for the index and then a row for each column in the table. To get just the index names you will need to skip the column rows (the column count is given by the value of the columnidcColumn column in the first row).
For a .NET example of how to decipher this, look at the ManagedEsent project. In the MetaDataHelpers.cs file there is a method called GetIndexInfoFromIndexlist that extracts all the data from the temporary table.
