Can you sort a GET on a Cassandra column family by the Timestamp value created for each column entry, rather than the column Keys? - sorting

Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key. Which obviously provides sorting of a new thread quite easily, espically when say making a query of the latest 20 threads etc.
My problem is that when a new 'post' is made to a thread I want to be able to 'bump' that thread to the front of the 'thread line' which is where the problem comes in, how do I basically make this happen so I can still make queries that can still be selected in the right order without providing any kind of duplicates etc.
The only way I can see this working is if rather than a column family sorting via a TimeUUID I need the column family to sort via the insertion Timestamp, therefore I can use the unique thread IDs for column keys and retrieve these in the order they are inserted or reinserted rather than by TimeUUID? Is this possible or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparitor or otherwise it defaults to bytes?

Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The most similar to what you are doing now would be to create a second columnfamily, ThreadMostRecentPosts, with timeuuid columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)

Related

Camel + CassandraQL : Process a table without putting all in memory

Goal: read a big Cassandra table, process line by line in parallel
Constraints:
not all rows in memory
no Spark, we have to use Camel
One shot, no need polling the table
I did a first working version with CassandraQL but this Component seems to be limited to one query with all in memory, I did not find mechanics like fetSize/fetchMoreResult. I looked CassandraProducer class, PollingConsumerPollingStrategy, ResultSetConversionStrategy... See nothing.
Could it be possible to read a table by chunks of 1000 elements for example, each chunk would generate an exchange lately split in different threads ?
I think that maybe the ProducerTemplate injecting first exchanges in the route could be the answer. But I don't undertand how I could manage the production exchange rate to avoid to have too many rows in memory (to do so, we would need for example to check the size of the next blocking queue, if more than X no consumed elements, wait before producing more messages).
Maybe there are other options to do something like this ?
Maybe I did not see the magic parameter in CassandraQL ?
Maybe we can override some parts of CassandraQL ?
Thank you
This is not going to be answer to be a your question but hope to kick off some discussion. But as someone learning Cassandra and spending bit of time, it made me thinking. And mainly targets fetSize/fetchMoreResult part of the question
First of all, two of your constraints are contradicting
Not all rows in memory
I don't want all them fetched at once
One shot, no need polling the partition
I don't want to go back to db more than once.
Unless what you actually you meant is
Not all rows in memory
I don't want all them fetched at once
You can go back to partition many times, as long as you go back straight to where you left it last time.
As long as the time it takes for the first page is same as time it takes for the second page. And the time it takes for the 19th Page is same as the time it takes for the 20th page.
i.e Not starting from the first row
So I am going to assume that what you meant is Second Scenario and go with it.
Queries for Cassandra are going to satisfy the following two:
They are going to have a restriction on clustering columns
They are already ordered by clustering columns
Now Consider the following table
department(partition key), firstName(clustering_key), personId(clustering_key), lastname, etc as normal cols
First query
select department, firstName, lastname, etc
from person
where department = 'depart1`
order by firstName ASC
limit 25;
Second query (lets say last record in the page had userId=25 and firstName=kavi)
select department, firstName, lastname, etc
from person
where department = 'depart1` and firstName='kavi' and userId > 25
order by firstName ASC
limit 25;
As you can see, we can easily construct a Cassandra query that brings each chunk with certain size in constant time.
Now back to integration framework
I remember this concept called watermark in mule where the endpoint can store and remember so that they can start from there next time. In this case, value of userId and firstName of the last record of the last page is the watermark. So they can issue the second. I am sure we should be able to do the same with camel
I hope I have convinced that polling is not an issue where each chunk is retrieved in constant time

How to get the last "row" in a cassandra's long row

In Cassandra, a row can be very long and store units of time relevant data. For example, one row could look like the following:
RowKey: "weather"
name=2013-01-02:temperature, value=90,
name=2013-01-02:humidity, value=23,
name=2013-01-02:rain, value=false",
name=2013-01-03:temperature, value=91,
name=2013-01-03:humidity, value=24,
name=2013-01-03:rain, value=false",
name=2013-01-04:temperature, value=90,
name=2013-01-04:humidity, value=23,
name=2013-01-04:rain, value=false".
9 columns of 3 days' weather info.
time is a primary key in this row. So the order of this row would be time based.
My question is, is there any way for me to do a query like: what is the last/first day's humidity value in this row? I know I could use a Order By statement in CQL but since this row is already sorted by time, there should be some way to just get the first/last one directly, instead of doing another sort. Or is cassandra optimizing it already with Order By statement under the hood?
Another way I could think of is, store another column in this row called "last_time_stamp" that always updates itself as new data is inserted in. But that would require one more update every time I insert new weather data.
Thanks for any suggestion!:)
Without seeing more of your actual table, I suggest using a timestamp (or timeuuid if there is a possibility for collisions) as the second component in a compound primary key. Using this, you can get the last "row" by selecting ORDER BY t DESC LIMIT 1.
You could also change the clustering order in your schema to order it naturally for "last N" queries.
Please see examples and linked resource in this answer.

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on; how many "index" items (or filters if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CF just to get basic data, I am curious to know if it's a good idea to include these above "filters" as columns (compound column segment)?
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.

Random exhaustive (non-repeating) selection from a large pool of entries

Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.
I don't really see a way I could implement this using slow NOT IN queries with large amount of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing some algorithm that might be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database
Create a linked list based index for each category from the IDs of documents belonging to the this category. Shuffle it
Create a Bloom Filter containing all of the entries viewed by a particular user
Traverse the index using the iterator, randomly select items using Bloom Filter to pick not viewed items.
If you track via a table what entries that the user has seen... try this. And I'm going to use mysql because that's the quickest example I can think of but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ("jj", 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id
where v.url_id is null
order by rand()
limit 1
This causes the database to go ahead and do a 1 for 1 join, and your limiting your query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this one operation but there's no guarantee that the url will be passed successfully to the user.
It depend on how users get it's random entries.
Option 1:
A user is paging some entities and stop after couple of them. for example the user see the current random entity and then moving to the next one, read it and continue it couple of times and that's it.
in the next time this user (or another) get an entity from this category the entities that already viewed is clear and you can return an already viewed entity.
in that option I would recommend save a (hash) set of already viewed entities id and every time user ask for a random entity- randomally choose it from the DB and check if not already in the set.
because the set is so small and your data is so big, the chance that you get an already viewed id is so small, that it will take O(1) most of the time.
Option 2:
A user is paging in the entities and the viewed entities are saving between all users and every time user visit your page.
in that case you probably use all the entities in each category and saving all the viewed entites + check whether a entity is viewed will take some time.
In that option I would get all the ids for this topic- shuffle them and store it in a linked list. when you want to get a random not viewed entity- just get the head of the list and delete it (O(1)).
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
I would opt for a pseudorandom approach:
1.) Determine number of elements in category to be viewed (SELECT COUNT(*) WHERE ...)
2.) Pick a random number in range 1 ... count.
3.) Select a single document (SELECT * FROM ... WHERE [same as when counting] ORDER BY [generate stable order]. Depending on the SQL dialect in use, there are different clauses that can be used to retrieve only the part of the result set you want (MySQL LIMIT clause, SQLServer TOP clause etc.)
If the number of documents is large the chance serving the same user the same document twice is neglibly small. Using the scheme described above you don't have to store any state information at all.
You may want to consider a nosql solution like Apache Cassandra. These seem to be ideally suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
edit: one of many possible solutions below:
create a CF(column family ie table) for each category (creating these on-the-fly is quite easy).
Add a row to each category CF for each document belonging to the category.
Whenever a user hits a document, you add a column with named and set it to true to the row. Obviously this table will be huge with millions of columns and probably quite sparsely populated, but no problem, reading this is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any result from select * where == null.
You should get constant time writes and reads, amazing scalability, etc if you can accept Cassandra's "eventually consistent" model (ie, it is not mission critical that a user never get a duplicate document)
I've solved similar in the past by indexing the relational database into a document oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
Store a users past X selections in a cookie or something.
Return the last selections to the server with the users new criteria
Randomly choose one of the texts satisfying the criteria until it is not a member of the last X selections of the user.
Return this choice of text and update the list of last X selections.
I would experiment to find the best value of X but I have in mind something like an X of say 16?

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has a capacity to hold a million rows with a primary Key as the ID. (random and unique)
b) What algorithm can be used to arrive at an ID that has not been used so far. This number will be used to insert another row into table BIGTABLE.
Updated the question with more details..
C) This table already has about 100 K rows and the primary key is not an set as identity.
d) Currently, a random number is generated as the primary key and a row inserted into this table, if the insert fails another random number is generated. the problem is sometimes it goes into a loop and the random numbers generated are pretty random, but unfortunately, They already exist in the table. so if we re try the random number generation number after some time it works.
e) The sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement, was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) returns bigint AS $$
SELECT ($1::bigint << 31) + $2
$$ LANGUAGE SQL;
CREATE FUNCTION random_pos_int() RETURNS integer AS $$
select floor((lift(1,0) - 1)*random())::integer
$$ LANGUAGE sql;
ALTER TABLE client ALTER COLUMN id SET DEFAULT
lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
Why is the unique ID Random? Why not use IDENTITY?
How was the ID chosen for the existing rows.
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random Identifier: make your PK two fields. An Identity Field and then a random number.
If you simply can't change the tables structure checking to see if the id exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
Make the key field UNIQUE and IDENTITY and you wont have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: Is this a planned database or a already functional one. If it already has data inside then the answer by bmdhacks is correct. If it is a planned database here is the second question:
Does your primary key really need to be random? If the answer is yes then use a function to create a random id from with a known seed and a counter to know how many Ids have been created. Each Id created will increment the counter.
If you keep the seed secret (i.e., have the seed called and declared private) then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
depending on your database you might have the option of either using a sequenser (oracle) or a autoincrement (mysql, ms sql, etc). Or last resort do a select max(id) + 1 as new id - just be carefull of concurrent requests so you don't end up with the same max-id twice - wrap it in a lock with the upcomming insert statement
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot strain on your app and database. And it could lead to 2 processes picking the same id.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to have a value in another table, lock on that table, read the value, calculate the value of the next id, write the value of the next id, release the lock. You can then use the id you read with the confidence your current process is the only one holding that unique value. Not sure how well it scales.
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
if the "hole" is found then save the ID in another table and you can use that as the starting point for the next case to save looping.
Skipping the reasoning of the task itself, the only algorithm that
will give you an ID not in the table
that will be used to insert a new line in the table
will result in a table still having random unique IDs
is generating a random number and then checking if it's already used
The best algorithm in that case is to generate a random number and do a select to see if it exists, or just try to add it if your database errs out sanely. Depending on the range of your key, vs, how many records there are, this could be a small amount of time. It also has the ability to spike and isn't consistent at all.
Would it be possible to run some queries on the BigTable and see if there are any ranges that could be exploited? ie. between 100,000 and 234,000 there are no ID's yet, so we could add ID's there?
Why not append your random number creator with the current date in seconds. This way the only way to have an identical ID is if two users are created at the same second and are given the same random number by your generator.

Resources