Cassandra update and sort on same column - sorting

I'm looking for input on Cassandra data modelling for a timeline-style feature. To store the timeline data, I'm planning to use a timeuuid column as the clustering key, which keeps the rows sorted. However, the same item can be updated, and when that happens I need to store a new timeuuid for it so that it moves up in the timeline. That means fetching the row with the previous timeuuid, deleting it, and inserting a new one, which doesn't seem very performant. How can I handle sorting and updating on the same column (in my case the timeuuid) to implement the timeline feature?

I suggest this schema:
CREATE TABLE timeline_idx (
    timeline_key text,
    time timeuuid,
    content_key text,
    PRIMARY KEY ((timeline_key), time)
);
CREATE TABLE timeline_content (
    content_key text,
    content blob,
    PRIMARY KEY (content_key)
);
timeline_idx gives you the content keys ordered as a timeline. You then retrieve the content from the second table, timeline_content, which is not ordered and has no clustering key, so you can update your content without knowing its timeuuid. I chose the text type for timeline_key and content_key, but you can use whatever type you want as long as it identifies timelines and contents uniquely.
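As a rough illustration of how the two tables are used together, here is a minimal Python sketch. It assumes `session` behaves like a session from the DataStax Python driver (an `execute(query, params)` method returning rows with attribute access); the function names `read_timeline` and `update_content` are made up for this example. It only covers reading the timeline and updating content in place.

```python
def read_timeline(session, timeline_key, limit=20):
    """Return (content_key, content) pairs for one timeline, newest first."""
    # Step 1: the index table yields content keys in timeuuid order.
    idx_rows = session.execute(
        "SELECT content_key FROM timeline_idx "
        "WHERE timeline_key = %s ORDER BY time DESC LIMIT %s",
        (timeline_key, limit),
    )
    # Step 2: look up each content row by its partition key.
    result = []
    for idx_row in idx_rows:
        content_rows = session.execute(
            "SELECT content FROM timeline_content WHERE content_key = %s",
            (idx_row.content_key,),
        )
        for content_row in content_rows:
            result.append((idx_row.content_key, content_row.content))
    return result


def update_content(session, content_key, new_content):
    """Overwrite the content in place; timeline_idx is not touched."""
    session.execute(
        "UPDATE timeline_content SET content = %s WHERE content_key = %s",
        (new_content, content_key),
    )
```

With a real cluster, `session` would come from something like `Cluster([...]).connect(keyspace)`; the keyspace and contact points are not part of the original answer.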

Related

DynamoDb delete with sort key

I have the following fields in a DynamoDB table:
event_on -- string type
user_id -- number type
event name -- string type
Since this table may have multiple records per user_id, and event_on is the only field that can be unique, I made event_on the primary (partition) key and user_id the sort key.
Now I want to delete all records of a user, so my code is:
response = dynamodb.delete_item(
TableName=events,
Key={
"user_id": {"N": str(userId)}
})
It throws this error:
Exception occured An error occurred (ValidationException) when calling
the DeleteItem operation: The provided key element does not match the
schema
Also, is there any way to delete by range?
Can someone suggest what I should change in the DynamoDB table structure to make this code work?
Thanks,
It sounds like you've modeled your data using a composite primary key, which means you have both a partition key and a sort key. Here's an example of what that looks like with some sample data.
In DynamoDB, the most efficient way to access items (aka "rows" in RDBMS language) is by specifying either the full primary key (getItem) or the partition key (query). If you want to search by any other attribute, you'll need to use the scan operation. Be very careful with scan, since it can be a costly way (both in performance and money) to access your data.
When it comes to deletion, you have a few options.
deleteItem - Deletes a single item in a table by primary key.
batchWriteItem - The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests.
TimeToLive - You can use DynamoDB's Time To Live (TTL) feature to delete items you no longer need. Keep in mind that TTL only marks your items for deletion, and the actual deletion could take up to 48 hours.
To use any of these options effectively, you'll first need to identify which items you want to delete. Because you want to fetch using the value of the sort key alone, you have two options:
1. Use scan to find the items of interest. This is not ideal, but it is an option if you cannot change your data model.
2. Create a global secondary index (GSI) that swaps your partition key and sort key values. This pattern is called an inverted index. It would allow you to identify all items with a given user_id.
If you choose option 2, your data would look like this
This would allow you to fetch all items for a given user, which you could then delete using one of the methods I outlined above.
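A minimal sketch of the inverted-index approach in Python might look like the following. It assumes `table` is a boto3 DynamoDB `Table` resource, that a GSI with user_id as its partition key already exists, and that the index name `user_id-event_on-index` is a placeholder you would replace with your own. The function name `delete_user_events` is made up for this example.

```python
def delete_user_events(table, user_id, index_name="user_id-event_on-index"):
    """Delete every item for user_id; return how many deletes were issued."""
    # index_name is a hypothetical GSI name with user_id as its partition key.
    query_args = {
        "IndexName": index_name,
        "KeyConditionExpression": "user_id = :uid",
        "ExpressionAttributeValues": {":uid": user_id},
    }
    deleted = 0
    # batch_writer() buffers the deletes and sends them as BatchWriteItem calls.
    with table.batch_writer() as batch:
        while True:
            page = table.query(**query_args)
            for item in page["Items"]:
                # DeleteItem needs the table's full key: partition key and sort key.
                batch.delete_item(
                    Key={"event_on": item["event_on"], "user_id": item["user_id"]}
                )
                deleted += 1
            # Query results are paginated; continue from where the last page ended.
            if "LastEvaluatedKey" not in page:
                break
            query_args["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return deleted
```

Usage would be along the lines of `delete_user_events(boto3.resource("dynamodb").Table("events"), 42)`, where the table name is taken from the question and the user id is arbitrary.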
delete_item needs the full primary key (partition key and sort key), not the sort key alone. To delete by sort key only, you would have to do a full scan and delete every item that has the given sort key value.
If you created the DynamoDB table with both a partition key and a sort key, you must provide both values to delete items from that table.
If no sort key was defined when the table was created, an item can be deleted by its partition key alone.
How I solved it:
I chose not to add a sort key when I created the table, and I use indexes for sorting and fetching items.

HBase row key design for reads and updates

I'm trying to understand the best way to design the key for my HBase table.
My use case :
Structure right now
PersonID | BatchDate | PersonJSON
When something about a person is modified, a new PersonJSON and a new BatchDate are inserted into HBase, updating the old record. Every 4 hours, a scan of all the people who were modified is pushed to Hadoop for further processing.
If my key is just PersonID, it's great for updating the data. But performance suffers because I have to add a filter on the BatchDate column to scan all the rows newer than a given batch date.
If my key is a composite key like BatchDate|PersonID, I could use startrow and endrow on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since the key is no longer unique per person, and I could no longer update a person in place.
Is a bloom filter on row+col (PersonID+BatchDate) an option?
Any help is appreciated.
Thanks,
Abhishek
In addition to the table with PersonID as the rowkey, it sounds like you need a dual-write secondary index, with BatchDate as the rowkey.
Another option would be Apache Phoenix, which provides support for secondary indexes.
I usually do this in two steps:
Create table one whose key is the combination BatchDate+PersonID; the value can be empty.
Create table two as you normally would: the key is PersonID and the value is the whole record.
For a date-range query, query table one first to get the PersonIDs, then use the HBase batch get API to fetch the data in batches. This should be very fast.
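The two-table pattern above could be sketched in Python roughly as follows. It assumes `index_table` and `data_table` behave like HappyBase `Table` objects (`put`, `scan`, `rows`), that BatchDate is encoded as fixed-width bytes that sort chronologically (for example `b"20240102"`), and that the column family `d` and the helper names are made up for this example.

```python
SEP = b"|"  # separator between BatchDate and PersonID in the index row key


def write_person(index_table, data_table, person_id, batch_date, person_json):
    """Dual write: full record keyed by PersonID, plus an index row."""
    data_table.put(person_id, {b"d:json": person_json, b"d:batch_date": batch_date})
    # The index row carries no payload; the row key itself is the information.
    index_table.put(batch_date + SEP + person_id, {b"d:x": b""})


def modified_since(index_table, data_table, start_date, stop_date=None):
    """Return (person_id, columns) for people modified in [start_date, stop_date)."""
    seen = set()
    person_ids = []
    # Range scan on the index table; no column filter is needed.
    for row_key, _ in index_table.scan(row_start=start_date, row_stop=stop_date):
        person_id = row_key.split(SEP, 1)[1]
        # A person modified several times appears once per batch; keep one entry.
        if person_id not in seen:
            seen.add(person_id)
            person_ids.append(person_id)
    # One batch get against the data table returns the latest record per person.
    return data_table.rows(person_ids) if person_ids else []
```

Old index rows are never removed in this sketch, so a person modified several times appears in several batch ranges; whether to clean those up depends on how the 4-hourly export is used.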

Incremental loading with pig?

I have a few sources of data (let's say these are users events).
Each source has its own event definitions.
While loading into HCatalog, I create one table with event_definitions (event_id, event_code, event_source, event_description) and one with events (event_id, date, etc.).
These tables are unions of all the data from the source tables.
event_definitions.event_id is a surrogate key, which is a foreign key for events.event_id
This key is taken from pig RANK function.
And everything works fine while initial loading.
But how do I handle incremental loading? New rows in the event_definitions table must get a surrogate key value bigger than the last one. With RANK there is no possibility to start from a specific number; it always starts at 1.
How do you handle these situations?
Regards
Paweł

Should one store a search_data tsvector in the same table or external table?

I am implementing full text search in postgres.
I would like to search all posts in my system. The posts fulltext index is an amalgamation of the post title and post body.
I have two ways of achieving this:
Create a tsvector column in the posts table and keep it updated with a trigger.
Create a second table (posts_search) with a post_id and a tsvector column containing the index data.
Create a simple GIN index ... (out of the question, because my real-world problem needs data from multiple tables for the index).
Which is going to perform better, considering I sometimes need to filter the search by other attributes in the table (like deleted_at is null and so on)?
Is it better to keep the tsvector column in the same table as the data (side effect: select * now sucks) or in a separate table (side effects: a join is required and index filtering is complicated)?
In my experiments, the typical size of a tsvector column is about 1% of the size of the text field it was computed from using to_tsvector().
With this in mind, storing the tsvector column in another table should provide a performance benefit. For example, even if you do not use SELECT * (and you shouldn't, really), any seqscan on the original single table will still have to load the pages that contain the original text. If you offload the tsvector field to a separate table, a scan of that table has roughly 100x less data to load.
In other words, I would favor the second solution of offloading the tsvector field to a separate table. Or, alternatively, offloading the posts (original text) deeper into your table hierarchy (but I guess it is almost the same thing).
Note that the original text is not necessary for full text search to work. You may even want to not store it in the database, or store it in a highly compressed format (and not necessarily easily accessible by SQL routines). It would work as long as something can create the tsvector from the original text and update it when the text changes.

Teradata: How to design table to be normalized with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns are going to need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table, since this would be an atrocious repetition of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, Weekly and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large amount of foreign key fields and am fairly new to Teradata. Before I go on the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type          Value
---  ------------  ------------
1    Interval      Daily
2    Interval      Monthly
3    Interval      Weekly
4    TimeFrame     24x7
5    TimeFrame     8x5
Edit Part 2: Added a new tag to get more exposure to this question.
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type          Value
---  ------------  ------------
0    Unknown       Unknown
Then in the main table, instead of having fields as null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which will help with performance.
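To make the effect of the extra row concrete, here is a small sketch using Python's built-in sqlite3 module rather than Teradata, so the DDL is only an approximation. The table names `primitives` and `main_table` and the column names are hypothetical. It shows that with 0 stored instead of NULL, plain inner joins still return every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE primitives (id INTEGER PRIMARY KEY, type TEXT, value TEXT);
    INSERT INTO primitives VALUES
        (0, 'Unknown', 'Unknown'),
        (1, 'Interval', 'Daily'),
        (4, 'TimeFrame', '24x7');
    -- main_table is hypothetical; 0 stands in for "no value" instead of NULL.
    CREATE TABLE main_table (
        name TEXT,
        interval_id INTEGER NOT NULL DEFAULT 0 REFERENCES primitives(id),
        timeframe_id INTEGER NOT NULL DEFAULT 0 REFERENCES primitives(id)
    );
    INSERT INTO main_table VALUES ('job_a', 1, 4);
    INSERT INTO main_table (name, interval_id) VALUES ('job_b', 1);
""")

# Inner joins are enough: job_b has no time frame but still matches row 0.
rows = conn.execute("""
    SELECT m.name, i.value, t.value
    FROM main_table m
    JOIN primitives i ON i.id = m.interval_id
    JOIN primitives t ON t.id = m.timeframe_id
    ORDER BY m.name
""").fetchall()
print(rows)
```

Had timeframe_id been NULL for job_b, the inner join would have dropped that row and a LEFT JOIN would have been needed.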
