DynamoDb delete with sort key - aws-lambda

I have the fields below in a DynamoDB table:
event_on -- string type
user_id -- number type
event name -- string type
Since this table may have multiple records for a user_id, and event_on is the single field that can be unique, I made event_on the primary (partition) key and user_id the sort key.
Now I want to delete all the records of a user, so my code is:
response = dynamodb.delete_item(
    TableName=events,
    Key={
        "user_id": {"N": str(userId)}
    })
It throws this error:
Exception occured An error occurred (ValidationException) when calling
the DeleteItem operation: The provided key element does not match the
schema
Also, is there any way to delete by the range (sort) key alone?
Can someone suggest what I should do with the DynamoDB table structure to make this code work?
Thanks,

It sounds like you've modeled your data using a composite primary key, which means you have both a partition key and a sort key: here, event_on is the partition key and user_id is the sort key.
In DynamoDB, the most efficient way to access items (aka "rows" in RDBMS language) is by specifying either the full primary key (getItem) or the partition key (query). If you want to search by any other attribute, you'll need to use the scan operation. Be very careful with scan, since it can be a costly way (both in performance and money) to access your data.
When it comes to deletion, you have a few options.
deleteItem - Deletes a single item in a table by its full primary key (see the sketch after this list).
batchWriteItem - The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests.
TimeToLive - You can utilize DynamoDB's Time to Live (TTL) feature to delete items you no longer need. Keep in mind that TTL only marks your items for deletion and the actual deletion could take up to 48 hours.
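For the table in the question, deleting a single item means supplying both key attributes. Here is a minimal boto3 sketch; the event_on value, the example user ID, and the table name string are assumptions:

import boto3

dynamodb = boto3.client("dynamodb")
user_id = 42  # example user

# Delete one item by its full composite primary key: partition key + sort key.
response = dynamodb.delete_item(
    TableName="events",                             # table name assumed from the question
    Key={
        "event_on": {"S": "2024-01-01T00:00:00"},   # partition key (example value)
        "user_id": {"N": str(user_id)},             # sort key; numbers are passed as strings
    },
)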
In order to effectively use any of these options, you'll first need to identify which items you want to delete. Because you want to fetch using the value of the sort key alone, you have two options:
Use scan to find the items of interest. This is not ideal but is an option if you cannot change your data model.
Create a global secondary index (GSI) that swaps your partition key and sort key values. This pattern is called an inverted index. This would allow you to identify all items with a given user_id.
If you choose option 2, the index would hold the same items keyed the other way around: user_id as the partition key and event_on as the sort key.
This would allow you to fetch all items for a given user, which you could then delete using one of the methods I outlined above.
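Putting that together, here is a rough boto3 sketch: query the inverted index for all of a user's items, then delete each one with a batch writer. The GSI name, table name, and example user ID are assumptions, and result pagination is omitted for brevity.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")      # table name assumed from the question
user_id = 42                          # example user whose events should be deleted

# 1. Use the inverted GSI (user_id as partition key) to find every item for the user.
response = table.query(
    IndexName="user_id-event_on-index",                 # hypothetical GSI name
    KeyConditionExpression=Key("user_id").eq(user_id),
)

# 2. Delete each item from the base table by its full composite primary key.
with table.batch_writer() as batch:
    for item in response["Items"]:
        batch.delete_item(Key={"event_on": item["event_on"], "user_id": item["user_id"]})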

As documented, delete_item needs the full primary key, which here is the partition key plus the sort key; you cannot delete by the sort key alone. Without an index on user_id, you would have to do a full scan and delete every item that carries the given sort key value.
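A sketch of that scan-and-delete fallback with boto3, if no secondary index is available; the table name and example user ID are assumptions, and pagination is omitted:

import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")   # table name assumed from the question

# Scan the entire table for the user's items; this reads every item and can be slow and costly.
response = table.scan(FilterExpression=Attr("user_id").eq(42))

# Delete each match by its full composite primary key.
with table.batch_writer() as batch:
    for item in response["Items"]:
        batch.delete_item(Key={"event_on": item["event_on"], "user_id": item["user_id"]})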

If you created a DynamoDB table with both a partition key and a sort key, you must provide both values to remove an item from that table.
If no sort key was added to the primary key when the table was created, a record can be removed by the partition key alone.
How I solved it:
I did not add a sort key when creating the table, and I use indexes for sorting and fetching items.

Related

DynamoDB with AWS Lambda function Order by Desc Order with scan

I am trying to create an AWS Lambda function using Node.js and scan records from DynamoDB, but it gives me the records in random order. I would like to fetch the top 5 records that were most recently added to the table, sorted by Timestamp, so I can get the latest 5. Anyone have an idea? Please help me out.
DynamoDB does not (and does not intend to) support ordering in its scan operation. Ordering is supported in query operations.
To get the behavior you want you can do the following (with one caveat, see below):
Make sure that each record in your table has an attribute (let's call it x) which always holds the same value (it does not matter which value; let's say it is always "y").
Define a global secondary index on your table. The key of that index should use x as the partition key (aka "hash key") and the Timestamp field as the sort key.
Then you can issue a query action on that index. "Query results are always sorted by the sort key value" (per the Query documentation), which is exactly what you need (see the sketch below).
The caveat: this means that your index will hold all records of your table under the same partition key. This goes against DynamoDB best practices (see Choosing the Right DynamoDB Partition Key) and will not scale for large tables (more than tens of GB).
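The question is about Node.js, but the same query is easy to sketch with boto3; the table name and GSI name are assumptions, and the attributes x and Timestamp follow the description above:

import boto3

dynamodb = boto3.client("dynamodb")

# Query the GSI whose partition key is the constant attribute x and whose sort key is Timestamp.
response = dynamodb.query(
    TableName="Records",                            # hypothetical table name
    IndexName="x-Timestamp-index",                  # hypothetical GSI name
    KeyConditionExpression="x = :x",
    ExpressionAttributeValues={":x": {"S": "y"}},   # every item carries the same constant value
    ScanIndexForward=False,   # sort descending: newest Timestamp first
    Limit=5,                  # only the latest 5 records
)
latest_five = response["Items"]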

Alternative to ORA_HASH?

We are working with a table in a 3rd party database that does not have a primary key but does have a unique index.
I have therefore been looking at using the ORA_HASH function to produce a de facto unique Id by passing in the values of the columns in the unique index.
Unfortunately, I can already see that we have a few collisions, which means that we can't derive a unique id using this method.
Is there an alternative to ORA_HASH that would provide a unique id for a unique input?
I suppose I could generate an Id using DBMS_CRYPTO.Hash but I'd ideally like to get a numeric value.
Edit
The added complication is that I then need to store these records in another (SQL Server) database and then compare the records from the original and the replica tables. So rank doesn't help me here since records can be added or deleted in the original table.
DBMS_CRYPTO.HASH could be used to generate a high-bit hash (high enough to give you a very low, but not zero, chance of collisions), but it returns 'RAW' not 'NUMBER'.
To guarantee no collisions ever, you need a one-to-one hash function. As far as I know, Oracle does not provide one.
A practical approach would be to create a new table to map unique keys to a newly generated primary key. E.g., unique value ("ABC",123, 888) maps to 838491 (where you generated 838491 using a sequence).
You'd have to update the mapping table periodically, to account for inserted rows, and that would be a pain, but it would let you generate your own PKs and keep track of them without a lot of complication.
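As a rough illustration of the hash-to-number idea outside Oracle, here is a Python sketch; hashlib stands in for DBMS_CRYPTO.HASH and the column values are made up. Even a 256-bit digest only makes collisions astronomically unlikely, not impossible, which is why the mapping table above is the only way to guarantee uniqueness.

import hashlib

def composite_hash(*cols):
    # Concatenate the unique-index column values with a separator and hash them.
    raw = "|".join(str(c) for c in cols).encode("utf-8")
    digest = hashlib.sha256(raw).digest()
    # Interpret the raw digest as a (very large) integer to obtain a numeric value.
    return int.from_bytes(digest, "big")

# Example with made-up unique-index values.
print(composite_hash("ABC", 123, 888))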
Have you tried:
DBMS_UTILITY.GET_HASH_VALUE (
   name       VARCHAR2,
   base       NUMBER,
   hash_size  NUMBER)
RETURN NUMBER;

Queries in DynamoDB

I have an application written in Node.js that needs to find ONE row based on a city name (this could just be the table's name; different cities will be categorized as different tables) and a field named "currentJobLoads", which is a number. For example, a user might want to find ONE row with the city name "Chicago" and the lowest currentJobLoads. How can I achieve this in DynamoDB without scan operations (since scan would be slower and can only read so much data before it gets terminated)? Any suggestions would be highly appreciated.
You didn't specify what your current partition key and sort key for the table are, but I'm guessing the currentJobLoads field isn't one of them. So you would need to create a Global Secondary Index on the currentJobLoads field, at which point you will be able to run query operations against that field.
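A boto3 sketch of that query, assuming (instead of one table per city) a single table with a GSI whose partition key is the city name and whose sort key is currentJobLoads; every name here is hypothetical:

import boto3

dynamodb = boto3.client("dynamodb")

# Ask the GSI for the single Chicago item with the lowest currentJobLoads.
response = dynamodb.query(
    TableName="JobLoads",                        # hypothetical table name
    IndexName="city-currentJobLoads-index",      # hypothetical GSI name
    KeyConditionExpression="city = :c",
    ExpressionAttributeValues={":c": {"S": "Chicago"}},
    ScanIndexForward=True,   # ascending by the sort key, so the lowest load comes first
    Limit=1,                 # only one row is needed
)
lowest = response["Items"][0] if response["Items"] else None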

Cassandra update and sort on same column

I'm looking for some input around Cassandra data modelling for a timeline kind of feature. To store data for the timeline, I'm planning to use a timeuuid in Cassandra and make it a clustering key. This will help in sorting the data. But the same data can be updated, and I need to store the updated timeuuid corresponding to the data so that it can be pushed up in the timeline. This involves fetching the previous data-timeuuid row, deleting it, and inserting a new one, which doesn't seem performant. How can I handle sorting and updating on the same column (in my case timeuuid) to implement the timeline feature?
I suggest this schema to you:
CREATE TABLE timeline_idx (
    timeline_key text,
    time timeuuid,
    content_key text,
    PRIMARY KEY ((timeline_key), time)
);
CREATE TABLE timeline_content (
    content_key text,
    content blob,
    PRIMARY KEY (content_key)
);
Timeline_idx is used to give you the content keys ordered as a timeline. Then you can retrieve the content in a second table called timeline_content. It is not ordered and there is no clustering key. You can update your content without knowing its timeuuid. I chose the text type for timeline_key and content_key, but you can choose whatever you want as long as it identifies timelines and contents uniquely.
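A rough sketch of reading a timeline with the Python cassandra-driver against this schema; the contact point, keyspace, timeline key value, and page size are assumptions:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # contact point assumed
session = cluster.connect("my_keyspace")   # keyspace name assumed

# 1. Read the index table newest-first to get content keys in timeline order.
idx_rows = session.execute(
    "SELECT content_key FROM timeline_idx "
    "WHERE timeline_key = %s ORDER BY time DESC LIMIT 20",
    ("user-42-timeline",),
)

# 2. Fetch each piece of content from the unordered content table.
timeline = [
    session.execute(
        "SELECT content FROM timeline_content WHERE content_key = %s",
        (row.content_key,),
    ).one()
    for row in idx_rows
]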

HBase row key design for reads and updates

I'm trying to understand the best way to design the row key for my HBase table.
My use case:
Structure right now
PersonID | BatchDate | PersonJSON
When something about the person is modified, a new PersonJSON and a new BatchDate are inserted into HBase, updating the old record. And every 4 hours, all the people who were modified are scanned and pushed to Hadoop for further processing.
If my key is just PersonID, it is great for updating the data, but read performance suffers because I have to add a filter on the BatchDate column and scan all rows greater than a given batch date.
If my key is a composite key like BatchDate|PersonID, I could use startrow and endrow on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since the key is not unique, and I could no longer update a person in place.
Is a bloom filter on row+col (PersonID+BatchDate) an option?
Any help is appreciated.
Thanks,
Abhishek
In addition to the table with PersonID as the rowkey, it sounds like you need a dual-write secondary index, with BatchDate as the rowkey.
Another option would be Apache Phoenix, which provides support for secondary indexes.
I usually do two steps:
Create table one, whose key is the combination BatchDate+PersonID; the value can be empty.
Create table two just as you do now: the key is PersonID and the value is the whole data.
For a date-range query, query table one first to get the PersonIDs, then use the HBase batch get API to fetch the data in one batch (see the sketch below). It is very fast.
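A sketch of those two steps with the happybase Python client; the host, table names, key layout, and date values are all assumptions:

import happybase

connection = happybase.Connection("hbase-host")        # hostname assumed
idx_table = connection.table("person_by_batchdate")    # table one: BatchDate+PersonID row keys
data_table = connection.table("person")                # table two: PersonID row keys

# 1. Range-scan table one for every row key between two batch dates.
person_ids = [
    key.split(b"|", 1)[1]    # assumes keys shaped like b"<BatchDate>|<PersonID>"
    for key, _ in idx_table.scan(row_start=b"2024-01-01", row_stop=b"2024-01-02")
]

# 2. Batch-get the full person records from table two in one round trip.
people = data_table.rows(person_ids)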
