Listing and Deleting Data from DynamoDB in parallel - aws-lambda

I am using Lambdas and an SQS queue to delete data from DynamoDB. Earlier, when I was developing this, I found that the only way to delete data from DynamoDB is to gather the data you want to delete and delete it in batches.
At my current organization, most of the infrastructure is serverless, so I decided to build this piece following a serverless, event-driven architecture as well.
In a nutshell, I post a message on the SQS queue to delete items under a particular partition. Once this message invokes my Lambda, I perform a listing call to DynamoDB for 1000 items and do the following:
Grab the cursor from this listing call, and post another message to fetch the next 1000 items from this cursor.
import { DynamoDBClient, QueryCommand } from '@aws-sdk/client-dynamodb';
const dbClient = new DynamoDBClient(config);
const records = await dbClient.send(new QueryCommand(fetchFirst1000ItemsForPrimaryKey));
postMessageToFetchNextItems();
From the fetched 1000 items:
I create batches of 20 items and issue a set of messages for another Lambda to delete these items. A batch of 20 items is posted for deletion until all 1000 have been posted.
for (let i = 0; i < 1000; i += 20) {
  const itemsToDelete = records.slice(i, i + 20);
  postItemsForDeletion(itemsToDelete);
}
Another Lambda gets these items and just deletes them:
dbClient.send(new BatchWriteItemCommand({ RequestItems: itemsForDeletion }));
The listing Lambda receives the call to read items from the next cursor, and the above steps get repeated.
This all happens in parallel: get items, post a message to grab the next 1000 items, and post messages for deletion of items.
While this looks good on paper, it doesn't seem to delete all records from DynamoDB. There is no set pattern; some items always remain in the table. I am not entirely sure what could be happening, but my theory is that the parallel deletion and listing could be causing the issue.
I was unable to find any documentation to verify my theory and hence this question here.

A batch write items call will return a list of unprocessed items. You should check for that and retry them.
Look at the docs for https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-dynamodb/classes/batchwriteitemcommand.html and search for UnprocessedItems.
Fundamentally, a batch write items call is not a transactional write. It's possible for some item writes to succeed while others fail. It's on you to check for failures and retry them. I'm sorry I don't have a link for good sample code.
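For illustration, here is a rough sketch of the retry loop described above (this is not official AWS sample code; the table name, key shape, and the batchDeleteWithRetry/deleteRequests names are placeholders, not part of the question's code):

import { DynamoDBClient, BatchWriteItemCommand } from '@aws-sdk/client-dynamodb';

const dbClient = new DynamoDBClient({});

// deleteRequests: [{ DeleteRequest: { Key: { pk: { S: '...' } } } }, ...] (max 25 per call)
async function batchDeleteWithRetry(tableName, deleteRequests) {
  let requestItems = { [tableName]: deleteRequests };
  while (Object.keys(requestItems).length > 0) {
    const { UnprocessedItems } = await dbClient.send(
      new BatchWriteItemCommand({ RequestItems: requestItems })
    );
    // Anything DynamoDB could not write comes back here; loop until it is empty.
    requestItems = UnprocessedItems ?? {};
    // Consider an exponential backoff between retries to avoid being throttled again.
  }
}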

Related

Save consumer/tailer read offset for ChronicleQueue

I am exploring ChronicleQueue to save events generated in one of my applications. I would like to publish the saved events to a different system, in their original order of occurrence, after some processing. I have multiple instances of my application, and each instance could run a single-threaded appender to append events to the ChronicleQueue. Although ordering across instances is a necessity, I would like to understand these 2 questions.
1) How would the read index for my events be saved so that I don't end up reading and publishing the same message from the Chronicle Queue multiple times?
In the below code (picked from the example on GitHub), the index is saved until we reach the end of the queue when we restart the application. The moment we reach the end of the queue, we end up reading all the messages again from the start. I want to make sure that, for a particular consumer identified by a tailer ID, the messages are read only once. Do I need to save the read index in another queue and use that to achieve what I need here?
String file = "myPath";
try (ChronicleQueue cq = SingleChronicleQueueBuilder.binary(file).build()) {
    for (int i = 0; i < 10; i++) {
        cq.acquireAppender().writeText("test" + i);
    }
}
try (ChronicleQueue cq = SingleChronicleQueueBuilder.binary(file).build()) {
    ExcerptTailer atailer = cq.createTailer("a");
    System.out.println(atailer.readText());
    System.out.println(atailer.readText());
    System.out.println(atailer.readText());
}
2) I also need some suggestions on whether there is a way to preserve the ordering of events across instances.
Using a named tailer should ensure that the tailer only reads a message once. If you have an example where this doesn't happen, can you create a test to reproduce it?
The order of entries in a queue is fixed when writing, and all tailers see the same messages in the same order; there isn't any option to change this.
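As a rough illustration of the named-tailer behaviour (a sketch based on the question's own example, not a verified test): reopening the queue and asking for the tailer with the same name should resume from the position that tailer last read, rather than starting again at the first message.

try (ChronicleQueue cq = SingleChronicleQueueBuilder.binary(file).build()) {
    // Same name "a" as before, so the stored read position for that tailer is reused.
    ExcerptTailer atailer = cq.createTailer("a");
    // Continues after the last message read in the previous run instead of "test0".
    System.out.println(atailer.readText());
}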

IBM Integration Bus, best practices for calling multiple services

So I have this requirement that takes in one document, and from that needs to create one or more documents in the output.
During the course of this, it needs to determine if the document is already there, because there are different operations to apply for the create and update scenarios.
In straight code, this would be simple (conceptually)
InputData in = <something>
if (getItemFromExternalSystem(in.key1) == null) {
createItemSpecificToKey1InExternalSystem(in.key1);
}
if (getItemFromExternalSystem(in.key2) == null) {
createItemSpecificToKey2InExternalSystem(in.key1, in.key2);
}
createItemFromInput(in.key1,in.key2, in.moreData);
In effect a kind of "ensure this data is present".
However, how would I go about achieving this in IIB? If I used a subflow for the get/create cycle, the output of the subflow (whatever the result of the last operation is) would be returned from the subflow as the new "message" of the flow. But really, I don't care about the value from the "ensure data present" subflow. I need instead to keep working on my original message, yet still wait for the different subflows to finish before I can run my final "createItem".
You can use Aggregation nodes: for example, use 3 flows:
the first would propagate your original message to the third,
the second would invoke the operations createItemSpecificToKey1InExternalSystem and createItemSpecificToKey2InExternalSystem,
the third would aggregate the results of the first and second and invoke createItemFromInput.
Have you considered using the Collector node? It will collect your records into N 'collections', and then you can iterate over the collections and output one document per collection.

How to insert a batch of records into Redis

In a twitter-like application, one of the things they do is that when someone posts a tweet, they iterate over all followers and create a copy of the tweet in each follower's timeline. I need something similar. What is the best way to insert a tweet ID into the lists of, say, 10/100/1000 followers, assuming I have a list of follower IDs?
I am doing it within Azure WebJobs using Azure Redis. A webjob is automatically created for every tweet received in the queue, so I may have around 16 simultaneous jobs running at the same time, where each one goes through the followers and inserts tweets. I'm thinking that if 99% of the inserts succeed, they should not stop because one or a few have failed; I need to continue but log the failures.
Question: Should I use CreateBatch like below? If I need to retrieve the latest tweets first, in reverse chronological order, is the below fine? Performant?
var tasks = new List<Task>();
var batch = _cache.CreateBatch();
//loop start
tasks.Add(batch.ListRightPushAsync("follower_id", "tweet_id"));
//loop end
batch.Execute();
await Task.WhenAll(tasks.ToArray());
a) But how do I catch it if something fails? try/catch?
b) How do I check, within a batch, for the total number of items in each list and pop one out if it reaches a certain count? I want to do a LeftPop if the list length is > 800. Not sure how to do it all inside the batch.
Please point me to a sample or let me have a snippet here. Struggling to find a good way. Thank you so much.
UPDATE
Does this look right based on @marc's comments?
var tasks = new List<Task>();
followers.ForEach(f =>
{
    var key = f.FollowerId;
    var task = _cache.ListRightPushAsync(key, value);
    task.ContinueWith(t =>
    {
        if (t.Result > 800) _cache.ListLeftPopAsync(key).Wait();
    });
    tasks.Add(task);
});
Task.WaitAll(tasks.ToArray());
CreateBatch probably doesn't do what you think it does. What it does is defer a set of operations and ensure they get sent contiguously relative to a single connection - there are some occasions this is useful, but not all that common - I'd probably just send them individually if it was me. There is also CreateTransaction (MULTI/EXEC), but I don't think that would be a good choice here.
That depends on whether you care about the data you're popping. If not: I'd send a LTRIM, [L|R]PUSH pair - to trim the list to (max-1) before adding. Another option would be Lua, but it seems overkill. If you care about the old data, you'll need to do a range query too.
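As a rough sketch of that LTRIM/RPUSH pairing with StackExchange.Redis (this is not code from the answer itself; it reuses the question's _cache, followers and 800 cap, and tweetId is a placeholder for the value being appended):

const int maxLen = 800;
var tasks = new List<Task>();
foreach (var follower in followers)
{
    RedisKey key = follower.FollowerId;
    // Keep only the newest (maxLen - 1) entries, then push the new tweet,
    // so the list never grows beyond maxLen and no old data is read back.
    tasks.Add(_cache.ListTrimAsync(key, -(maxLen - 1), -1));
    tasks.Add(_cache.ListRightPushAsync(key, tweetId));
}
await Task.WhenAll(tasks);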

Delete a huge amount of files in Rackspace using fog

I have millions of files in Rackspace Cloud Files. I would like to delete a part of them by passing lists of file names instead of deleting one by one, which is very slow. Is there any way to do this with fog? Right now, I have a script that deletes each file, but it would be nice to have something with better performance.
connection = Fog::Storage.new({
:provider => 'Rackspace',
:rackspace_username => "xxxx",
:rackspace_api_key => "xxxx",
:rackspace_region => :iad
})
dir = connection.directories.select {|d| d.key == "my_directory"}.first
CloudFileModel.where(duplicated: 1).each do |record|
f = record.file.gsub("/","")
dir.files.destroy(f) rescue nil
puts "deleted #{record.id}"
end
Yes, you can with delete_multiple_objects.
Deletes multiple objects or containers with a single request.
To delete objects from a single container, container may be provided and object_names should be an Array of object names within the container.
To delete objects from multiple containers or delete containers, container should be nil and all object_names should be prefixed with a container name.
Containers must be empty when deleted. object_names are processed in the order given, so objects within a container should be listed first to empty the container.
Up to 10,000 objects may be deleted in a single request. The server will respond with 200 OK for all requests. response.body must be inspected for actual results.
Examples:
Delete objects from a container
object_names = ['object', 'another/object']
conn.delete_multiple_objects('my_container', object_names)
Delete objects from multiple containers
object_names = ['container_a/object', 'container_b/object']
conn.delete_multiple_objects(nil, object_names)
Delete a container and all its objects
object_names = ['my_container/object_a', 'my_container/object_b', 'my_container']
conn.delete_multiple_objects(nil, object_names)
To my knowledge, the algorithm included here is the most reliable and highest-performance algorithm for deleting a Cloud Files container along with any objects it contains. The algorithm could be modified for your purposes by including a parameter with the names of items to delete instead of calling ListObjects. At the time of this writing, there is no server-side functionality (i.e. bulk operation) capable of meeting your needs in a timely manner. Bulk operations are rate limited to 2-3 delete operations per second, so at least 55 minutes per 10,000 items you delete.
The following code shows the basic algorithm (slightly simplified from the syntax that is actually required in the .NET SDK). It assumes that no other clients are adding objects to the container at any point after execution of this method begins.
Note that you will be rate limited to a maximum of 100 delete operations per second per container which contains files. If multiple containers are involved, distribute your concurrent requests to round-robin the requests to each of the containers. Adjust your concurrency level to the value that approaches the hard rate limit. Using this algorithm has allowed me to reach long-term sustained deletion rates of over 450 objects/second when multiple containers were involved.
public static void DeleteContainer(
    IObjectStorageProvider provider,
    string containerName)
{
    while (true)
    {
        // The only reliable way to determine if a container is empty is
        // to list its objects
        ContainerObject[] objects = provider.ListObjects(containerName);
        if (!objects.Any())
            break;
        // the iterations of this loop should be executed concurrently.
        // depending on connection speed, expect to use 25 to upwards of 300
        // concurrent connections for best performance.
        foreach (ContainerObject obj in objects)
        {
            try
            {
                provider.DeleteObject(containerName, obj.Name);
            }
            catch (ItemNotFoundException)
            {
                // a 404 can happen if the object was deleted on a previous iteration,
                // but the internal database did not fully synchronize prior to calling
                // List Objects again.
            }
        }
    }
    provider.DeleteContainer(containerName);
}

NSFetchedResultsController inserts lock up the UI

I am building a chat application using web-sockets and core-data.
Basically, whenever a message is received on the web-socket, the following happens:
1. Check if the message exists by performing a core-data fetch using the id (indexed).
2. If 1. returns yes, update the message and perform a core-data save; if 1. returns no, create the message and perform a core-data save.
3. Update the table view by updating or inserting rows.
Here's my setup:
I have 2 default managed-object contexts: MAIN (NSMainQueueConcurrencyType) and WRITER (NSPrivateQueueConcurrencyType). WRITER has a reference to the persistent store coordinator, MAIN does not, but WRITER is set as MAIN's parent.
The table view is connected to an NSFetchedResultsController, which is connected to MAIN.
Fetches are all performed using temporary contexts ("performBlock:") that have MAIN as their parent. Writes look like this: Save temporary context, then save MAIN, then save WRITER.
Problem:
Because the updates come in via web-socket, in a busy chat-room, many updates happen in a short time. Syncs to fetch older messages can mean many messages coming in rapidly. And this locks up the UI.
I track the changes to the ui using fetched-results-controller's delegate like this:
// called on main thread
- (void)controllerWillChangeContent:(NSFetchedResultsController *)controller
{
NSLog(#"WILL CHANGE CONTENT");
[_tableView beginUpdates];
}
// called on main thread
- (void)controllerDidChangeContent:(NSFetchedResultsController *)controller
{
NSLog(#"DID CHANGE CONTENT");
[_tableView endUpdates];
}
and here's an example of what I see in the Log-file:
2014-07-14 18:46:20.630 AppName[4938:60b] DID CHANGE CONTENT
2014-07-14 18:46:22.334 AppName[4938:60b] WILL CHANGE CONTENT
That's almost 2 seconds per insert!
Is it simply a limitation I'm hitting here with table views? (I'm talking about 1000+ rows in some cases.) But I can't imagine that's the case; UITableViews are super-optimized for that sort of operation.
Any obvious newbie mistake I might be committing?
This is not logical:
Writes look like this: Save temporary context, then save MAIN, then save WRITER.
If WRITER is a child context of MAIN, the changes are not persisted until the next save.
Ok I figured it out.
The problem is tableView:heightForRowAtIndexPath:. The fetches to calculate the height of the rows take time, and each time tableView.endUpdates gets called, the UITableView needs the heights for ALL rows.
tableView:estimatedHeightForRowAtIndexPath: is a possible way to go (iOS 7+), or I might opt for caching the heights myself (since the rows don't change) or just displaying fewer rows altogether.
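For illustration, a minimal sketch of the estimated-height route (assuming a roughly constant row height of 60 points; adjust to whatever your cells average):

// Returning a cheap estimate here means endUpdates no longer has to compute
// the exact height of every row up front; heightForRowAtIndexPath: is still
// used when a row actually comes on screen.
- (CGFloat)tableView:(UITableView *)tableView estimatedHeightForRowAtIndexPath:(NSIndexPath *)indexPath
{
    return 60.0;
}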
