Delete a huge amount of files in Rackspace using fog - ruby

I have millions of files in my Rackspace Files. I would like to delete a part of them, passing lists of file names instead of deleting one by one, which is very slow. Is there any way to do this with fog? Right now, I have a script to delete each file, but would be nice to have something with better performance.
connection = Fog::Storage.new({
:provider => 'Rackspace',
:rackspace_username => "xxxx",
:rackspace_api_key => "xxxx",
:rackspace_region => :iad
})
dir = connection.directories.select {|d| d.key == "my_directory"}.first
CloudFileModel.where(duplicated: 1).each do |record|
f = record.file.gsub("/","")
dir.files.destroy(f) rescue nil
puts "deleted #{record.id}"
end

Yes, you can with delete_multiple_objects.
Deletes multiple objects or containers with a single request.
To delete objects from a single container, container may be provided and object_names should be an Array of object names within the container.
To delete objects from multiple containers or delete containers, container should be nil and all object_names should be prefixed with a container name.
Containers must be empty when deleted. object_names are processed in the order given, so objects within a container should be listed first to empty the container.
Up to 10,000 objects may be deleted in a single request. The server will respond with 200 OK for all requests. response.body must be inspected for actual results.
Examples:
Delete objects from a container
object_names = ['object', 'another/object']
conn.delete_multiple_objects('my_container', object_names)
Delete objects from multiple containers
object_names = ['container_a/object', 'container_b/object']
conn.delete_multiple_objects(nil, object_names)
Delete a container and all it's objects
object_names = ['my_container/object_a', 'my_container/object_b', 'my_container']
conn.delete_multiple_objects(nil, object_names)

To my knowledge, the algorithm included here is the most reliable and highest-performance algorithm for deleting a Cloud Files container along with any objects it contains. The algorithm could be modified for your purposes by including a parameter with the names of items to delete instead of calling ListObjects. At the time of this writing, there is no server-side functionality (i.e. bulk operation) capable of meeting your needs in a timely manner. Bulk operations are rate limited to 2-3 delete operations per second, so at least 55 minutes per 10,000 items you delete.
The following code shows the basic algorithm (slightly simplified from the syntax that is actually required in the .NET SDK). It assumes that no other clients are adding objects to the container at any point after execution of this method begins.
Note that you will be rate limited to a maximum of 100 delete operations per second per container which contains files. If multiple containers are involved, distribute your concurrent requests to round-robin the requests to each of the containers. Adjust your concurrency level to the value that approaches the hard rate limit. Using this algorithm has allowed me to reach long-term sustained deletion rates of over 450 objects/second when multiple containers were involved.
public static void DeleteContainer(
IObjectStorageProvider provider,
string containerName)
{
while (true)
{
// The only reliable way to determine if a container is empty is
// to list its objects
ContainerObject[] objects = provider.ListObjects(containerName);
if (!objects.Any())
break;
// the iterations of this loop should be executed concurrently.
// depending on connection speed, expect to use 25 to upwards of 300
// concurrent connections for best performance.
foreach (ContainerObject obj in objects)
{
try
{
provider.DeleteObject(containerName, obj.Name);
}
catch (ItemNotFoundException)
{
// a 404 can happen if the object was deleted on a previous iteration,
// but the internal database did not fully synchronize prior to calling
// List Objects again.
}
}
}
provider.DeleteContainer(containerName);
}

Related

Listing and Deleting Data from DynamoDB in parallel

I am using Lambdas and SQS queue to delete the data from DynamoDB. Earlier when I was developing this I found that the only way to delete data from DyanmoDB is to gather the data you want to delete and deleting them in Batches.
At my current organization, most of the infrastructure is in serverless. Hence, I decided to make this piece following serverless and event driven architecture as well.
In a nutshell, I post a message on the SQS queue to delete items under particular partition. Once this message invokes my Lambda, I perform the listing call to DyanmoDB for 1000 items and do the following:
Grab the cursor from this listing call, and post another message to grab next 1000 items from this cursor.
import { DynamoDBClient } from '#aws-sdk/client-dynamodb';
const dbClient = new DynamoDBClient(config);
const records = dbClient.query(...fetchFirst1000ItemsForPrimaryKey);
postMessageToFetchNextItems();
From the fetched 1000 items:
I create a batches of 20 items, and issue set of messages for another lambda to delete these items. A batch of 20 items is posted for deletion until all 1000 have been posted for deletion.
for (let i = 0; i < 1000; i += 20) {
const itemsToDelete = records.slice(i, 20);
postItemsForDeletion(itemsToDelete);
}
Another lambda gets these items and just deletes them:
dbClient.send(new BatchWriteItemCommand([itemsForDeletion]))
The listing lambda receives call to read items from next cursor and the above steps ge t repeated.
This all happens in parallel. Get items, post message to grab next 1000 items, post messages for deletion of items.
While looking good on paper, this doesn't seem to delete all records from DynamoDB. There is no set pattern, there are always some items that remain in the DynamoDB. I am not entirely sure what could be happening but have a theory that parallel deletion and listing could be something that is causing the issue?
I was unable to find any documentation to verify my theory and hence this question here.
A batch write items call will return a list of unprocessed items. You should check for that and retry them.
Look at the docs for https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-dynamodb/classes/batchwriteitemcommand.html and seach for UnprocessedItems.
Fundamentally, a batch write items call is not a transactional write. It's possible for some item writes to succeed while others fail. It's on you to check for failures and retry them. I'm sorry I don't have a link for good sample code.

Biztalk Debatched Message Value Caching

I get a file with 4000 entries and debatch it, so i dont lose the whole message if one entry has corrupting data.
The Biztalkmap is accessing an SQL server, before i debatched the Message I simply cached the SLQ data in the Map, but now i have 4000 indipendent maps.
Without caching the process takes about 30 times longer.
Is there a way to cache the data from the SQL Server somewhere out of the Map without losing much Performance?
It is not a recommendable pattern to access a database in a Map.
Since what you describe sounds like you're retrieving static reference data, another option is to move the process to an Orchestration where the reference data is retrieved one time into a Message.
Then, you can use a dual input Map supplying the reference data and the business message.
In this patter, you can either debatch in the Orchestration or use a Sequential Convoy.
I would always avoid accessing SQL Server in a map - it gets very easy to inadvertently make many more calls than you intend (whether because of a mistake in the map design or because of unexpected volume or usage of the map on a particular port or set of ports). In fact, I would generally avoid making any kind of call in a map that has to access another system or service, but if you must, then caching can help.
You can cache using, for example, MemoryCache. The pattern I use with that generally involves a custom C# library where you first check the cache for your value, and if there's a miss you check SQL (either for the paritcular entry or the entire cache, e.g.:
object _syncRoot = new object();
...
public string CheckCache(string key)
{
string check = MemoryCache.Default.Get(key) as string;
if (check == null)
{
lock (_syncRoot)
{
// make sure someone else didn't get here before we acquired the lock, avoid duplicate work
check = MemoryCache.Default.Get(key) as string;
if (check != null) return check;
string sql = #"SELECT ...";
using (SqlConnection conn = new SqlConnection(connStr))
{
conn.Open();
using (SqlCommand cmd = conn.CreateCommand())
{
cmd.CommandText = sql;
cmd.Parameters.AddWithValue(...);
// ExecuteScalar or ExecuteReader as appropriate, read values out, store in cache
// use MemoryCache.Default.Add with sensible expiration to cache your data
}
}
}
}
else
{
return check;
}
}
A few things to keep in mind:
This will work on a per AppDomain basis, and pipelines and orchestrations run on separate app domains. If you are executing this map in both places, you'll end up with caches in both places. The complexity added in trying to share this accross AppDomains is probably not worth it, but if you really need that you should isolate your caching into something like a WCF NetTcp service.
This will use more memory - you shouldn't just throw everything and anything into a cache in BizTalk, and if you're going to cache stuff make sure you have lots of available memory on the machine and that BizTalk is configured to be able to use it.
The MemoryCache can store whatever you want - I'm using strings here, but it could be other primitive types or objects as well.

Waiting for Realm writes to be completed

We are using Realm in a Xamarin app and have some issues refreshing the local database based on a remote source. Data is fetched from a remote endpoint and stored locally using Realm for easier/faster access.
Program flow is as follows:
Fetch data from remote source (if possible).
Loop through the entities returned by the remote source while keeping track of the IDs we've seen so far. New or updated entities are written to Realm.
Loop through the set of locally stored entities, removing entities we haven't seen in step 2 with Realm.Remove(entity); (in a transaction)
Return Realm.All<Entity>();
Unfortunately, the entities are returned by step 4 before all "remove" operations have been written. As a result, it takes a couple of refreshes before the local database is completely in sync.
The remove operation is done as follows:
foreach (Entity entity in realm.All<Entity>())
{
if (seenIds.Contains(entity.Id))
{
continue;
}
realm.Write(() => {
realm.Remove(entity);
});
}
Is there a way to have Realm wait till the transaction is completed, before returning the Realm.All<Entity>();?
I am pretty sure this is not particularly a Realm issue - the same pattern would cause problems with a lot of enumerable, mutable containers. You are removing items from a list whilst iterating it so enumeration is moving on too far.
There is no buffering on Realm transactions so I guarantee it is not about have Realm wait till the transaction is completed but is your list logic.
There are two basic ways to do this differently:
Use ToList to get a list of all objects from the All - this is expensive if many objects because you will instantiate all the objects.
Instead of removing objects inside the loop, add them to a list of items to be removed then iterate that list.
Note that using a transaction per-remove, as you are doing with Write here is relatively slow. You can do many operations in one transaction.
We are also working on other improvements to the Realm API that might give a more efficient way of handling this. It would be very helpful to know the relative data sizes - the number of removals vs records in the loop. We love getting sample data and schemas (can send privately to help#realm.io).
an example of option 2:
var toDelete = new List<Entity>();
foreach (Entity entity in realm.All<Entity>())
{
if (!seenIds.Contains(entity.Id))
toDelete.Add(entity);
}
realm.Write(() => {
foreach (Entity entity in toDelete))
realm.Remove(entity);
});

IBM Integration Bus, best practices for calling multiple services

So I have this requirement, that takes in one document, and from that needs to create one or more documents in the output.
During the cause of this, it needs to determine if the document is already there, because there are different operations to apply for create and update scenarios.
In straight code, this would be simple (conceptually)
InputData in = <something>
if (getItemFromExternalSystem(in.key1) == null) {
createItemSpecificToKey1InExternalSystem(in.key1);
}
if (getItemFromExternalSystem(in.key2) == null) {
createItemSpecificToKey2InExternalSystem(in.key1, in.key2);
}
createItemFromInput(in.key1,in.key2, in.moreData);
In effect a kind of "ensure this data is present".
However, in IIB How would i go about achieving this? If i used a subflow for the Get/create cycle, the output of the subflow would be whatever the result of the last operation is, is returned from the subflow as the new "message" of the flow, but really, I don't care about the value from the "ensure data present" subflow. I need instead to keep working on my original message, but still wait for the different subflows to finish before i can run my final "createItem"
You can use Aggregation Nodes: for example, use 3 flows:
first would be propagate your original message to third
second would be invoke operations createItemSpecificToKey1InExternalSystem and createItemSpecificToKey2InExternalSystem
third would be aggregate results of first and second and invoke createItemFromInput.
Have you considered using the Collector node? It will collect your records into N 'collections', and then you can iterate over the collections and output one document per collection.

How to insert a batch of records into Redis

In a twitter-like application, one of the things they do is when someone posts a tweet, they iterate over all followers and create a copy of the tweet in their timeline. I need something similar. What is the best way to insert a tweet ID into say 10/100/1000 followers assuming I have a list of follower IDs.
I am doing it within Azure WebJobs using Azure Redis. Each webjob is automatically created for every tweet received in the queue. So I may have around 16 simultaneous jobs running at the same time where each one goes through followers and inserts tweets.I'm thinking if 99% of inserts happen, they should not stop because one or a few have failed. I need to continue but log it.
Question: Should I do CreateBatch like below? If I need to retrieve latest tweets first in reverse chronological order is below fine? performant?
var tasks = new List<Task>();
var batch = _cache.CreateBatch();
//loop start
tasks.Add(batch.ListRightPushAsync("follower_id", "tweet_id"));
//loop end
batch.Execute();
await Task.WhenAll(tasks.ToArray());
a) But how do I catch if something fails? try catch?
b) how do I check in a batch for a total # in each list and pop one out if it reaches a certain #? I want to do a LeftPop if the list is > 800. Not sure how to do it all inside the batch.
Please point me to a sample or let me have a snippet here. Struggling to find a good way. Thank you so much.
UPDATE
Does this look right based on #marc's comments?
var tasks = new List<Task>();
followers.ForEach(f =>
{
var key = f.FollowerId;
var task = _cache.ListRightPushAsync(key, value);
task.ContinueWith(t =>
{
if (t.Result > 800) _cache.ListLeftPopAsync(key).Wait();
});
tasks.Add(task);
});
Task.WaitAll(tasks.ToArray());
CreateBatch probably doesn't do what you think it does. What it does is defer a set of operations and ensure they get sent contiguously relative to a single connection - there are some occasions this is useful, but not all that common - I'd probably just send them individually if it was me. There is also CreateTransaction (MULTI/EXEC), but I don't think that would be a good choice here.
That depends on whether you care about the data you're popping. If not: I'd send a LTRIM, [L|R]PUSH pair - to trim the list to (max-1) before adding. Another option would be Lua, but it seems overkill. If you care about the old data, you'll need to do a range query too.

Resources