Clean Up Azure Machine Learning Blob Storage - azure-blob-storage

I manage a heavily used Azure Machine Learning workspace with several experiments and active pipelines. Everything is working well so far. My problem is getting rid of old data from runs, experiments and pipelines. Over the last year the blob storage has grown to an enormous size, because the data from every pipeline run is stored.
I have deleted older runs from experiments using the GUI, but the actual pipeline data on the blob store is not deleted. Is there a smart way to clean up data on the blob store from runs which have been deleted?
On one of the countless Microsoft support pages, I found the following, not very helpful, post:
*Azure does not automatically delete intermediate data written with OutputFileDatasetConfig. To avoid storage charges for large amounts of unneeded data, you should either:
Programmatically delete intermediate data at the end of a pipeline run, when it is no longer needed
Use blob storage with a short-term storage policy for intermediate data (see Optimize costs by automating Azure Blob Storage access tiers)
Regularly review and delete no-longer-needed data*
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#delete-outputfiledatasetconfig-contents-when-no-longer-needed
Any idea is welcome.

Have you tried applying an Azure Storage account management policy to that storage account?
You could either change the tier of the blobs from hot -> cold -> archive and thereby reduce costs, or configure an auto-delete policy that removes them after a set number of days.
Reference : https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview#sample-rule
If you use Terraform to manage your resources, this should be available as well
Reference : https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy
resource "azurerm_storage_management_policy" "example" {
  storage_account_id = "<azureml-storage-account-id>"

  rule {
    name    = "rule2"
    enabled = true # with enabled = false the rule exists but never runs
    filters {
      prefix_match = ["pipeline"]  # each prefix must begin with the container name
      blob_types   = ["blockBlob"] # required by the provider
    }
    actions {
      base_blob {
        delete_after_days_since_modification_greater_than = 90
      }
    }
  }
}
A similar option is available via the portal settings as well.
Hope this helps!
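If a lifecycle rule is too coarse, the "programmatically delete" option from the quoted docs could look roughly like the Python sketch below. It assumes the azure-storage-blob package; the function names, the dry_run flag and the age threshold are illustrative choices, not anything prescribed by Azure ML. The selection logic is kept separate so it can be checked without touching a real storage account.

```python
from datetime import datetime, timedelta, timezone


def select_stale_blobs(blobs, max_age_days, now=None):
    """Return the names of blobs last modified more than max_age_days ago.

    `blobs` is any iterable of objects with `.name` and a timezone-aware
    `.last_modified`, which is what ContainerClient.list_blobs() yields.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [b.name for b in blobs if b.last_modified < cutoff]


def delete_stale_blobs(connection_string, container_name, prefix,
                       max_age_days, dry_run=True):
    """List blobs under `prefix` and delete the stale ones (hypothetical helper)."""
    # Imported lazily so the selection logic above works without the SDK.
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_connection_string(
        connection_string, container_name
    )
    stale = select_stale_blobs(
        container.list_blobs(name_starts_with=prefix), max_age_days
    )
    for name in stale:
        if dry_run:
            print("would delete", name)
        else:
            container.delete_blob(name)
    return stale
```

Running it with dry_run=True first shows what would be removed before anything is actually deleted.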

Currently facing this exact problem. The most sensible approach is to enforce retention schedules at the storage account level. These are the steps you can follow:
Identify which storage account is linked to your AML instance and pull it up in the Azure portal.
Under Settings / Configuration, ensure you are using StorageV2 (which has the desired functionality).
Under Data management / Lifecycle management, create a new rule that targets your problem containers.
NOTE - I do not recommend a blanket enforcement policy against the entire storage account, because any registered datasets, models, compute info, notebooks, etc. will all be targeted for deletion as well. Instead, use the prefix arguments to declare relevant paths such as: storageaccount1234 / azureml / ExperimentRun
Here is the documentation on Lifecycle management:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview?tabs=azure-portal
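The rule described in the steps above can also be written as a lifecycle policy document. The sketch below builds one in Python; the rule name and the 90-day threshold are assumptions (90 is borrowed from the Terraform answer), and the prefix is the example path from the note above, written as container name followed by folder.

```python
import json


def build_delete_rule(name, prefixes, days):
    """Build one lifecycle rule that deletes block blobs unmodified for `days` days."""
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                # Each prefix starts with the container name.
                "prefixMatch": list(prefixes),
            },
            "actions": {
                "baseBlob": {
                    "delete": {"daysAfterModificationGreaterThan": days}
                }
            },
        },
    }


# Hypothetical rule name; the prefix is the example path from the answer.
policy = {
    "rules": [
        build_delete_rule("deleteOldExperimentRuns", ["azureml/ExperimentRun"], 90)
    ]
}
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the code view of the Lifecycle management blade, or saved to a file and passed to az storage account management-policy create with --policy @policy.json.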

Related

Azure Python SDK: How to delete a list of resources that have interdependencies?

I have some code that uses the Python Azure SDK to deploy a virtual machine within a resource group. I manually provision each resource in order (a vnet and subnet if necessary, a public IP address, a NIC, and finally the VM itself).
Now, when I want to delete the VM, I can query the list of resources within the resource group and filter that list in my code to match only those resources which have a tag with the matching value.
The problem is that you can't just arbitrarily delete resources that have dependencies. For example, I cannot delete the NIC because it is in use by the virtual machine; I can't delete the OS disk because it's also in use by the VM; I can't delete the public IP address because it's assigned to the NIC; etc.
In the Azure portal you can check off a list of resources and ask the portal to delete all of them, and it handles any resource inter-dependencies for you, but it looks like this is not possible from the SDK.
Right now my only solution is to be fully aware of the path of resource creation and dependency within my code itself. I have to work backwards - first, search the list for VMs with the right tag, delete them, then search for disks with the tag, delete them, NICs, and so on down the line. But this has a lot of room for error and is not in any way reusable for other types of resources.
The only other alternative I can think of is "try to delete it and handle errors" but there's a lot of ugly edge cases I could see happening here and I'd rather take a less haphazard way of handling this, especially since we're deleting things.
TL;dr: Is there a proper way to take a list of resources and query Azure to determine which other resources depend on them? (This could be done one resource at a time but it would still be best to have it be "generic" - i.e. able to do this for any resource without necessarily knowing that resource's type up front).
The resource group contains other resources as well which are related to the same project (e.g. other VMs, a storage account, etc.) so deleting an entire resource group is NOT an option.
One workaround is to use tags with the Azure CLI, called here from PowerShell. Add a tag to the resources you want to delete, then use the commands below to delete them in bulk.
$resources = az resource list --tag Key=Value | ConvertFrom-Json
foreach ($resource in $resources) {
    az resource delete --resource-group $resource.resourceGroup --ids $resource.id --verbose
}
This will delete the resources regardless of the location or the resource group in which they were created.
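The loop above deletes resources in whatever order the list returns them, so dependency errors are still possible. The "work backwards" idea from the question could be sketched in Python as a sort by resource type before deleting. The order below is an assumption covering only the VM case described in the question, and unknown types are simply pushed to the end; it is not a generic dependency query against Azure.

```python
# Hypothetical deletion order: dependents first, the things they rely on last.
DELETE_ORDER = [
    "Microsoft.Compute/virtualMachines",
    "Microsoft.Network/networkInterfaces",
    "Microsoft.Network/publicIPAddresses",
    "Microsoft.Compute/disks",
    "Microsoft.Network/networkSecurityGroups",
    "Microsoft.Network/virtualNetworks",
]


def sort_for_deletion(resources):
    """Sort resource dicts (with a "type" key) so dependents come first.

    Types missing from DELETE_ORDER are placed last, keeping their
    original relative order because sorted() is stable.
    """
    rank = {t.lower(): i for i, t in enumerate(DELETE_ORDER)}
    return sorted(
        resources,
        key=lambda r: rank.get(r["type"].lower(), len(DELETE_ORDER)),
    )


# Shaped like the JSON objects printed by `az resource list`.
tagged = [
    {"id": "nic1", "type": "Microsoft.Network/networkInterfaces"},
    {"id": "disk1", "type": "Microsoft.Compute/disks"},
    {"id": "vm1", "type": "Microsoft.Compute/virtualMachines"},
    {"id": "ip1", "type": "Microsoft.Network/publicIPAddresses"},
]
for resource in sort_for_deletion(tagged):
    print(resource["id"])
```

Each delete still has to finish before the next one starts, so in practice every call would wait for completion before moving on.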

Azure cognitive search indexer blob storage

I am stuck in a complicated situation and would appreciate it if somebody could help.
I was testing the indexing of blob storage (PDF files) and indexed a copy of my storage in the QA environment, which cost me some money.
My question is:
Is there any way to use this index in production without indexing again?
I found a way to copy the index, and that works fine, but when I add an indexer that is connected to the production blob storage it starts indexing from scratch again (as I expected). Is there any way to avoid this? Is there any way to ask the indexer to index only from now on?
I tried to use the index and the indexer that I already have by changing the subscription to prod. But I have to change the data source for indexer to point at production blob storage and in this case I get an error :
Indexer 'filesIndexer' currently references data source 'qafilesds' and cannot be updated to reference a different datasource 'prodfilesds' because it has a non-empty change tracking state, or it is currently in progress. You can use Reset API to reset the indexer's change tracking state when it is no longer in progress, and retry this call.
A simple answer to your first question is to simply use the QA index you built.
A more complicated answer is to switch from the pull model you are using now to a push model. From your explanation above I assume all of your content comes from blob storage, and you have configured an indexer to do the indexing for you. This is known as the pull model.
The alternative is to use the Azure Cognitive Search SDK to write your own application that submits content to the index instead. In this case you do not use the built-in indexer, only the index itself. Then you are free to use whatever logic you want to determine what to index and what to skip. You can even enable your storage accounts to notify your application with events when content is updated.
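A minimal Python sketch of that push approach follows, assuming the azure-search-documents package. The field names id and content, the shape of the blob dictionaries, and the idea that the PDF text has already been extracted are all assumptions for illustration; a copied index would need documents that match its actual schema and key encoding.

```python
import base64
from datetime import datetime, timezone


def make_key(blob_path):
    """Encode a blob path into a URL-safe string usable as a document key."""
    return base64.urlsafe_b64encode(blob_path.encode("utf-8")).decode("ascii")


def documents_since(blobs, cutoff):
    """Build index documents only for blobs modified at or after `cutoff`.

    Each blob is a dict with "path", "text" and "last_modified" keys
    (a hypothetical shape; text extraction is out of scope here).
    """
    return [
        {"id": make_key(b["path"]), "content": b["text"]}
        for b in blobs
        if b["last_modified"] >= cutoff
    ]


def push_documents(endpoint, index_name, api_key, documents):
    """Send documents to an existing index without using an indexer."""
    # Imported lazily so the helpers above work without the SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    client = SearchClient(endpoint, index_name, AzureKeyCredential(api_key))
    return client.merge_or_upload_documents(documents=documents)
```

Choosing the cutoff as the moment the QA index was built is one way to get the "index from now on" behaviour asked about in the question.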

Azure Blob Storage lifecycle management - send report or log after run

I am considering using Azure Blob Storage's built-in lifecycle management feature for deleting blobs of a certain age.
However, due to a business requirement, it must be possible to generate a report or log statement after each daily execution of the defined ruleset. The report or log must state the number of blob blocks that were affected, e.g. deleted during the run.
I have read through the documentation and Googled to see if others have had similar inquiries, but so far without any luck.
So my question: does anyone know whether, and how, I can get the built-in lifecycle management system to do one of the following after each daily run:
Add a log statement to the storage account containing the Blob storage.
Generate and send a report to an endpoint I define.
If the above can't be done I will have to code the daily deletion job and report generation myself, which surely I can do, but I would like to use the built-in feature if possible.
I summarize the solution below.
If you want to know which blobs are deleted every day, you can configure diagnostic settings on the storage account. After doing that, you will get logs for read, write, and delete requests against the blob service. For more detail, see the Azure Storage Analytics logging documentation.
To enable it, you can use the PowerShell command Set-AzStorageServiceLoggingProperty.
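Once logging is enabled, the daily report could be produced by counting delete entries in the log files. The Python sketch below assumes the classic Storage Analytics log format (semicolon-delimited, with the request start time, operation type and request status as the second, third and fourth fields) and assumes the deletions show up as DeleteBlob operations. Whether deletes performed by a lifecycle rule are logged this way should be verified against real log output.

```python
from collections import Counter


def count_deletes_per_day(log_lines):
    """Count successful DeleteBlob entries per UTC date.

    Assumes each line looks like:
    <version>;<request-start-time>;<operation-type>;<request-status>;...
    """
    counts = Counter()
    for line in log_lines:
        fields = line.strip().split(";")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        start_time, operation, status = fields[1], fields[2], fields[3]
        if operation == "DeleteBlob" and status.endswith("Success"):
            counts[start_time[:10]] += 1  # keep only YYYY-MM-DD
    return dict(counts)


# Made-up sample lines, truncated after the status code for brevity.
sample = [
    "1.0;2024-01-02T00:00:01.0000000Z;DeleteBlob;Success;202",
    "1.0;2024-01-02T00:00:02.0000000Z;GetBlob;Success;200",
    "1.0;2024-01-03T00:00:01.0000000Z;DeleteBlob;Success;202",
]
print(count_deletes_per_day(sample))
```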

Retention policy to TFS Code Search Server (Elastic Search)

We have TFS 2017.3 with a separate Code Search server.
We have a huge TFS DB (about 1.6TB), and the code search server has 700GB of disk space.
After a few weeks the disk ran out of space and code search stopped working in TFS.
After we increased the disk space, search started working again.
How can we set up a retention policy to delete old code search data (the index)? We don't want to increase the disk space any further.
Search indexing (Code and Work Item) works in 2 phases:
Bulk Indexing (BI), where the entire code and work item artifacts in all projects/repositories under a Collection are indexed. This is a time-consuming operation and depends on the size of the artifacts under the collection.
Continuous Indexing (CI), which handles all incremental updates to the artifacts (add/update/delete) and indexes them. This is a notification-based model where the indexer listens to TFS events and operates based on those event notifications. CI handles almost all update operations, including CRUD operations at the Project/Repository/Collection layer (such as Repository renames, Project adds/deletes, etc.). The operation time for CI again depends on the size of the incremental update. BI always precedes CI, i.e. CI will never execute on a project/repository until BI is completed for it.
To clean up the index data and re-index, follow the steps below:
Pause indexing for all collections. Run the following script on the TFS Configuration DB: https://github.com/Microsoft/Code-Search/blob/master/PauseIndexing.ps1
Log in to the machine where Elasticsearch (ES) is running.
Stop the ES service.
Delete the entire Search Index folder (something like C:\TfsData\Search\IndexStore, or wherever you had configured it to be).
Restart the TFS Job Agent service(s) on the AT machines.
Delete all rows from the following tables in each of the collection DBs:
DELETE FROM [Search].[tbl_IndexingUnit]
DELETE FROM [Search].[tbl_IndexingUnitChangeEvent]
DELETE FROM [Search].[tbl_IndexingUnitChangeEventArchive]
DELETE FROM [Search].[tbl_JobYield]
DELETE FROM [Search].[tbl_TreeStore]
DELETE FROM [Search].[tbl_DisabledFiles]
DELETE FROM [Search].[tbl_ResourceLockTable]
Restart the ES service.
Run this script on the TFS Configuration DB: https://github.com/Microsoft/Code-Search/blob/master/ResumeIndexing.ps1
Run this script (pick from the correct TFS release folder) on each of the collections: https://github.com/Microsoft/Code-Search/blob/master/TFS_2017Update2/MissingIndexFolderTriggerCollectionIndexing.ps1
Try the last script on a smaller collection first (one with fewer repositories) so that you can verify that indexing happened correctly and the results are queryable.
For more details, refer to this MSDN blog post: Resetting Search Index in Team Foundation Server
I was able to reduce the disk usage by deleting the ES folders and reinstalling the code search extension; sometimes I also had to run MissingIndexFolderTriggerCollectionIndexing.ps1.
But I came to the conclusion that it was not worth doing: the index grew rapidly back to its original size, so I did not save anything.
Although Microsoft recommends disk space of 35% of the DB size, that is not enough for us, and we increase the disk size whenever it fills up (currently about 45% of the DB size).
The conclusion: don't touch ES; if the disk fills up, increase the disk size.
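For scale, here is what those two percentages mean for the 1.6TB database from the question, as a small Python calculation. It takes 1TB as 1000GB, which is an assumption about how the sizes were quoted.

```python
def disk_size_gb(db_size_gb, percent_of_db):
    """Disk size in GB for an index sized as a percentage of the DB."""
    return db_size_gb * percent_of_db / 100


db_size_gb = 1600  # 1.6TB from the question, with 1TB taken as 1000GB

print(disk_size_gb(db_size_gb, 35))  # 560.0, the recommended 35%
print(disk_size_gb(db_size_gb, 45))  # 720.0, the roughly 45% reported above
```

The 700GB disk from the question falls between these two figures.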

How to give RW permissions on a folder in Windows Azure?

I am deploying an MVC 3.0 web app to Windows Azure. I have an action method that takes a file uploaded by the user and stores it in a folder within my web app.
How could I give RW permissions to that folder to the running process? I read about start up tasks and have a basic understanding, but I wouldn't know,
How to give the permission itself, and
Which running process (user) should I give the permission to.
Many thanks for the help.
EDIT
In addition to @David's answer below, I found this link extremely useful:
https://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/
For local storage, I wouldn't get caught up with granting access permissions to various directories. Instead, take advantage of the storage resources available specifically to your running VMs. With a given instance size, you have local storage available to you ranging from 20GB to almost 2TB. To take advantage of this space, you'd create a local storage resource within your project (named moreStorage in the code below).
Then, in code, grab the root path of that storage:
var storageRoot = RoleEnvironment.GetLocalResource("moreStorage").RootPath;
Now you're free to use that storage. And... none of that requires any startup tasks or granting of permissions.
Now for the caveat: This is storage that's local to each running instance, and isn't shared between instances. Further, it's non-durable - if the disk crashes, the data is gone.
For persistent, durable file storage, Blob Storage is a much better choice, as it's durable (triple-replicated within the datacenter, and geo-replicated to another datacenter) and it's external to your role instances, accessible from any instance (or any app, including your on-premises apps).
Since blob storage is organized by container, and blobs within container, it's fairly straightforward to organize your blobs (and store pretty much anything in a given blob, up to 200GB each). Also, it's trivial to upload/download files to/from blobs, either to file streams or local files (in the storage resources you allocated, as illustrated above).
