Azure Cognitive Search indexer blob storage

I am stuck in a complicated situation and would appreciate it if somebody could help.
I was testing indexing of blob storage (PDF files) and indexed a copy of my storage in the QA environment, which cost me some money.
My question is:
Is there any way to use this index in production without indexing again?
I found a way to copy the index, and that works fine, but when I add an indexer that is connected to the production blob storage it starts indexing from scratch again (as I expected). Is there any way to avoid this? Is there any way to ask the indexer to index only from now on?
I tried to reuse the index and the indexer that I already have by changing the subscription to prod. But I have to change the indexer's data source to point at the production blob storage, and in that case I get an error:
Indexer 'filesIndexer' currently references data source 'qafilesds' and cannot be updated to reference a different datasource 'prodfilesds' because it has a non-empty change tracking state, or it is currently in progress. You can use Reset API to reset the indexer's change tracking state when it is no longer in progress, and retry this call.

A simple answer to your first question is to simply use the QA index you built.
A more involved answer is to switch from the pull model you are using now to a push model. From your explanation I assume all of your content comes from blob storage and that you have configured an indexer to do the indexing for you; this is known as the pull model.
The alternative is to use the Azure Cognitive Search SDK to write your own application that submits content to the index instead. In that case you do not use the built-in indexer, only the index itself, and you are free to use whatever logic you want to decide what to index and what to skip. You can even have your storage accounts notify your application with events when content is added or updated.
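As a rough illustration of the push model, here is a minimal sketch using the azure-search-documents Java SDK. The endpoint, admin key, index name "files-index" and the document fields ("id", "content") are placeholders; use whatever your copied index actually defines.

import com.azure.core.credential.AzureKeyCredential;
import com.azure.search.documents.SearchClient;
import com.azure.search.documents.SearchClientBuilder;

import java.util.List;
import java.util.Map;

public class PushToIndex {
    public static void main(String[] args) {
        // Placeholder service details; substitute your own search service and admin key.
        SearchClient client = new SearchClientBuilder()
                .endpoint("https://<your-search-service>.search.windows.net")
                .credential(new AzureKeyCredential("<admin-api-key>"))
                .indexName("files-index")   // the index you copied from QA
                .buildClient();

        // Each map must match the fields defined in your index schema.
        List<Map<String, Object>> docs = List.of(
                Map.of("id", "1", "content", "extracted text of the first PDF"));

        // Push (upload) the documents directly; no indexer or change tracking involved.
        client.uploadDocuments(docs);
    }
}

With this approach your application decides when a blob is new enough to need indexing, so nothing is re-indexed from scratch when you move to production.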

Related

Clean Up Azure Machine Learning Blob Storage

I manage a frequently used Azure Machine Learning workspace with several experiments and active pipelines. Everything is working well so far. My problem is getting rid of old data from runs, experiments and pipelines. Over the last year the blob storage grew to an enormous size, because every pipeline's data is stored.
I have deleted older runs from experiments by using the GUI, but the actual pipeline data on the blob store is not deleted. Is there a smart way to clean up data on the blob store from runs which have been deleted?
On one of the countless Microsoft support pages, I found the following not very helpful post:
*Azure does not automatically delete intermediate data written with OutputFileDatasetConfig. To avoid storage charges for large amounts of unneeded data, you should either:
Programmatically delete intermediate data at the end of a pipeline run, when it is no longer needed
Use blob storage with a short-term storage policy for intermediate data (see Optimize costs by automating Azure Blob Storage access tiers)
Regularly review and delete no-longer-needed data*
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#delete-outputfiledatasetconfig-contents-when-no-longer-needed
Any idea is welcome.
Have you tried applying an Azure Storage account management policy on the storage account in question?
You could either change the tier of the blobs from hot -> cool -> archive and thereby reduce costs, or even configure an auto-delete policy after a set number of days.
Reference: https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview#sample-rule
If you use Terraform to manage your resources, this is available as well.
Reference: https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy
resource "azurerm_storage_management_policy" "example" {
  storage_account_id = "<azureml-storage-account-id>"

  rule {
    name    = "rule2"
    enabled = true
    filters {
      prefix_match = ["pipeline"]
    }
    actions {
      base_blob {
        delete_after_days_since_modification_greater_than = 90
      }
    }
  }
}
A similar option is available via the portal settings as well.
Hope this helps!
Currently facing this exact problem. The most sensible approach is to enforce retention schedules at the storage account level. These are the steps you can follow:
Identify which storage account is linked to your AML instance and pull it up in the Azure portal.
Under Settings / Configuration, ensure you are using StorageV2 (which has the desired functionality).
Under Data management / Lifecycle management, create a new rule that targets your problem containers.
NOTE - I do not recommend a blanket enforcement policy against the entire storage account, because any registered datasets, models, compute info, notebooks, etc. would all be targets for deletion as well. Instead, use the prefix arguments to declare relevant paths such as: storageaccount1234 / azureml / ExperimentRun
Here is the documentation on Lifecycle management:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview?tabs=azure-portal

Azure Blob Storage lifecycle management - send report or log after run

I am considering using Azure Blob Storage's built-in lifecycle management feature for deleting blobs of a certain age.
However, due to a business requirement, it must be possible to generate a report or log statement after each daily execution of the defined ruleset. The report or log must state the number of blobs that were affected, e.g. deleted, during the run.
I have read through the documentation and Googled to see if others have had similar inquiries, but so far without any luck.
So my question: does any of you know if and how I can get the built-in lifecycle management feature to do one of the following after each daily run:
Add a log statement to the storage account containing the Blob storage.
Generate and send a report to an endpoint I define.
If the above can't be done I will have to code the daily deletion job and report generation myself, which surely I can do, but I would like to use the built-in feature if possible.
I summarize the solution below.
If you want to know which blobs are deleted every day, we can configure diagnostic settings on the storage account. After doing that, we will get logs for the read, write, and delete requests made against the blob service. For more detail, please refer to here and here.
Regarding how to enable it, we can use the PowerShell command Set-AzStorageServiceLoggingProperty.
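Once classic Storage Analytics logging is enabled, the log files land in a hidden container named $logs in the same storage account. As a rough sketch of the reporting side, here is one way to count delete operations for a given day with the azure-storage-blob Java SDK; the connection string, the date path and the ";DeleteBlob;" filter are assumptions based on the classic log format, and you should verify that lifecycle-initiated deletions show up in your logs before relying on this.

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;

public class DailyDeleteReport {
    public static void main(String[] args) {
        BlobServiceClient service = new BlobServiceClientBuilder()
                .connectionString("<storage-connection-string>")   // placeholder
                .buildClient();

        // Classic Storage Analytics writes semicolon-delimited log lines
        // to blobs named like $logs/blob/yyyy/MM/dd/hhmm/000000.log.
        BlobContainerClient logs = service.getBlobContainerClient("$logs");

        long deletes = 0;
        for (BlobItem item : logs.listBlobs()) {
            if (!item.getName().startsWith("blob/2023/01/31/")) {
                continue;   // only inspect one day's blob-service logs (example date)
            }
            String content = logs.getBlobClient(item.getName()).downloadContent().toString();
            for (String line : content.split("\n")) {
                if (line.contains(";DeleteBlob;")) {   // operation-type field of the log entry
                    deletes++;
                }
            }
        }
        System.out.println("Blobs deleted on 2023-01-31: " + deletes);
    }
}

You could run something like this on a daily schedule and post the count to whatever endpoint your report needs to reach.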

Store infrequently changing info in Spring App

I am working on a microservice (Spring Boot) that needs to store some static information that changes infrequently (once per quarter). The data (below) is about company reports and looks like:
reportId#1: "frequency": "daily", "to": "some email ids"
reportId#2: "frequency": "weekly", "to": "some email ids"
As you can see, an entry in the data is basically a report ID, and the associated attributes are the frequency of the report and the receivers' email IDs.
My question is: what is the best place to store this information? I have some thoughts, and here are my views.
a) A NoSQL DB like MongoDB seems to be a good option. I can create a collection, store the data there and retrieve it once during app startup. But then I wondered whether creating a collection just to store this static info is a good choice.
b) Redis seems to be another good option. I can create a template for the above dataset and store it there, then query Redis by reportId to retrieve the frequency and the recipient list.
c) Store it in a file on the classpath and load it at app startup. The downside is that I will have to redeploy the app with the new file whenever the report listing changes. I believe externalizing this information to either Mongo or Redis is a better option.
d) The app is running in AWS, so I could even store this in a file in an S3 bucket.
I would like to know your views.
Since the config will only change once a quarter, the overhead of a database is not required. You should consider Apache Commons Configuration. It will allow you to load config changes from files without the need for an application restart.
http://commons.apache.org/proper/commons-configuration/userguide/howto_reloading.html
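For reference, here is a minimal sketch of the reloading setup with Apache Commons Configuration 2, following the pattern from the user guide linked above. The file name, the property key layout (report.<id>.frequency) and the one-minute refresh interval are arbitrary choices for illustration.

import java.io.File;
import java.util.concurrent.TimeUnit;

import org.apache.commons.configuration2.PropertiesConfiguration;
import org.apache.commons.configuration2.builder.ReloadingFileBasedConfigurationBuilder;
import org.apache.commons.configuration2.builder.fluent.Parameters;
import org.apache.commons.configuration2.reloading.PeriodicReloadingTrigger;

public class ReportConfig {
    private final ReloadingFileBasedConfigurationBuilder<PropertiesConfiguration> builder;

    public ReportConfig(File propertiesFile) {
        Parameters params = new Parameters();
        builder = new ReloadingFileBasedConfigurationBuilder<PropertiesConfiguration>(PropertiesConfiguration.class)
                .configure(params.fileBased().setFile(propertiesFile));

        // Check the file for changes at most once a minute; the next
        // getConfiguration() call after a change picks up the new values.
        new PeriodicReloadingTrigger(builder.getReloadingController(),
                null, 1, TimeUnit.MINUTES).start();
    }

    public String frequencyOf(String reportId) throws Exception {
        // e.g. a line in the properties file: report.1.frequency=daily
        return builder.getConfiguration().getString("report." + reportId + ".frequency");
    }
}

You would expose such a bean in your Spring context and always read through the builder rather than caching the Configuration instance, so quarterly edits to the file are picked up without a redeploy.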

Insert Read-Only Document in Couchbase

Is there any way to insert a read-only document or key-value pair in Couchbase using the Couchbase Go SDK?
There isn't a way to do this at the document level (yet), but one possible workaround with Couchbase Server Enterprise is bucket-level permissions. You could create a bucket (e.g. "myreadonly") and a user (e.g. "myreadonlyuser") that only has the Data Reader permission. Of course, someone will need write access to put the document in there in the first place, but anyone using the "myreadonlyuser" credentials can only read.
There might be a way to do it at the upcoming "scope" and "collection" levels too, but it would likely be a variation of the above approach. Document-level authorization may be on the roadmap for the future.
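The question is about the Go SDK, but the workaround itself is just RBAC setup, so here is a rough sketch of the idea using the Couchbase Java SDK's user manager instead; the bucket name "myreadonly", the user "myreadonlyuser" and all credentials are the assumed names from above.

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.manager.user.Role;
import com.couchbase.client.java.manager.user.User;

public class CreateReadOnlyUser {
    public static void main(String[] args) {
        // Connect with an administrator account (placeholder credentials).
        Cluster cluster = Cluster.connect("couchbase://127.0.0.1", "Administrator", "password");

        // Grant only the data_reader role on the "myreadonly" bucket,
        // so this user can read documents but never write them.
        cluster.users().upsertUser(
                new User("myreadonlyuser")
                        .password("read-only-password")
                        .displayName("Read-only application user")
                        .roles(new Role("data_reader", "myreadonly")));

        cluster.disconnect();
    }
}

A writer process keeps its own credentials with data writer rights, while the read-only consumers connect as "myreadonlyuser".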

Retention policy to TFS Code Search Server (Elastic Search)

We have TFS 2017.3 with a separate Code Search server.
We have a huge TFS DB (about 1.6 TB); on the code search server we have 700 GB of disk space.
After a few weeks the disk space runs out and code search stops working in TFS.
After we increase the disk space, search starts working again.
How can we set up a retention policy to delete old code search data (the index)? We don't want to increase the disk space any further.
Search indexing (Code and Work Item) works in two phases:
Bulk Indexing (BI), where the entire code and work item artifacts in all projects/repositories under a collection are indexed. This is a time-consuming operation and depends on the size of the artifacts under the collection.
Continuous Indexing (CI), which handles all incremental updates to the artifacts (add/update/delete) and indexes them. This is a notification-based model where the indexer listens to TFS events and operates based on those event notifications. CI handles almost all update operations, including CRUD operations at the Project/Repository/Collection layer (such as repository renames, project adds/deletes, etc.). The operation time for CI again depends on the size of the incremental update. BI always precedes CI, i.e. CI will never execute on a project/repository until BI is completed for the same.
To clean up the index data and re-index, follow the steps below:
Pause indexing for all collections by running the following script on the TFS configuration DB:
https://github.com/Microsoft/Code-Search/blob/master/PauseIndexing.ps1
Log in to the machine where Elasticsearch (ES) is running.
Stop the ES service.
Delete the entire search index folder (something like C:\TfsData\Search\IndexStore, or wherever you configured it to be).
Restart the TFS Job Agent service(s) on the AT machines.
Delete the following tables from each of the collection DBs:
DELETE FROM [Search].[tbl_IndexingUnit]
DELETE FROM [Search].[tbl_IndexingUnitChangeEvent]
DELETE FROM [Search].[tbl_IndexingUnitChangeEventArchive]
DELETE FROM [Search].[tbl_JobYield]
DELETE FROM [Search].[tbl_TreeStore]
DELETE FROM [Search].[tbl_DisabledFiles]
DELETE FROM [Search].[tbl_ResourceLockTable]
Restart the ES service.
Run this script on the TFS configuration DB:
https://github.com/Microsoft/Code-Search/blob/master/ResumeIndexing.ps1
Run this script (picked from the correct TFS release folder) on each of the collections:
https://github.com/Microsoft/Code-Search/blob/master/TFS_2017Update2/MissingIndexFolderTriggerCollectionIndexing.ps1
Try the last script on a smaller collection first (one with fewer repositories) so that you can verify that indexing happened correctly and the results are queryable.
For more details, please refer to this MSDN blog: Resetting Search Index in Team Foundation Server.
I was able to reduce the disk size by deleting the ES folders and reinstalling the code search extension, and sometimes I also had to run MissingIndexFolderTriggerCollectionIndexing.ps1.
But I came to the conclusion that it was not worth doing: the disk size was growing rapidly and reaching its original size again, so I did not save anything.
Although Microsoft recommends giving the search server disk space of 35% of the DB size, that is not enough for us, and we increase the size whenever the disk fills up completely (currently about 45% of the DB size).
The conclusion: don't touch the ES; if the disk fills up, just increase the disk size.
