Retention policy to TFS Code Search Server (Elastic Search) - elasticsearch

We have TFS 2017.3 with separate Code Search server.
We have huge TFS DB (about 1.6TB), in the code search server we have 700GB dis space.
After few weeks the disk space running out and the code search not work in the tfs.
After we increase the disk space the search back to work.
How can we make retention policy to delete old code search data (index)? we don't want to increased more the disk space.

Search indexing (Code and Work Item) works in 2 phases:
Bulk Indexing (BI) where the entire code and work item artifacts in all projects/repositories under a Collection are indexed. This is a
time consuming operation and depends on the size of the artifacts
under the collection.
Continuous Indexing (CI) which handles all incremental updates to the artifacts (add/updated/delete) and indexes them. This is
notification based model where the indexer listens to TFS events
and operates based on those event notifications. CI handles almost
all update operations including CRUD operations at
Project/Repository/Collection layer (such as Repository renames,
Project add/deletes, etc.). The operation time for these CI would
depend again on the size of the incremental update. BI always
precedes CI i.e. a CI will never execute on a project/repository
until BI is completed for the same.
How to Clean-up Index Data and Re-index please follow below steps:
Pause Indexing for all collections. Run the following script on TFS
Configuration DB
https://github.com/Microsoft/Code-Search/blob/master/PauseIndexing.ps1
Login to the machine where the Elasticsearch (ES) is running
Stop the ES service
Delete the entire Search Index folder (something like,
C:\TfsData\Search\IndexStore, or wherever you had configured it to
be)
Restart the TFS Job Agent service(s) on the AT machines
Delete the following tables from each of the collection DBs
DELETE FROM [Search].[tbl_IndexingUnit]
DELETE FROM [Search].[tbl_IndexingUnitChangeEvent]
DELETE FROM [Search].[tbl_IndexingUnitChangeEventArchive]
DELETE FROM [Search].[tbl_JobYield]
DELETE FROM [Search].[tbl_TreeStore]
DELETE FROM [Search].[tbl_DisabledFiles]
DELETE FROM [Search].[tbl_ResourceLockTable]
Restart the ES service
Run this script on TFS Configuration DB:
https://github.com/Microsoft/Code-Search/blob/master/ResumeIndexing.ps1
Run this script (pick from the correct TFS release folder) on each of
the collections:
https://github.com/Microsoft/Code-Search/blob/master/TFS_2017Update2/MissingIndexFolderTriggerCollectionIndexing.ps1
Try the last script on a smaller collection first (which has less
number of repositories) so that you can verify that indexing happened
correctly and the results are query-able.
More details please refer this blog in MSDN: Resetting Search Index in Team Foundation Server

I was able to reduce the disk size after deleting the ES folders, reinstalling the code search extension, and sometimes had to run the MissingIndexFolderTriggerCollectionIndexing.ps1.
But - I came to the conclusion that it was not worth doing, the disk size was growing rapidly and reaching the original size, so I did not save anything.
Although Microsoft recommends giving disk space of 35% of the DB, it is not enough for us and we increase the size when the disk is full to the end (currently about 45% of the DB size).
The conclusion - don't touch the ES, if the disk fills up then increase the disk size.

Related

Will reloading all TDE templates tigger reindexing cause ML Performance issue?

Now I am using gradle mlReloadSchemas tasks to reload TDE templates.
I guess even if the change is for one tde file only, the reload schemas task may delete all in DB and load all TDE templates to ML DB.
I wonder whether it will cause a performance issue for ML. Will that trigger indexing even for the TDE files that have not yet changed?
I am using DevOps pipeline to trigger the schema reload from GIT repository. As such, I could not load only the change TDE file. I have to reload everything. If there is performance issue, how to load only changed file with the pipeline?
Redeploying TDE can cause reindexing. How many records to be reindexed depends upon the context matching for those TDE.
A properly resourced cluster should be able to handle the load of reindexing.
That being said, the merging activities can compete with online traffic and query demands. You can help minimize the impact by setting the reindex throttle to a lower level (1-5 with 1 being the lowest), and you can set a background-io limit to restrict the amount of IO any node will use for background activities such as merges and backups.
You can also choose when to enable/disable reindexing, and adjust the reindexing level to a higher/lower level at different periods.
https://help.marklogic.com/Knowledgebase/Article/View/how-reindexing-works-and-its-impact-on-performance
https://help.marklogic.com/Knowledgebase/Article/View/indexing-best-practices

Clean Up Azure Machine Learning Blob Storage

I manage a frequently used Azure Machine Learning workspace. With several Experiments and active pipelines. Everything is working good so far. My problem is to get rid of old data from runs, experiments and pipelines. Over the last year the blob storage grew to enourmus size, because every pipeline data is stored.
I have deleted older runs from experimnents by using the gui, but the actual pipeline data on the blob store is not deleted. Is there a smart way to clean up data on the blob store from runs which have been deleted ?
On one of the countless Microsoft support pages, I found the following not very helpfull post:
*Azure does not automatically delete intermediate data written with OutputFileDatasetConfig. To avoid storage charges for large amounts of unneeded data, you should either:
Programmatically delete intermediate data at the end of a pipeline
run, when it is no longer needed
Use blob storage with a short-term storage policy for intermediate data (see Optimize costs by automating Azure Blob Storage access tiers)
Regularly review and delete no-longer-needed data*
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#delete-outputfiledatasetconfig-contents-when-no-longer-needed
Any idea is welcome.
Have you tried applying an azure storage account management policy on the said storage account ?
You could either change the tier of the blob from hot -> cold -> archive and thereby reduce costs or even configure a auto delete policy after a set number of days
Reference : https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview#sample-rule
If you use terraform to manage your resources this should be available a
Reference : https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy
resource "azurerm_storage_management_policy" "example" {
storage_account_id = "<azureml-storage-account-id>"
rule {
name = "rule2"
enabled = false
filters {
prefix_match = ["pipeline"]
}
actions {
base_blob {
delete_after_days_since_modification_greater_than = 90
}
}
}
}
Similar option is available via the portal settings as well.
Hope this helps!
Currently facing this exact problem. The most sensible approach is to enforce retention schedules at the storage account level. These are the steps you can follow:
Identify which storage account is linked to your AML instance and pull it up in the azure portal.
Under Settings / Configuration, ensure you are using StorageV2 (which has the desired functionality)
Under Data management / Lifecycle management, create a new rule that targets your problem containers.
NOTE - I do not recommend a blanket enforcement policy against the entire storage account, because any registered datasets, models, compute info, notebooks, etc will all be target for deletion as well. Instead, use the prefix arguments to declare relevant paths such as: storageaccount1234 / azureml / ExperimentRun
Here is the documentation on Lifecycle management:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview?tabs=azure-portal

Azure cognitive search indexer blob storage

I am stuck in a complicated situation and appreciate that if somebody can help.
So I was testing indexing blob storage( pdf files) and indexed a copy of my storage in qa environment that cost me some money.
My question is that:
Is there any solution to use this index in production without indexing again?
I found a solution to copy the index and that works fine but when I add an indexer that is connect to production blob storage it start indexing from scratch again( as I expected). Is there any solution to avid this? Is there any solution to ask indexer to index from now on?
I tried to use the index and the indexer that I already have by changing the subscription to prod. But I have to change the data source for indexer to point at production blob storage and in this case I get an error :
Indexer 'filesIndexer' currently references data source 'qafilesds' and cannot be updated to reference a different datasource 'prodfilesds' because it has a non-empty change tracking state, or it is currently in progress. You can use Reset API to reset the indexer's change tracking state when it is no longer in progress, and retry this call.
A simple answer to your first question is to simply use the qa index you built.
A more complicated answer is to switch from the push model you are using now to a pull model. From your explanation above I assume all of your content comes from blob storage. And you have configured an indexer to do the indexing for you. This is known as the pull model.
The alternative is to use the Azure Cognitive Search SDK to write your own application that submits content to the index instead. In this case you do not use the built-in indexer, only the index itself. Then you are free to use whatever logic you want to determine what to index and what to skip. You can even enable your storage accounts to notify your application with events when content is updated.

Magento reindex does not populate solr search

We have solr installed as a tomcat application. With the default installation of EE magento with sample data all runs perfect.
However, with a couple of our stores running the index does not touch the /solr/data folder. Normally when a reindex is called all files in /data are cleared and repopulated. However, on the live site this does not happen.
I have stipped back a lot of extesnions and run the index again.. this time the files are recreated but almost no data is captured... even less than the sample stores despite a much larger database.
Anyone any clues as to where we should be looking?

MySQL database backup: performance issues

Folks,
I'm trying to set up a regular backup of a rather large production database (half a gig) that has both InnoDB and MyISAM tables. I've been using mysqldump so far, but I find that it's taking increasingly longer periods of time, and the server is completely unresponsive while mysqldump is running.
I wanted to ask for your advice: how do I either
Make mysqldump backup non-blocking - assign low priority to the process or something like that, OR
Find another backup mechanism that will be better/faster/non-blocking.
I know of the existence of MySQL Enterprise Backup product (http://www.mysql.com/products/enterprise/backup.html) - it's expensive and this is not an option for this project.
I've read about setting up a second server as a "replication slave", but that's not an option for me either (this requires hardware, which costs $$).
Thank you!
UPDATE: more info on my environment: Ubuntu, latest LAMPP, Amazon EC2.
If replication to a slave isn't an option, you could leverage the filesystem, depending on the OS you're using,
Consistent backup with Linux Logical Volume Manager (LVM) snapshots.
MySQL backups using ZFS snapshots.
The joys of backing up MySQL with ZFS...
I've used ZFS snapshots on a quite large MySQL database (30GB+) as a backup method and it completes very quickly (never more than a few minutes) and doesn't block. You can then mount the snapshot somewhere else and back it up to tape, etc.
Edit: (previous answer was suggestion a slave db to back up from, then I noticed Alex ruled that out in his question.)
There's no reason your replication slave can't run on the same hardware, assuming the hardware can keep up. Grab a source tarball, ./configure --prefix=/dbslave; make; make install; and you'll have a second mysql server living completely under /dbslave.
EDIT2: Replication has a bunch of other benefits, as well. For instance, with replication running, you'll may be able to recover the binlog and replay it on top your last backup to recover the extra data after certain kinds of catastrophes.
EDIT3: You mention you're running on EC2. Another, somewhat contrived idea to keep costs down is to try setting up another instance with an EBS volume. Then use the AWS api to spin this instance up long enough for it to catch up with writes from the binary log, dump/compress/send the snapshot, and then spin it down. Not free, and labor-intensive to set up, but considerably cheaper than running the instance 24x7.
Try mk-parallel-dump utility from maatkit (http://www.maatkit.org/)
regards,
Something you might consider is using binary logs here though a method called 'log shipping'. Just before every backup, issue out a command to flush the binary logs and then you can copy all except the current binary log out via your regular file system operations.
The advantage with this method is your not locking up the database at all, since when it opens up the next binary log in sequence, it releases all the file locks on the prior logs so processing shouldn't be affected then. Tar'em, zip'em in place, do as you please, then copy it out as one file to your backup system.
An another advantage with using binary logs is you can restore up to X point in time if the logs are available. I.e. You have last year's full backup, and every log from then to now. But you want to see what the database was on Jan 1st, 2011. You can issue a restore 'until 2011-01-01' and when it stops, your at Jan 1st, 2011 as far as the database is concerned.
I've had to use this once to reverse the damage a hacker caused.
It is definately worth checking out.
Please note... binary logs are USUALLY used for replication. Nothing says you HAVE to.
Adding to what Rich Adams and timdev have already suggested, write a cron job which gets triggered on low usage period to perform the slaving task as suggested to avoid high CPU utilization.
Check mysql-parallel-dump also.

Resources