Can someone let me know the possible ways to trigger a Databricks notebook? My preferred method is via Azure Data Factory, but unfortunately my company is reluctant to deploy ADF at the moment.
Basically, I would like my Databricks notebook to be triggered when a blob is uploaded to Blob store. Is that possible?
You can try Auto Loader: Auto Loader supports two modes for detecting new files: directory listing and file notification.
Directory listing: Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage. In Databricks Runtime 9.1 and above, Auto Loader can automatically detect whether files are arriving with lexical ordering to your cloud storage and significantly reduce the amount of API calls it needs to make to detect new files.
File notification: Auto Loader can automatically set up a notification service and queue service that subscribe to file events from the input directory. File notification mode is more performant and scalable for large input directories or a high volume of files but requires additional cloud permissions for set up.
Refer to: https://learn.microsoft.com/en-us/azure/databricks/spark/latest/structured-streaming/auto-loader
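To make this concrete, here is a minimal Auto Loader sketch in Python as you would run it in a Databricks notebook. The paths, file format, and target table name are placeholders; adding the option cloudFiles.useNotifications = "true" would switch it from directory listing to file notification mode.

# Minimal Auto Loader sketch; "spark" is predefined in a Databricks notebook.
df = (spark.readStream
        .format("cloudFiles")                                   # Auto Loader source
        .option("cloudFiles.format", "json")                    # format of the incoming blobs
        .option("cloudFiles.schemaLocation", "/mnt/autoloader/schema")
        .load("/mnt/landing/uploads"))                          # the monitored storage path

(df.writeStream
   .option("checkpointLocation", "/mnt/autoloader/checkpoints")
   .toTable("bronze_uploads"))                                  # picks up each new blob as it arrives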
Related
Currently, in Consumption you can specify a new folder in a blob container when you create a new blob.
In Standard you have to use the upload-a-blob action, and I don't see where I can specify the folder path:
In Standard, when you choose an operation you get two options: one is Built-in and the second is Azure.
I would suggest you choose the Azure option; you will get the same list of actions as you get in Consumption.
Here, in the Azure -> Create blob (V2) action, you will see the same thing as in Consumption.
Note: choosing the built-in "Upload a blob" Azure Storage action won't give you an option for Folder Path.
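For background, the "folder path" in Blob Storage is virtual: it is just a prefix of the blob name, so anything that lets you set the full blob name can place the blob in a folder. A quick sketch with the Python SDK, where the connection string, container, and path are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholders: swap in your own connection string, container, and blob path.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("mycontainer")

# The "folder" is simply part of the blob name; it is created implicitly.
with open("report.csv", "rb") as data:
    container.upload_blob(name="myfolder/2023/report.csv", data=data, overwrite=True)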
I manage a frequently used Azure Machine Learning workspace with several experiments and active pipelines. Everything has been working well so far. My problem is getting rid of old data from runs, experiments, and pipelines. Over the last year the blob storage has grown to an enormous size, because the data from every pipeline run is stored.
I have deleted older runs from experiments using the GUI, but the actual pipeline data on the blob store is not deleted. Is there a smart way to clean up data on the blob store from runs which have been deleted?
On one of the countless Microsoft support pages, I found the following not-very-helpful post:
*Azure does not automatically delete intermediate data written with OutputFileDatasetConfig. To avoid storage charges for large amounts of unneeded data, you should either:
Programmatically delete intermediate data at the end of a pipeline run, when it is no longer needed
Use blob storage with a short-term storage policy for intermediate data (see Optimize costs by automating Azure Blob Storage access tiers)
Regularly review and delete no-longer-needed data*
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-move-data-in-out-of-pipelines#delete-outputfiledatasetconfig-contents-when-no-longer-needed
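For the first option, I could presumably script something like the sketch below (Python, azure-storage-blob; the container name and prefix are guesses and would have to match where the pipeline outputs actually land), but that still amounts to cleaning up by hand:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerClient

# Placeholders: connection string, container name, prefix, and retention window.
container = ContainerClient.from_connection_string(
    "<storage-connection-string>",
    container_name="azureml-blobstore-<workspace-id>")

cutoff = datetime.now(timezone.utc) - timedelta(days=90)

# Delete run outputs under the given prefix that are older than the cutoff.
for blob in container.list_blobs(name_starts_with="azureml/"):
    if blob.last_modified < cutoff:
        container.delete_blob(blob.name)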
Any ideas are welcome.
Have you tried applying an Azure Storage account management policy on the storage account in question?
You could either change the tier of the blobs from hot -> cool -> archive and thereby reduce costs, or even configure an auto-delete policy after a set number of days.
Reference : https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview#sample-rule
If you use Terraform to manage your resources, this is available as well.
Reference : https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy
resource "azurerm_storage_management_policy" "example" {
storage_account_id = "<azureml-storage-account-id>"
rule {
name = "rule2"
enabled = false
filters {
prefix_match = ["pipeline"]
}
actions {
base_blob {
delete_after_days_since_modification_greater_than = 90
}
}
}
}
A similar option is available via the portal settings as well.
Hope this helps!
I'm currently facing this exact problem. The most sensible approach is to enforce retention schedules at the storage account level. These are the steps you can follow:
Identify which storage account is linked to your AML instance and pull it up in the Azure portal.
Under Settings / Configuration, ensure you are using StorageV2 (which has the desired functionality).
Under Data management / Lifecycle management, create a new rule that targets your problem containers.
NOTE - I do not recommend a blanket enforcement policy against the entire storage account, because any registered datasets, models, compute info, notebooks, etc. will all be targets for deletion as well. Instead, use the prefix arguments to declare relevant paths, such as: storageaccount1234 / azureml / ExperimentRun
Here is the documentation on Lifecycle management:
https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview?tabs=azure-portal
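If you prefer to script the rule rather than click through the portal, here is a rough sketch with the azure-mgmt-storage Python SDK. The subscription, resource group, account name, prefix, and retention period are placeholders, and the model class names should be double-checked against the SDK version you have installed:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    ManagementPolicy, ManagementPolicySchema, ManagementPolicyRule,
    ManagementPolicyDefinition, ManagementPolicyFilter, ManagementPolicyAction,
    ManagementPolicyBaseBlob, DateAfterModification)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

rule = ManagementPolicyRule(
    name="purge-old-run-data",
    enabled=True,
    type="Lifecycle",
    definition=ManagementPolicyDefinition(
        # Scope the rule to run outputs only, not the whole account.
        filters=ManagementPolicyFilter(
            blob_types=["blockBlob"],
            prefix_match=["azureml/ExperimentRun"]),
        actions=ManagementPolicyAction(
            base_blob=ManagementPolicyBaseBlob(
                delete=DateAfterModification(
                    days_after_modification_greater_than=90)))))

# Lifecycle policies are always stored under the policy name "default".
client.management_policies.create_or_update(
    "<resource-group>", "storageaccount1234", "default",
    ManagementPolicy(policy=ManagementPolicySchema(rules=[rule])))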
I am deploying an MVC 3.0 web app to Windows Azure. I have an action method that takes a file uploaded by the user and stores it in a folder within my web app.
How could I give RW permissions on that folder to the running process? I have read about startup tasks and have a basic understanding, but I wouldn't know:
How to give the permission itself, and
Which running process (user) I should give the permission to.
Many thanks for the help.
EDIT
In addition to @David's answer below, I found this link extremely useful:
https://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/
For local storage, I wouldn't get caught up with granting access permissions to various directories. Instead, take advantage of the local storage resources available specifically to your running VMs. With a given instance size, you have local storage available to you ranging from 20 GB to almost 2 TB (full sizing details here). To take advantage of this space, you'd create local storage resources within your project:
Then, in code, grab a drive letter to that storage:
var storageRoot = RoleEnvironment.GetLocalResource("moreStorage").RootPath;
Now you're free to use that storage. And... none of that requires any startup tasks or granting of permissions.
Now for the caveat: This is storage that's local to each running instance, and isn't shared between instances. Further, it's non-durable - if the disk crashes, the data is gone.
For persistent, durable file storage, Blob Storage is a much better choice, as it's durable (triple-replicated within the datacenter, and geo-replicated to another datacenter) and it's external to your role instances, accessible from any instance (or any app, including your on-premises apps).
Since blob storage is organized by container, and blobs within containers, it's fairly straightforward to organize your blobs (and you can store pretty much anything in a given blob, up to 200 GB each). Also, it's trivial to upload/download files to/from blobs, either to file streams or local files (in the storage resources you allocated, as illustrated above).
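Purely for illustration (the MVC app itself would use the .NET storage client, but the container/blob layout and the flow are the same), a short Python sketch of uploading a user file to a container and reading it back; all names are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholders: connection string, container, and blob names.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("user-uploads")

# Upload a local file to a blob.
with open("photo.jpg", "rb") as data:
    container.upload_blob(name="user42/photo.jpg", data=data, overwrite=True)

# Download it back to a local file (e.g. into the local storage resource above).
with open("photo-copy.jpg", "wb") as out:
    container.download_blob("user42/photo.jpg").readinto(out)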
I have 10 applications that share the same logic of writing their log to a text file located in the application root folder.
I have an application which reads the log files of all the applications and shows the details in a web page.
Can the same be achieved on Windows Azure? I don't want to use the 'DiagnosticMonitor' APIs, as I cannot change the logging logic of the applications.
Thanks,
Aman
Even if this is technically possible, it is not advisable, as the Fabric Controller can re-create any role at a whim (well, with good reasons, but unpredictably nonetheless), and so whenever this happens you will lose any files stored locally on a role.
So, primarily you should be looking for a different place to store those logs. There are many options, but all require that you change the logging logic of the application.
You could do this, but aside from the issue Yossi pointed out (the log would be ephemeral; it could get deleted at any time), you'd have a different log file on each role instance (VM). That means when you hit your web page to view the log, you'd see whatever happened to be on the log on that particular VM, instead of what you presumably want (a roll-up of the log files across all VMs).
Windows Azure Diagnostics could help, since you can configure it to copy log files off to blob storage (so no need to change the logging). But honestly I find Diagnostics a bit cumbersome for this. It will end up creating a lot of different blobs, and you'll have to change the log viewer to read all those blobs and combine them.
I personally would suggest writing a separate piece of code that monitors the log file and, for each new line, stores the line as an entity (row) in table storage. This bit of code could be launched as a startup task and just run continuously as a separate process (leaving everything else unchanged). Then modify the log viewer to read the last n entities from table storage and display them.
(I'm assuming you can modify the log viewer even if you can't modify the apps that log to the file.)
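A rough sketch of that idea, in Python with azure-data-tables purely to illustrate (the table name, log path, and partition/row key scheme are placeholder choices, and the same can be done with the .NET table client):

import time
from datetime import datetime, timezone
from azure.data.tables import TableServiceClient

# Placeholders: connection string, table name, log path, and instance id.
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("applogs")
instance_id = "<role-instance-id>"

with open(r"C:\app\app.log", "r") as log:
    log.seek(0, 2)                        # start tailing from the end of the file
    while True:
        line = log.readline()
        if not line:
            time.sleep(1)                 # wait for new lines to be appended
            continue
        table.create_entity({
            "PartitionKey": instance_id,  # one partition per VM instance
            "RowKey": datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f"),
            "Message": line.rstrip(),
        })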
What about writing logs to something like Azure Table storage? You just need to define a unique PartitionKey/RowKey, and then you can easily retrieve the log for the web page.
In my WP7 app, I have to store images and XML files of two types:
1: The first type of files is not updated frequently on the server, so I want to store them permanently in local storage so that whenever the app starts it can access these files from local storage, and when these files are updated on the server, the local copies should be updated as well. I want these files not to be deleted on application termination.
2: The second type of files is those I want to save in isolated storage temporarily, e.g. the app requests an XML file from the server, I store it locally, and the next time the app requests the same file it gets it from local storage instead of from the server. These files should be deleted when the application terminates.
How can I do this?
Thanks
1) Isolated Storage is designed to store data that should remain permanent (until the user uninstalls the app). There's example code of how to write and save a file on MSDN. Therefore, any file you save (temp or not) will be stored until the user uninstalls the app or your app deletes the file.
2) For temporary data, you can use the PhoneApplicationService.State property. This data is automatically discarded after your app closes. However, there's a size limit (I believe PhoneApplicationService.State has a limit of 4 MB).
Alternatively, if the XML file is too big, you can write it to the Isolated Storage. Then, you can handle your page's Closing event and delete the file from Isolated Storage there using the DeleteFile method.