Downloading data from Azure Storage Explorer using DVC - azure-blob-storage

I have an azure blob container with data which I have not uploaded myself. The data is not locally on my computer.
Is it possible to use dvc to download the data to my computer when I haven’t uploaded the data with dvc? Is it possible with dvc import-url?
I have tried using dvc pull, but can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push.
And if I do it that way, then the folders on azure are not human-readable. Is it possible to upload them in a human-readable format?
If it is not possible is there then another way to download data automatically from azure?

I'll build on @Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Assumptions
Let's assume the following:
We're using a DVC pipeline, specified in an existing dvc.yaml file. The first stage in the current pipeline is called prepare.
Our data is stored on some Azure blob storage container, in a folder named dataset/. This folder follows a structure of sub-folders that we'd like to keep intact.
The Azure blob storage container has been configured in our DVC environment as a DVC 'data remote', with the name myazure (more info about DVC 'data remotes' here). A minimal configuration example is shown right after this list.
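For reference, such a remote could be set up along these lines (the container URL is a placeholder; authentication can also use an account key or SAS token instead of a connection string):
$ dvc remote add myazure azure://mycontainer/some/path
$ dvc remote modify --local myazure connection_string "<your connection string>"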
High-level idea
One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.
This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS.
As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.
The high-level idea is:
Add an initial update_dataset stage to the DVC pipeline that checks if changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals).
If changes are detected, the update_dataset stage shall use the azcopy sync [src] [dst] command to apply the changes on the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst]).
Add a dependency between update_dataset and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.
Implementation
This procedure has been tested on Windows 10.
Add a simple update_dataset stage to the DVC pipeline by running:
$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.
Edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modifications, the command shall (a) create the .dataset_updated file after the azcopy command - touch .dataset_updated - and (b) pass the current date and time to the .dataset_updated file to guarantee uniqueness between different update events - echo %date%-%time% > .dataset_updated.
stages:
  update_dataset:
    cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
    deps:
    - remote://myazure/dataset/
    outs:
    - .dataset_updated
  ...
I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc stage add command that took care of everything in one go.
This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.
To make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:
stages:
  prepare:
    cmd: <some command>
    deps:
    - .dataset_updated # add new dependency here
    - ... # all other dependencies
  ...
Finally, you can test different scenarios on your remote side - e.g., adding, modifying or deleting files - and check what happens when you run the DVC pipeline up to the prepare stage:
$ dvc repro prepare
Notes
The solution presented above is very similar to the example given in DVC's external dependencies documentation.
Instead of the azcopy copy command, it uses azcopy sync.
The advantage of azcopy sync is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
This example relies on a full connection string with a SAS token, but you can probably do without it if you configure azcopy with your credentials or fetch the appropriate values from environment variables (see the sketch after these notes).
When defining the DVC pipeline stage, I've intentionally left out an output entry for the local dataset/ folder - i.e. the -o dataset part - as it was causing the azcopy command to fail. I think this is because DVC automatically clears the folders specified as outputs when you reproduce a stage.
When defining the azcopy command, I've included the --delete-destination="true" option. This allows synchronization of deleted files, i.e. files are deleted in your local dataset/ folder if they are deleted on the Azure container.
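For example, instead of embedding a SAS token in the connection string, you could authenticate azcopy once via Azure AD and run the same sync without the token (assuming your identity has the required roles on the container; the account and container names remain placeholders):
$ azcopy login
$ azcopy sync "https://[account].blob.core.windows.net/[container]/dataset" "dataset/" --delete-destination="true"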

Please bear with me, since you have a lot of questions - the answer needs a bit of structure and background to be useful. Or skip to the very end to find some new ways of addressing "Is it possible to upload them in a human-readable format?" :). Anyway, please let me know if that solves your problem; in general it would be great to have a better description of what you are trying to accomplish at the end (a high-level description).
You are right that by default DVC structures its remote in a content-addressable way (which makes it non-human-readable). There are pros and cons to this. It's easy to deduplicate data, it's easy to enforce immutability and make sure that no one can touch it directly and remove something, directory names in the project keep their connection to the actual project and their meaning, etc.
Some materials on this: Versioning Data and Models, my answer on how DVC structures its data, and the upcoming Data Management User Guide section (still WIP).
That said, it's clear there are downsides to this approach, especially when it comes to managing a lot of objects in the cloud (e.g. millions of images, etc). To name a few concerns that I see a lot as a pattern:
Data has been created (and is being updated) by someone else. There is some ETL, a third-party tool, etc. We need to keep that format.
A third-party tool expects to have the data in a "human-readable" layout. It doesn't integrate with DVC, so it can't access the data indirectly via Git (one example - Label Studio needs direct links to S3).
It's not practical to move all of the data into DVC, and it doesn't make sense to instantiate all the files at once as one directory. Users need slices, usually based on some annotations (metadata), etc.
So, DVC has multiple features to deal with data in its original layout:
dvc import-url - it downloads objects, caches them, and by default pushes them (dvc push) to the remote again to guarantee reproducibility (this can be changed). This command creates a special .dvc file that is used to detect changes in the cloud and see whether DVC needs to download something again. It should cover the "download data automatically from Azure" case.
dvc get-url - this is more or less wget, rclone, aws s3 cp, etc. with multi-cloud support. It just downloads objects.
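As a rough illustration, assuming the myazure remote and dataset/ folder from the other answer above (paths and output names are placeholders):
$ dvc import-url remote://myazure/dataset/ dataset/   # tracked via dataset.dvc; later run `dvc update dataset.dvc` to re-check the cloud
$ dvc get-url azure://mycontainer/dataset/ dataset/   # one-off download, nothing is tracked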
A slightly more advanced option (if you use DVC pipelines):
Similar to import-url but for DVC pipelines - external dependencies
The third (new) option is in beta. It's called "cloud versioning" and essentially it tries to keep the storage human-readable while still benefiting from .dvc files in Git if you need them to reference an exact version of the data.
Cloud Versioning with DVC (it's WIP as I write this; if the PR is merged, you can find it in the docs).
The document summarizes well the approach:
DVC supports the use of cloud object versioning for cases where users prefer to retain their original filenames and directory hierarchy in remote storage, in exchange for losing the de-duplication and performance benefits of content-addressable storage. When cloud versioning is enabled, DVC will store files in the remote according to their original directory location and
filenames. Different versions of a file will then be stored as separate versions of the corresponding object in cloud storage.
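If you want to try it out, at the time of writing this beta feature is enabled per remote with a single setting (the option name below is the one used in that WIP documentation and may still change):
$ dvc remote modify myazure version_aware true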

Related

How to change or specify a DVC experiment name?

How do I change the name of the experiment?
Tried: I used dvc exp run -n to name the experiment and then used git to push to GitHub. However, the experiment name is still the SHA.
Expected: the experiment name displayed on the Iterative Studio interface.
Actually happened: the GitHub SHA value instead of the name.
There are two types of experiments in the DVC ecosystem that we need to distinguish, and there are a few different approaches to naming them.
First, is what we sometimes call "ephemeral" experiments, those that do not create commits in your Git history unless you explicitly say so. They are described in this Get Started section. For each of those experiments a name is auto-generated (e.g. angry-upas in one of the examples from the doc) or you could use dvc exp run -n ... to pass a particular name to it.
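For example (the experiment name is illustrative):
$ dvc exp run -n cnn-64-filters
$ dvc exp show   # the custom name appears instead of an auto-generated one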
Another way to create those experiments (and send them to Studio) is to use the DVC logger (DVCLive), e.g. as described here. Those experiments will be visible in Studio with auto-generated names (or the name that was provided when they were created).
Now, we have another type of experiment - a commit. Something that someone decided to make persistent and share with the team via a PR and/or a regular commit. Those are presented in the image shared in the question.
Since they are regular commits, the regular Git rules apply to them - they have hash, they have descriptions, and most relevant to this discussion is that they have tags. All this information will be reflected in Studio UI.
E.g. in this public example-get-started repo:
I think in your case, tags would be the most natural way to name those. Maybe we can introduce a way to push the experiment name as a Git tag along with the experiment when it's being saved. WDYT?
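For example, a readable Git tag can be attached to the experiment commit and pushed (the tag name and commit are illustrative); since tags are reflected in the Studio UI, this gives the commit a human-readable label:
$ git tag -a exp-resnet50-aug -m "ResNet50 with augmentation" <commit-sha>
$ git push origin exp-resnet50-aug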
Let me know if that answers your question.

How to correctly dockerize and continuously integrate 20GB raw data?

I have an application that uses about 20GB of raw data. The raw data consists of binaries.
The files rarely - if ever - change. Changes only happen if there are errors within the files that need to be resolved.
The simplest way to handle this would be to put the files in their own git repository and create a base image based on that. Then build the application on top of the raw data image.
Having a 20GB base image for a CI pipeline is not something I have tried and does not seem to be the optimal way to handle this situation.
The main reason for my approach here is to prevent extra deployment complexity.
Is there a best practice, "correct" or more sensible way to do this?
Huge, mostly-static data blocks like this are probably the one big exception, for me, to the "Docker images should be self-contained" rule. I'd suggest keeping this data somewhere else and downloading it separately from the core docker run workflow.
I have had trouble in the past with multi-gigabyte images. Operations like docker push and docker pull in particular are prone to hanging up on the second gigabyte of individual layers. If, as you say, this static content changes rarely, there’s also a question of where to put it in the linear sequence of layers. It’s tempting to write something like
FROM ubuntu:18.04
ADD really-big-content.tar.gz /data
...
But even the ubuntu:18.04 image changes regularly (it gets security updates fairly frequently; your CI pipeline should explicitly docker pull it) and when it does a new build will have to transfer this entire unchanged 20 GB block again.
Instead I would put the data somewhere like an AWS S3 bucket or similar object storage. (This is a poor match for source control systems, which (a) want to keep old content forever and (b) tend to be optimized for text rather than binary files.) Then I'd have a script that runs on the host to download that content, and then mount the corresponding host directory into the containers that need it.
curl -LO http://downloads.example.com/really-big-content.tar.gz
tar xzf really-big-content.tar.gz
docker run -v $PWD/really-big-content:/data ...
(In Kubernetes or another distributed world, I’d probably need to write a dedicated Job to download the content into a Persistent Volume and run that as part of my cluster bring-up. You could do the same thing in plain Docker to download the content into a named volume.)
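In plain Docker, a minimal sketch of the named-volume variant could look like this (the image tag, volume name, and application image are arbitrary; the URL is the same placeholder as above, and the extracted files end up under /data/really-big-content in this layout):
docker volume create big-content
docker run --rm -v big-content:/data -w /data alpine:3.19 \
  sh -c "wget http://downloads.example.com/really-big-content.tar.gz && tar xzf really-big-content.tar.gz"
docker run -v big-content:/data:ro my-application ...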

Windows Projected File System read only?

I tried to play around with Projected File System to implement a user mode ram drive (previously I had used Dokan). I have two questions:
Is this a read-only projection? I could not find any notification sent to me when opening the file from, say, Notepad and writing to it.
Is the file actually created on the disk once I use PrjWriteFileData()? From what I have understood, yes.
In that case what would be any useful thing that one could do with this library if there is no writing to the projected files? It seems to me that the only useful thing is to initially create a directory tree from somewhere else (say, a remote repo), but nothing beyond that. Dokan still seems the way to go.
The short answer:
It's not read-only but you can't write your files directly to a "source" filesystem via a projected one.
The PrjWriteFileData method is used for populating placeholder files on the "scratch" (projected) file system, so it doesn't affect the "source" file system.
The long answer:
As stated in the comment by @zett42, ProjFS was mainly designed as a remote git file system. The main goal of any file versioning system is to handle multiple versions of files. From this a question arises - do we need to overwrite the file inside a remote repository on a ProjFS file write? That would be disastrous. When working with git you always write files locally, and they are not synced until you push the changes to a remote repository.
When you enumerate files, nothing is written to the local file system. From the ProjFS documentation:
When a provider first creates a virtualization root it is empty on the
local system. That is, none of the items in the backing data store
have yet been cached to disk.
Only after a file is opened does ProjFS create a "placeholder" for it in the local file system - I assume it's a file with a special structure (not a real one).
As files and directories under the virtualization root are opened, the
provider creates placeholders on disk, and as files are read the
placeholders are hydrated with contents.
What "hydrated" is mean? Most likely, it represents a special data structure partially filled with real data. I would imaginge a placeholder as a sponge partially filled with data.
As items are opened, ProjFS requests information from the provider to allow placeholders for those items to be created in the local file system. As item contents are accessed, ProjFS requests those contents from the provider. The result is that from the user's perspective, virtualized files and directories appear similar to normal files and directories that already reside on the local file system.
Only after a file is updated (modified) is it no longer a placeholder - it becomes a "full file/directory":
For files: The file's content (primary data stream) has been modified.
The file is no longer a cache of its state in the provider's store.
Files that have been created on the local file system (i.e. that do
not exist in the provider's store at all) are also considered to be
full files.
For directories: Directories that have been created on the local file
system (i.e. that do not exist in the provider's store at all) are
considered to be full directories. A directory that was created on
disk as a placeholder never becomes a full directory.
This means that on the first write the placeholder is replaced by a real file in the local FS. But how do we keep the "remote" file in sync with the modified one? (1)
When the provider calls PrjWritePlaceholderInfo to write the
placeholder information, it supplies the ContentID in the VersionInfo
member of the placeholderInfo argument. The provider should then
record that a placeholder for that file or directory was created in
this view.
Notice "The provider should then record that a placeholder for that file". It means that in order to sync the file later with a correct view representation we have to remember with which version a modified file is associated. Imagine we are in a git repository and we change the branch. In this case, we may update one file multiple times in different branches. Now, why and when the provider calls PrjWritePlaceholderInfo?
... These placeholders represent the state of the backing store at the
time they were created. These cached items, combined with the items
projected by the provider in enumerations, constitute the client's
"view" of the backing store. From time to time the provider may wish
to update the client's view, whether because of changes in the backing
store, or because of explicit action taken by the user to change their
view.
Once again, imagine switching branches in a git repository; you have to update a file if it's different in another branch. Continuing with question (1): imagine you want to make a "push" from a particular branch. First of all, you have to know which files were modified. If you did not record the placeholder info while modifying your file, you won't be able to do this correctly (at least for the git repository example).
Remember that a placeholder is replaced by a real file on modification? ProjFS has an OnNotifyFileHandleClosedFileModifiedOrDeleted event. Here is the signature of the callback:
public void NotifyFileHandleClosedFileModifiedOrDeletedCallback(
string relativePath,
bool isDirectory,
bool isFileModified,
bool isFileDeleted,
uint triggeringProcessId,
string triggeringProcessImageFileName)
For our understanding, the most important parameter here is relativePath. It will contain the name of the modified file inside the "scratch" (projected) file system. At this point you also know that the file is a real file (not a placeholder) and that it has been written to disk (that is, you won't be able to intercept the call before the file is written). Now you may copy it to the desired location (or do it later) - it depends on your goals.
Answering question #2: it seems that PrjWriteFileData is used only for populating the "scratch" file system, and you cannot use it for updating the "source" file system.
Applications:
As for applications, you can still implement a remote file system (instead of using Dokan), but all writes will be cached locally instead of being written directly to a remote location. A couple of use-case ideas:
Distributed File Systems
Online Drive Client
A File System "Dispatcher" (for example, you may write your files in different folders depending on particular conditions)
A File Versioning System (for example, you may preserve different versions of the same file after a modification)
Mirroring data from your app to a file system (for example, you can "project" a text file with indentations to folders, sub-folders and files)
P.S.: I'm not aware of any undocumented APIs, but from my point of view (according to the documentation) we cannot use ProjFS for purposes like a ramdisk, or write files directly to the "source" file system without writing them to the "local" file system first.

Golang file and folder replication / mirroring across multiple servers

Consider this scenario. In a load-balanced environment, I have 3 separate instances of a CMS running on 3 different physical servers. These 3 separately running instances of the application share the same database.
On each server, the CMS has a /media folder where all media subfolders and files reside. My question is how I'd implement/code a file replication service/functionality in Golang, so when a subfolder or file is added/changed/deleted on one of the servers, it'll get copied/replicated/deleted on all other servers?
What packages would I need to look in to, or perhaps you have a small code snippet to help me get started? That would be awesome.
Edit:
This question has been marked as "duplicate", but it is not. It is however an alternative to setting up a shared network file system. I'm thinking that keeping a copy of the same file on all servers, synchronizing and keeping them updated might be better than sharing them.
You probably shouldn't do this. Use a distributed file system, object storage (ala S3 or GCS) or a syncing program like btsync or syncthing.
If you still want to do this yourself, it will be challenging. You are basically building a distributed database and they are difficult to get right.
At first blush you could check out something like etcd or raft, but unfortunately etcd doesn't work well with large files.
You could, on upload, also copy the file to every other server using ssh. But then what happens when a server goes down? Or what happens when two people update the same file at the same time?
Maybe you could design it such that every file gets a unique id (perhaps based on the hash of its contents so you can safely dedupe) and those files can never be updated or deleted, only added. That would solve the simultaneous update problem, but you'd still have the downtime problem.
One approach would be for each server to maintain an append-only version log when a file is added:
VERSION | FILE HASH
1 | abcd123
2 | efgh456
3 | ijkl789
With that you can pull every file from a server, and a single number is sufficient to know which files were added. (For example, if you think Server A is on version 5 and you are informed it is now on version 7, you know you need to sync 2 files.) A minimal sketch of this idea in Go follows.
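Here is a minimal sketch of such an append-only version log in Go (the types and method names are illustrative; it leaves out persistence and the actual file transfer):
package main

import "fmt"

// Entry records that a file, identified by its content hash, was added.
type Entry struct {
    Version  int
    FileHash string
}

// Log is the append-only version log kept by each server.
type Log struct {
    entries []Entry
}

// Append registers a new file and returns its version number.
func (l *Log) Append(hash string) int {
    v := len(l.entries) + 1
    l.entries = append(l.entries, Entry{Version: v, FileHash: hash})
    return v
}

// Since returns the entries a peer is missing, given the last version it has seen.
func (l *Log) Since(version int) []Entry {
    if version >= len(l.entries) {
        return nil
    }
    return l.entries[version:]
}

func main() {
    var log Log
    log.Append("abcd123")
    log.Append("efgh456")
    log.Append("ijkl789")

    // A peer that last synced at version 1 still needs the two newer files.
    for _, e := range log.Since(1) {
        fmt.Printf("fetch file %s (version %d)\n", e.FileHash, e.Version)
    }
}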
You could do this with a database table:
ID | LOCAL_SERVER_ID | REMOTE_SERVER_ID | VERSION | FILE HASH
Which you could periodically poll and do your syncing via ssh or http between machines. If a server was down you could just retry until it works.
Or if you didn't want to have a centralized database for this you could use a library like memberlist. The local meta data for each node could be its version.
Either way, there will be some amount of delay between when a file is uploaded to a single server and when it's available on all of them. Handling that well is hard, which is why you probably shouldn't do this.

Is there a version control feature in Oracle BI Answers for a single Analysis?

I built an Analysis that displayed Results, error free. All is well.
Then, I added some filters to existing criteria sets. I also copied an existing criteria set, pasted it, and modified its filters. When I try to display results, I see a View Display Error.
I'd like to revert to that earlier functional version of the analysis, hopefully without manually undoing all of the filter & criteria changes I made since then.
If you’ve seen a feature like this, I’d like to hear about it!
Micah-
Great question. There are many times in the past when we wished we had some simple SCM on the Oracle BI Web Catalog. There is currently no "out of the box" source control for the web catalog, but some simple work-arounds do exist.
If you have access server side where the web catalog lives you can start with the following approach.
Oracle BI Web Catalog Version Control Using GIT Server Side with a CRON Job:
Make a backup of your web catalog!
Create a GIT repository in the web catalog base directory where the root dir and root.atr file exist.
Initially commit everything. ( git add -A; git commit -a -m "initial commit"; git push )
Set up a CRON job to run a script hourly, minutely, etc. that tells GIT to auto-commit any adds/deletes/modifications to your GIT repository. ( git add -A; git commit -a -m "auto_commit_$(date +"%Y-%m-%d_%T")"; git push ) A sketch of such a script and crontab entry is shown below.
Here are the issues with this approach:
If the CRON runs hourly and an Analysis changes 3 times within that hour, you'll be missing some versions.
No actual user-submitted commit messages.
Object details such as the object's pretty "Name" (caption), Description (populated by the user in the Save dialog), ACLs, and custom object properties are stored in a binary file format. These files have the .atr extension. The good news, though, is that the actual object definition is stored in a plain-text XML file (without the .atr extension).
Take this as a baseline, and build upon it. Here is how you could step it up!
Use incron or other inotify-based file monitoring, such as the Ruby-based guard. Using this approach you could commit nearly instantly any time a user saves an object and the BI server updates the file system.
Along with inotify, you could leverage the BI SOAP API to retrieve the actual object details such as the Description. This would allow you to create meaningful commit messages. Or, parse the binary .atr file and pull the info out. Here are some good links to learn more about Web Cat ATR files: Link (Keep in mind these links discuss OBI 10g; the binary format for 11g has changed slightly.)
