Is it possible to revert a specific change if there are no dependencies in sqitch? For example, I set up my project like the code below and deploy it and load some data. A day or so later (or perhaps the same day), a stakeholder decides that I need to add some more columns to fct_tickets or make another change to that table.
If I try to revert fct_tickets, it will revert all subsequent tables which is a shame because I have loaded data to them already.
I have tried certain flags (--upon, --unto, etc.), but it still wants to revert everything after fct_tickets in my sqitch.plan file.
sqitch add scm_example --template pg_create_schema -s schema=example -n 'Create schema for Example data.'
sqitch add fct_tickets --requires scm_example -n 'Create table for ticket data.'
sqitch add fct_chats --requires scm_example -n 'Create table for chat data.'
sqitch add fct_calls --requires scm_example -n 'Create table for call data.'
sqitch add dim_users --requires scm_example -n 'Create table for user mapping data.'
sqitch add dim_source_files --requires scm_example -n 'Create table to track all files downloaded from the SFTP.'
I could alter the table and add columns, but when it is a fresh day 1 project, it is nice to have a clean slate.
It is no big deal - I am just wondering if I am missing something simple since fct_tickets has no dependencies.
No, you cannot revert a single change other than the most recently-deployed change. This is by design. Sqitch uses a Merkle tree pattern similar to Git and Blockchain to ensure deployment integrity. This means that deployment is a linked chain in the order specified in your plan file. If you have deployed your Sqitch project to an environment in which data has been loaded, you're better off adding a new change to add the new column.
A pattern I often follow when doing database development prior to a production release is to rebase changes often. That means changing the fct_tickets deploy script to add the new column, then rebating on fct_tickets^, which will revert all changes after the change just before fct_tickets, and then redeploy them all. I avoid loading data in such systems as part of the development process, instead of either keeping data that are essential to the data model in a separate change, or else in a separate file that gets loaded independently, say unit test fixtures.
If you have a test system or something that other folks have added data to, and it's not a final tagged release, then your best option is probably to dump the fct_tickets table to a file, rebase with the change, then reload the data from that file. Be sure to set a default on the column, or else modify the dump file to add the data in each row before loading it into the revamped table.
Related
I have an azure blob container with data which I have not uploaded myself. The data is not locally on my computer.
Is it possible to use dvc to download the data to my computer when I haven’t uploaded the data with dvc? Is it possible with dvc import-url?
I have tried using dvc pull, but can only get it to work if I already have the data locally on the computer and have used dvc add and dvc push .
And if I do it that way, then the folders on azure are not human-readable. Is it possible to upload them in a human-readable format?
If it is not possible is there then another way to download data automatically from azure?
I'll build up on #Shcheklein's great answer - specifically on the 'external dependencies' proposal - and focus on your last question, i.e. "another way to download data automatically from Azure".
Assumptions
Let's assume the following:
We're using a DVC pipeline, specified in an existing dvc.yaml file. The first stage in the current pipeline is called prepare.
Our data is stored on some Azure blob storage container, in a folder named dataset/. This folder follows a structure of sub-folders that we'd like to keep intact.
The Azure blob storage container has been configured in our DVC environment as a DVC 'data remote', with name myazure (more info about DVC 'data remotes' here)
High-level idea
One possibility is to start the DVC pipeline by synchronizing a local dataset/ folder with the dataset/ folder on the remote container.
This can be achieved with a command-line tool called azcopy, which is available for Windows, Linux and macOS.
As recommended here, it is a good idea to add azcopy to your account or system path, so that you can call this application from any directory on your system.
The high-level idea is:
Add an initial update_dataset stage to the DVC pipeline that checks if changes have been made in the remote dataset/ directory (i.e., file additions, modifications or removals).
If changes are detected, the update_datset stage shall use the azcopy sync [src] [dst] command to apply the changes on the Azure blob storage container (the [src]) to the local dataset/ folder (the [dst])
Add a dependency between update_dataset and the subsequent DVC pipeline stage prepare, using a 'dummy' file. This file should be added to (a) the outputs of the update_dataset stage; and (b) the dependencies of the prepare stage.
Implementation
This procedure has been tested on Windows 10.
Add a simple update_dataset stage to the DVC pipeline by running:
$ dvc stage add -n update_dataset -d remote://myazure/dataset/ -o .dataset_updated azcopy sync \"https://[account].blob.core.windows.net/[container]/dataset?[sas token]\" \"dataset/\" --delete-destination=\"true\"
Notice how we specify the 'dummy' file .dataset_updated as an output of the stage.
Edit the dvc.yaml file directly to modify the command of the update_dataset stage. After the modifications, the command shall (a) create the .dataset_updated file after the azcopy command - touch .dataset_updated - and (b) pass the current date and time to the .dataset_updated file to guarantee uniqueness between different update events - echo %date%-%time% > .dataset_updated.
stages:
update_dataset:
cmd: azcopy sync "https://[account].blob.core.windows.net/[container]/dataset?[sas token]" "dataset/" --delete-destination="true" && touch .dataset_updated && echo %date%-%time% > .dataset_updated # updated command
deps:
- remote://myazure/dataset/
outs:
- .dataset_updated
...
I recommend editing the dvc.yaml file directly to modify the command, as I wasn't able to come up with a complete dvc add stage command that took care of everything in one go.
This is due to the use of multiple commands chained by &&, special characters in the Azure connection string, and the echo expression that needs to be evaluated dynamically.
To make the prepare stage depend on the .dataset_updated file, edit the dvc.yaml file directly to add the new dependency, e.g.:
stages:
prepare:
cmd: <some command>
deps:
- .dataset_updated # add new dependency here
- ... # all other dependencies
...
Finally, you can test different scenarios on your remote side - e.g., adding, modifying or deleting files - and check what happens when you run the DVC pipeline up till the prepare stage:
$ dvc repro prepare
Notes
The solution presented above is very similar to the example given in DVC's external dependencies documentation.
Instead of the az copy command, it uses azcopy sync.
The advantage of azcopy sync is that it only applies the differences between your local and remote folders, instead of 'blindly' downloading everything from the remote side when differences are detected.
This example relies on a full connection string with an SAS token, but you can probably do without it if you configure azcopy with your credentials or fetch the appropriate values from environment variables
When defining the DVC pipeline stage, I've intentionally left out an output dependency with the local dataset/ folder - i.e. the -o dataset part - as it was causing the azcopy command to fail. I think this is because DVC automatically clears the folders specified as output dependencies when you reproduce a stage.
When defining the azcopy command, I've included the --delete-destination="true" option. This allows synchronization of deleted files, i.e. files are deleted on your local dataset folder if deleted on the Azure container.
Please, bear with me, since you have a lot of questions. Answer needs a bit structure and background to be useful. Or skip to the very end to find some new ways of doing Is it possible to upload them in a human-readable format? :). Anyways, please let me know if that solves your problem, and in general would be great to have a better description of what you are trying to accomplish at the end (high level description).
You are right that by default DVC structures its remote in a content-addressable way (which makes it non human-readable). There are pros and cons to this. It's easy to deduplicate data, it's easy to enforce immutability and make sure that no one can touch it directly and remove something, directory names in projects make it connected to actual project and their meaning, etc.
Some materials on this: Versioning Data and Models, my answer of on how DVC structures its data, upcoming Data Management User Guide section (WIP still).
Saying that, it's clear there are downsides to this approach, especially when it comes to managing a lot of objects in the cloud (e.g. millions of images, etc). To name a few concerns that I see a lot as a pattern:
Data has been created (and being updated) by someone else. There is some ETL, third party tool, etc. We need to keep that format.
Third party tool expect to have data in "human" readable way. It doesn't integrate with DVC to being able to access it indirectly via Git. (one of the examples - Label Studio need direct links to S3).
It's not practical to move all of data into DVC, it doesn't make sense to instantiate all the files at once as one directory. Users need slices, usually based on some annotations (metadata), etc.
So, DVC has multiple features to deal with data in its own original layout:
dvc import-url - it'll download objects, it'll cache them, and will by default push (dvc push) to remote to again save them to guarantee reproducibility (this can be changed). This command creates a special file .dvc that is being used to detect changes in the cloud to see if DVC needs to download something again. It should cover the case for "to download data automatically from azure".
dvc get-url - this more or less wget or rclone or aws s3 cp, etc with multi cloud support. It just downloads objects.
A bit advanced thing (if you DVC pipelines):
Similar to import-url but for DVC pipelines - external dependencies
The the third (new) option. It's in beta phase, it's called "cloud versioning" and essentially it tries to keep the storage human readable while still benefit from using .dvc files in Git if you need them to reference an exact version of the data.
Cloud Versioning with DVC (it's WPI when I write this, if PR is merged it means you can find it in the docs
The document summarizes well the approach:
DVC supports the use of cloud object versioning for cases where users prefer to retain their original filenames and directory hierarchy in remote storage, in exchange for losing the de-duplication and performance benefits of content-addressable storage. When cloud versioning is enabled, DVC will store files in the remote according to their original directory location and
filenames. Different versions of a file will then be stored as separate versions of the corresponding object in cloud storage.
Problem: Set up a workflow in CRM Dynamics 365 Sales that starts when the value of a specific field changes. But it turned out that the process does not start if changes are made to old CRM records (which were created before starting the process itself).
Question: Is there any method how can I make CRM start the process even for old records? I am sorry that everything is in Russian. I work in this version.
The process works correctly when creating a record and when editing a field in a new record. And when editing a field in an existing record, the process does not start
To make that Workflow to trigger on all records, make the scope as “Organization” instead of “User” - it should work as intended. Read more
It’s not about when it is created, probably those records are owned by somebody else. That’s why user scoped WF is not triggering at all.
Normally workflows trigger any future relevant changes. Otherwise, we should create a scenario where we cause that trigger. A couple of options,
As you have already set the workflow to run on a specific field change, you can make an update to that field and save the record which should trigger the workflow. If it's a very less number of records it's possible otherwise it's not a good idea to manually update all these records. If you don't want to do this manually you can update the records using any other option like a console app which makes updates to all the records (this would be faster assuming this is a one-time activity you have to do.)
Make this workflow on demand and trigger the same manually for all the records you want to run the workflow. Again this is a manual process but cleaner than the first one.
You do not need to do any manual update. The workflow you creates should be enough to kick in.
make sure your workflow has trigger on change of field. Screenshot for reference. It does not matter when the Workflow is created. As long as it satisfies condition it will kick in.
After some time I would like to optimize changelog I use in my Spring-Boot application. I use SQL syntax in my liquibase files.
I have updated some of the old sqls (without changing changeset name). And now I got:
Caused by: liquibase.exception.ValidationFailedException: Validation Failed:
37 change sets check sum
classpath:db/changelog/db.changelog-master.sql::4::siewer was: 8:8560502cf93e550076df8a7dc82a45b6 but is now: 8:543d80fbaa5a5b468e56ac6ef4705e58
Is it possible to change old changesets and update hashes without executing the migration? I would like already in-use databases to work without changes (but to be able to get new migrations). At this moment when I start an application with fresh database everything is ok, but when I want to run app on already populated database I got errors.
Any tips? Is this possible to be done?
You can run your migration scripts against a clear database, copy the hashes (or copy it directly from the error log one by one), and update the hash directly in the liquibase table. Just make sure it's updated together with the deployment of the new changesets, otherwise the app won't start.
Yes, it's possible. You can use <validCheckSum> for it. This will tell liquibase what the checksum for this changeSet actually is.
So:
<changeSet id="foo" author="bar">
<validCheckSum>your_checksum_here</validCheckSum>
<!-- changeSet logic here -->
</changeSet>
Or if you don't care about the checksum of the changeSet at all, you can use ANY instead of actual checksum:
<changeSet id="foo" author="bar">
<validCheckSum>ANY</validCheckSum>
<!-- changeSet logic here -->
</changeSet>
And some philosophy about it:
You shouldn't alter the existing changeSets. If you need to make the changes to your database schema, then you should write new changeSets in addition to the old ones.
I am using the SQL Deploy DACPAC community contributed step to deploy my dacpac to the server within Octopus.
It has been setup correctly and has been working fine until the below situation occurs.
I have a situation where I am dropping columns but the deploy keeps failing due to rows being detected. I am attempting to use /p:BlockOnPossibleDataLoss=false as an "Additional deployment contributor arguments" but it seems to be ignored.
Can anyone guide me to what is wrong?
The publish properties should have DropObjectsNotInSource, try to set it to True.
You might want to fine tune it, to avoid dropping users, permissions, etc.
After multiple updates by the original author, this issue was still not resolved. The parameter has actually since been completely removed since version 11.
Initially, I added a pre-deployment script that copied all the data from the tables that were expected to fail, delete all the data, allow the table schema to update as normal, and in a post-deployment script re-insert all the data into the new structure. The problem with this was that for data that could be lost, a pre-deployment and post-deployment script was required when it wasn't really needed.
Finally, I got around this by duplicating the community step "SQL - Deploy DACPAC" (https://library.octopus.com/step-templates/58399364-4367-41d5-ad35-c2c6a8258536/actiontemplate-sql-deploy-dacpac) by saving it as a copy from within Octopus. I then went into the code, into the function Invoke-DacPacUtility, and added the following code:
[bool]$BlockOnPossibleDataLoss into the parameter list
Write-Debug (" Block on possible data loss: {0}" -f $BlockOnPossibleDataLoss) into the list of debugging
if (!$BlockOnPossibleDataLoss) { $dacProfile.DeployOptions.BlockOnPossibleDataLoss = $BlockOnPossibleDataLoss; } into the list of deployment options
Then, I went into the list of parameters and added as follows:
Variable name: BlockOnPossibleDataLoss
Label: Block on possible data loss
Help test: True to stop deployment if possible data loss if detected; otherwise, false. Default is true.
Control type: Checkbox
Default value: true
With this, I am able to change the value of this parameter with the checkbox when using the step in the process of the project.
I built an Analysis that displayed Results, error free. All is well.
Then, I added some filters to existing criteria sets. I also copied an existing criteria set, pasted it, and modified it's filters. When I try to display results, I see a View Display Error.
I’d like to revert back to that earlier functional version of the analyses, hopefully without manually undoing the all of filter & criteria changes I made since then.
If you’ve seen a feature like this, I’d like to hear about it!
Micah-
Great question. There are many times in the past when we wished we had some simple SCM on the Oracle BI Web Catalog. There is currently no "out of the box" source control for the web catalog, but some simple work-arounds do exist.
If you have access server side where the web catalog lives you can start with the following approach.
Oracle BI Web Catalog Version Control Using GIT Server Side with CRON Job:
Make a backup of your web catalog!
Create a GIT Repository in the web cat
base directory where the root dir and root.atr file exist.
Initial commit eveything. ( git add -A; git commit -a -m
"initial commit"; git push )
Setup a CRON job to run a script Hourly,
Minutely, etc that will tell GIT to auto commit any
adds/deletes/modifications to your GIT repository. ( git add -A; git
commit -a -m "auto_commit_$(date +"%Y-%m-%d_%T")"; git push )
Here are the issues with this approach:
If the CRON runs hourly, and an Analysis changes 3 times in the hour
you'll be missing some versions in there.
No actual user submitted commit messages.
Object details such as the Objects pretty "Name" (caption), Description (user populated on Save Dialog), ACLs, and object custom properties are stored in a binary file format. These files have the .atr extension. The good news though is that the actual object definition is stored in a plain text file in XML (Without the .atr).
Take this as a baseline, and build upon it. Here is how you could step it up!
Use incron or other inotify based file monitoring such as ruby
based guard. Using this approach you could commit nearly
instantly anytime a user saves an object and the BI server updates
the file system.
Along with inotify, you could leverage the BI Soap API to retrieve the actual object details such as Description. This would allow you to create meaningfull commit messages. Or, parse the binary .atr file and pull the info out. Here are some good links to learn more about Web Cat ATR files: Link (Keep in mind this links are discussing OBI 10g. The binary format for 11G has changed slightly.)