I have a streaming pipeline in Azure Data Factory that suddenly vanished from the Monitoring tab on the Azure Data Factory page. It has been running on Databricks for quite a long time (maybe more than 45 days without interruption).
One day the pipeline was no longer shown under Pipeline runs in the ADF Monitoring tab, and the previous runs disappeared as well, but none of the alerts set up on the Databricks side fired. It turns out the job is still running on the Databricks side, but the corresponding ADF pipeline details disappeared along with the old runs for the same streaming pipeline.
How is this possible? Any reason for this to happen?
If the pipeline is still running, its runs should still show up in the monitor.
Maybe check the filters in the Data Factory Monitor view, e.g. the time range (which defaults to the last 24 hours) and the pipeline name filter.
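You can also query the run history directly through the Data Factory management API to rule out a UI or filter issue. A minimal sketch with the azure-mgmt-datafactory Python SDK; the subscription id, resource group, and factory name below are placeholders:

```python
# Minimal sketch: query ADF pipeline runs directly, bypassing the Monitoring UI.
# Subscription id, resource group and factory name are placeholders.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Ask for every run updated in the last 60 days, regardless of any UI filter.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=60),
    last_updated_before=datetime.utcnow(),
)
runs = client.pipeline_runs.query_by_factory("<resource-group>", "<factory-name>", filters)
for run in runs.value:
    print(run.pipeline_name, run.run_id, run.status)
```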
I am migrating a Databricks workspace from one account to another. As part of this process I need to refresh the Databricks workspace with updated data (notebooks, users, groups, clusters, databases and tables).
Is there any process to clean up the Databricks workspace?
You can update the Azure Databricks workspace by using the API provided by Microsoft. The following call updates the specified workspace:
PATCH https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Databricks/workspaces/{workspaceName}?api-version=2018-04-01
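As an illustration, a minimal sketch of calling that endpoint from Python with the requests library; the subscription, resource group, workspace name, tag values and the bearer token are placeholders, not values from your environment:

```python
# Minimal sketch: PATCH the Databricks workspace resource (updates workspace tags).
# Subscription, resource group, workspace name and token below are placeholders.
import requests

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workspace_name = "<workspace-name>"
token = "<bearer-token-for-management.azure.com>"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.Databricks"
    f"/workspaces/{workspace_name}?api-version=2018-04-01"
)

body = {"tags": {"environment": "dev"}}  # example tag update

resp = requests.patch(url, json=body, headers={"Authorization": f"Bearer {token}"})
print(resp.status_code, resp.json())
```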
To clean up the resources, you can terminate the cluster. To do so, in the Azure Databricks workspace, select Clusters from the left pane. For the cluster you want to terminate, move the cursor over the ellipsis in the Actions column and select the Terminate icon. This stops the cluster.
If you do not manually terminate the cluster, it will stop automatically, provided you selected the Terminate after __ minutes of inactivity checkbox when creating the cluster; in that case the cluster stops once it has been inactive for the specified time.
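If you prefer to script the cleanup instead of clicking through the UI, the Databricks Clusters REST API can list and terminate clusters. A rough sketch; the workspace URL and personal access token are placeholders:

```python
# Rough sketch: terminate all running clusters via the Databricks Clusters API 2.0.
# Workspace URL and personal access token are placeholders.
import requests

host = "https://<workspace-instance>.azuredatabricks.net"
headers = {"Authorization": "Bearer <personal-access-token>"}

# List clusters, then send a terminate (delete) request for each running one.
clusters = requests.get(f"{host}/api/2.0/clusters/list", headers=headers).json()
for cluster in clusters.get("clusters", []):
    if cluster["state"] == "RUNNING":
        requests.post(
            f"{host}/api/2.0/clusters/delete",
            headers=headers,
            json={"cluster_id": cluster["cluster_id"]},
        )
        print("Terminating", cluster["cluster_id"], cluster.get("cluster_name"))
```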
We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook which consumes from an Event Hub with PySpark Structured Streaming, calculates some values based on the data, and streams the results back to another Event Hub.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks Runtime 9.1 ML.
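Roughly, the notebook has the following shape; the connection strings, transformation logic and checkpoint path below are placeholders rather than the real job:

```python
# Rough sketch of the pipeline shape (placeholders only, not the actual job):
# read from one Event Hub, derive values, write the result to another Event Hub.
from pyspark.sql import functions as F

in_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<input-connection-string>")
}
out_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<output-connection-string>")
}

df = spark.readStream.format("eventhubs").options(**in_conf).load()

result = (df
    .withColumn("body", F.col("body").cast("string"))
    # ... calculations on the payload go here ...
)

(result
    .select("body")
    .writeStream
    .format("eventhubs")
    .options(**out_conf)
    .option("checkpointLocation",
            "wasbs://<container>@<standard-account>.blob.core.windows.net/checkpoints/job")
    .start())
```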
We've tried restarting the job many times, also with clean input data and a clean checkpoint location.
We struggle to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which is sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
The issue was resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running with hardly a hiccup.
Premium storage might be a better place for checkpointing anyway since I/O is cheaper.
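A minimal sketch of what the change might look like, assuming the checkpoint is set via the writeStream option and the premium (ADLS Gen2) account and container names are placeholders:

```python
# Minimal sketch: move the Structured Streaming checkpoint to a premium (ADLS Gen2) account.
# Account, container and path names are placeholders.
checkpoint_path = "abfss://checkpoints@<premium-account>.dfs.core.windows.net/streaming-job"

query = (output_df.writeStream              # output_df: the streaming DataFrame being written
    .format("eventhubs")
    .options(**output_eventhub_conf)        # same Event Hub sink options as before
    .option("checkpointLocation", checkpoint_path)  # previously pointed at standard storage
    .start())
```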
It is possible to run local activities that don't require a connection to the Cadence server. Is there a proper way to run workflows locally, too, in case of a Cadence outage?
I'm using the Go client.
A connection to the service is required to make any progress in a workflow execution, including scheduling activities.
To run workflows locally you can use a local version of the Cadence service, which can easily be installed through Docker Compose.
If you need a high-availability setup you can use multi-cluster Cadence, so a single-cluster outage will not cause a workflow execution outage.
I followed TeamCity's description of running a TeamCity build server on AWS with a CloudFormation template. I launched it, and it gets stuck at AgentService (Resource creation initiated). I waited for half an hour with no progress.
Resources tab shows the following:
What am I doing wrong here?
(For me) this typically happens when the service cannot be started, for instance because the cluster does not have enough suitable instances to place your service on.
To diagnose it, open your service in the ECS cluster and check its events; also look at the stopped tasks of your service and the reasons they were stopped.
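A quick way to pull the same information programmatically is boto3; a rough sketch, with the cluster and service names as placeholders:

```python
# Rough sketch: inspect ECS service events and stopped-task reasons with boto3.
# Cluster and service names are placeholders.
import boto3

ecs = boto3.client("ecs")
cluster, service = "<cluster-name>", "<service-name>"

# Recent service events often explain placement failures
# (e.g. not enough suitable container instances).
svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
for event in svc["events"][:10]:
    print(event["createdAt"], event["message"])

# Stopped tasks carry the reason they were stopped.
stopped = ecs.list_tasks(cluster=cluster, serviceName=service, desiredStatus="STOPPED")
if stopped["taskArns"]:
    tasks = ecs.describe_tasks(cluster=cluster, tasks=stopped["taskArns"])["tasks"]
    for task in tasks:
        print(task["taskArn"], task.get("stoppedReason"))
```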
I got a tip from a colleague that if you are creating a service from a CloudFormation template, it may take up to 3(!) hours. I tried again today, and after 3 hours it was up and running.
The reason for this is the ECS setup, which involves DNS setup for an internet-facing service.
I have TeamCity (7.0.2) successfully spinning up an EC2 VM from a custom AMI, running our build, and sending back the build artifacts.
However, even when I used to do this with older TeamCity versions, I was always unhappy with the notion that it simply terminates the instances after they are done, and then creates new instances using the configured AMI next time a build agent is needed.
Can I get TeamCity to issue "stop" commands instead, followed by "start" commands? This has a tonne of advantages - quicker spin-up time, allowing for named instances in the agent stats, and saving the Mercurial clone to EBS for the next build are just three.
P.S. I guess I could use chained builds to call the EC2 API directly rather than use the built-in cloud support, but that sounds like a lot of work and feels flaky.
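For reference, the stop/start calls themselves are simple if scripted from a chained build; a rough sketch with boto3, where the region and instance id are placeholders:

```python
# Rough sketch: stop an EBS-backed build agent instead of terminating it,
# then start it again when a build needs an agent. Region and instance id are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="<region>")
instance_id = "<agent-instance-id>"

# After the build finishes: stop (not terminate) so the EBS volume is kept,
# preserving e.g. the Mercurial clone for the next build.
ec2.stop_instances(InstanceIds=[instance_id])

# When an agent is needed again: start the same instance for a quicker spin-up.
ec2.start_instances(InstanceIds=[instance_id])
```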
We plan to provide support for starting/stopping EBS-backed instances in TeamCity 7.1.
Please vote for TW-16419.
Note that TeamCity 7.0 may leak EBS volumes: TW-12517.