Not able to create a new cluster in Azure Databricks - azure-databricks

I have a free trial with some credits remaining. I want to create a new cluster inside Azure Databricks and write some code in Scala notebooks, but every time I try to create a new cluster it shows as terminated. Can someone help with what needs to be done to create a new cluster?

Using Databricks with an Azure free trial subscription, we cannot use a cluster that consumes more than 4 cores. It sounds like you are using a Standard cluster, which consumes 8 cores (4 worker and 4 driver cores).
So, try creating a Single Node cluster, which only consumes 4 cores (driver cores) and therefore does not exceed the limit. You can refer to the following document to learn more about single node clusters; a minimal scripted example follows below.
https://learn.microsoft.com/en-us/azure/databricks/clusters/single-node
If you need to use a Standard cluster, upgrade your subscription to pay-as-you-go or use the 14-day free trial of Premium DBUs in Databricks. The following link covers a problem like the one you are facing.
https://learn.microsoft.com/en-us/answers/questions/35165/databricks-cluster-does-not-work-with-free-trial-s.html
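If you end up scripting the Single Node cluster suggested above, a minimal sketch is shown below. It only builds the cluster spec and writes it to a JSON file for the legacy databricks-cli; the runtime version, node type, and file path are illustrative assumptions, so substitute values available in your workspace.

# single_node_cluster.py - writes a cluster spec that stays within the 4-core trial quota.
# Assumption: the legacy databricks-cli is installed and configured (databricks configure --token).
import json

spec = {
    "cluster_name": "trial-single-node",
    "spark_version": "11.3.x-scala2.12",   # example runtime; pick one listed in your workspace
    "node_type_id": "Standard_DS3_v2",      # 4-core VM size, so it fits the free-trial limit
    "num_workers": 0,                       # single node: driver only, no workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]"
    },
    "custom_tags": {"ResourceClass": "SingleNode"}
}

with open("/tmp/single_node_cluster.json", "w") as f:
    json.dump(spec, f, indent=2)

# Then create the cluster with:
#   databricks clusters create --json-file /tmp/single_node_cluster.json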

That is normal. You can create your Scala notebook and then attach and start the cluster from the drop-down menu in the Databricks notebook.

Related

How to use Azure Spot instances on Databricks

Spot instances bring the possibility of using spare resources in the cloud at a lower price; however, if cloud demand increases, your resources will be deallocated. This is very useful for non-critical workloads where you can afford to lose some of the work done.
Databricks has the possibility to run spot instances on AWS, but there is no documentation about how to do it on Azure.
Is it possible to run Databricks clusters on Azure Spot instances?
Yes, it is possible, but not through the Databricks UI. To use Azure spot instances on Databricks you need to use the Databricks CLI.
Note
With the CLI tool it is possible to administer (create, edit, delete) clusters and instance pools. However, to simplify the process, I'll focus on editing an existing cluster.
You can install the Databricks CLI with pip install databricks-cli and configure your credentials with databricks configure --token. For more information, see the Databricks documentation.
Run the command databricks clusters list to find the ID of the cluster you want to modify:
$ databricks clusters list
0422-112415-fifes919 Big Spark3 TERMINATED
0612-341234-jails230 Normal Spark3 TERMINATED
0212-623261-mopes727 Small 7.6 TERMINATED
In my case, I have 3 clusters. The first column is the cluster ID, the second is the cluster name, and the last is the state.
The command databricks clusters get prints the cluster config in JSON format. Let's dump it to a file so we can modify it:
databricks clusters get --cluster-id 0422-112415-fifes919 > /tmp/my_cluster.json
This file contains all the configuration related to the cluster like name, instance type, owner... In our case we are looking for the azure_attributes section. You will see something similar to:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "ON_DEMAND_AZURE",
"spot_bid_max_price": -1.0
},
...
We need to change availability to SPOT_WITH_FALLBACK_AZURE and set spot_bid_max_price to our bid price. Edit the file with your favorite tool. The result should look something like:
...
"azure_attributes": {
"first_on_demand": 1,
"availability": "SPOT_WITH_FALLBACK_AZURE",
"spot_bid_max_price": 0.4566
},
...
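If you prefer to script this change instead of editing the file by hand, a minimal sketch (assuming the file path from the step above and an illustrative bid price) could look like this:

# Rewrite azure_attributes in the exported cluster config so the workers run on spot VMs.
import json

path = "/tmp/my_cluster.json"
with open(path) as f:
    cluster = json.load(f)

cluster["azure_attributes"]["availability"] = "SPOT_WITH_FALLBACK_AZURE"
cluster["azure_attributes"]["spot_bid_max_price"] = 0.4566   # example bid; -1.0 means cap at the on-demand price

with open(path, "w") as f:
    json.dump(cluster, f, indent=2)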
Once modified, just update the cluster with the new configuration file using databricks clusters edit:
databricks clusters edit --json-file /tmp/my_cluster.json
Now, every time you start the cluster, the workers will be spot instances. To confirm this, you can go to the Configuration tab of a worker VM allocated in the resource group managed by Databricks. You will see that Azure Spot is active, with the price you configured.
Databricks on AWS has more configuration options, such as SPOT for the availability field. However, until the documentation is released we'll need to wait or configure things with a trial-and-error approach.

Configure MSMQ cluster failover on an Azure VM (Windows Server 2016)

I want to create a failover cluster for MSMQ across two VMs in Azure. I created two VMs in Azure and they are domain joined. I can create the failover cluster with both nodes. However, when I try to add a role for MSMQ, I need a cluster shared disk. I tried to create a new managed disk in Azure and attach it to the VMs, but the cluster still wasn't able to find the disk.
I also tried file-share sync, but that is still not working.
I found out I need an iSCSI disk; there was this article https://learn.microsoft.com/en-us/azure/storsimple/storsimple-virtual-array-deploy3-iscsi-setup, but it reaches end of life next year.
So I am wondering whether it is possible to set up a failover cluster for MSMQ on Azure, and if so, how can I do it?
Kind regards,
You should be able to create a Cluster Shared Volume using Storage Spaces Direct across a cluster of Azure VMs. Here are instructions for a SQL failover cluster. I assume this should work for MSMQ, but I haven't set up MSMQ in over 10 years and I don't know whether the requirements are different.

Datastax Cassandra - Amazon EC2 instances - Cluster with three nodes spanning Amazon regions

I am planning to create a cluster with three nodes, and each node will be launched in a different Amazon EC2 region.
As per the Datastax documentation, I will use Ec2MultiRegionSnitch and the NetworkTopologyStrategy replication strategy. Below is what I need to achieve:
Cluster size: 3 (spanning Amazon EC2 regions)
Replication factor: 3
Read and write consistency level: QUORUM
Based on the above configuration, I can survive a single node loss (meaning any one of the Amazon regions going down; correct me if I am wrong).
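For reference, QUORUM needs floor(RF/2) + 1 replicas to respond, so with RF = 3 one replica (here, one region) can be down and reads and writes at QUORUM still succeed. A quick sketch of that arithmetic:

# Quorum math for the Cassandra QUORUM consistency level.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

rf = 3
needed = quorum(rf)          # 2 replicas must acknowledge
tolerated = rf - needed      # 1 replica (one region) may be down
print(f"RF={rf}: QUORUM needs {needed} replicas, tolerates {tolerated} down")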
In order to achieve the above configuration, I have two options:
Option-1: Using the Datastax-provided Amazon EC2 AMI image.
This option launches the instance with almost all the components needed to run Cassandra, plus some monitoring tools (OpsCenter, etc.).
But it stores all data on the EC2 instance store, so data persists only for the life of the instance and the storage size depends on the instance type.
Option-2: Using a customised installation.
In this option, I have to launch an Amazon EC2 Ubuntu AMI, install Java, and install the Datastax Community Edition.
This option enables me to store all my data on EBS, so I can expand EBS whenever I need to and at the same time restore any node from an EBS snapshot.
My Question:
Which of the two options is suitable for my needs?
Note:
I have read the documentation provided by Datastax and am very new to Cassandra, so whatever input you provide will be very useful to me.
Thanks
It's not true that the Datastax AMI comes only with EC2 ephemeral storage. Starting from version 2.5 they claim you can choose EBS as well: Introducing the DataStax Auto-Clustering AMI 2.5. That's a relatively easy way of getting started, which I've personally chosen.
Should you choose EBS or EC2 ephemeral storage?
The answer is: it depends...
The past (~2012-2013):
EC2 instances with ephemeral storage were the better choice. Detailed performance benchmarks over the years indicated that EBS was getting better, but attached physical drives were still better.
The past (~2014):
The EC2 ephemeral storage choice was still better. Datastax wrote a nice post about pricing, network and failure resilience: What is the story with AWS storage?
Present (~2016):
instaclustr claims:
By running Cassandra on Amazon EBS, you can run denser, cheaper Cassandra clusters with just as much availability as ephemeral storage instances.
Nice presentation here: AWS re:Invent 2015 | (BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second on 60 Nodes
All in all, I suggest doing a TCO analysis and, if there isn't a big difference in price, choosing EBS because of the out-of-the-box ability to take snapshots. What's more, chances are EBS will keep improving over time.

Ambari scaling memory for all services

Initially I had two machines to set up Hadoop, Spark, HBase, Kafka, ZooKeeper and MR2. Each of those machines had 16GB of RAM. I used Apache Ambari to set up the two machines with the above-mentioned services.
Now I have upgraded the RAM of each of those machines to 128GB.
How can I now tell Ambari to scale up all its services to make use of the additional memory?
Do I need to understand how the memory is configured for each of these services?
Is this part covered in Ambari documentation somewhere?
Ambari calculates recommended settings for each service's memory usage at install time, so a change in memory after install will not scale things up automatically. You would have to edit these settings manually for each service. To do that, yes, you would need an understanding of how memory should be configured for each service. I don't know of any Ambari documentation that recommends memory configuration values for each service, so I would suggest one of the following routes:
1) Take a look at each service's documentation (YARN, Oozie, Spark, etc.) and see what it recommends for memory-related parameter configurations.
2) Take a look at the Ambari code that calculates the recommended values for these memory parameters and use those equations to come up with new values that account for your increased memory.
I used this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/determine-hdp-memory-config.html
Also, SmartSense is a must: http://docs.hortonworks.com/HDPDocuments/SS1/SmartSense-1.2.0/index.html
We need to define the cores, memory, disks, and whether we use HBase or not, and then the script provides the memory settings for YARN and MapReduce.
[root@ttsv-lab-vmdb-01 scripts]# python yarn-utils.py -c 8 -m 128 -d 3 -k True
Using cores=8 memory=128GB disks=3 hbase=True
Profile: cores=8 memory=81920MB reserved=48GB usableMem=80GB disks=3
Num Container=6
Container Ram=13312MB
Used Ram=78GB
Unused Ram=48GB
yarn.scheduler.minimum-allocation-mb=13312
yarn.scheduler.maximum-allocation-mb=79872
yarn.nodemanager.resource.memory-mb=79872
mapreduce.map.memory.mb=13312
mapreduce.map.java.opts=-Xmx10649m
mapreduce.reduce.memory.mb=13312
mapreduce.reduce.java.opts=-Xmx10649m
yarn.app.mapreduce.am.resource.mb=13312
yarn.app.mapreduce.am.command-opts=-Xmx10649m
mapreduce.task.io.sort.mb=5324
Apart from this, the guide also has formulas to calculate the values manually. I tried these settings and they worked for me.
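Here is a rough Python sketch of those formulas as I understand them from the HDP guide; the reserved-memory table and rounding rules below are approximations (they happen to reproduce the output above for 8 cores / 128GB / 3 disks / HBase), so double-check against the guide before relying on the numbers.

# Approximate HDP/YARN memory calculation (mirrors what yarn-utils.py does).
import math

def yarn_memory_settings(cores, memory_gb, disks, hbase=True):
    # Reserved system memory by total RAM in GB (abbreviated version of the HDP table).
    reserved_table = {8: 2, 16: 2, 24: 4, 48: 6, 64: 8, 96: 12, 128: 24, 256: 32, 512: 64}
    reserved = reserved_table.get(memory_gb, max(2, memory_gb // 8))
    if hbase:
        # HBase gets roughly the same reservation again on large hosts.
        reserved += reserved_table.get(memory_gb, 2)
    usable_mb = (memory_gb - reserved) * 1024

    min_container_mb = 2048 if memory_gb > 24 else 1024
    containers = int(min(2 * cores, math.ceil(1.8 * disks), usable_mb / min_container_mb))
    # Container size rounded down to a multiple of 512MB.
    container_mb = max(min_container_mb, (usable_mb // containers) // 512 * 512)

    return {
        "yarn.nodemanager.resource.memory-mb": containers * container_mb,
        "yarn.scheduler.minimum-allocation-mb": container_mb,
        "yarn.scheduler.maximum-allocation-mb": containers * container_mb,
        "mapreduce.map.memory.mb": container_mb,
        "mapreduce.map.java.opts": f"-Xmx{int(0.8 * container_mb)}m",
        "mapreduce.reduce.memory.mb": container_mb,
        "mapreduce.reduce.java.opts": f"-Xmx{int(0.8 * container_mb)}m",
        "yarn.app.mapreduce.am.resource.mb": container_mb,
        "mapreduce.task.io.sort.mb": int(0.4 * container_mb),
    }

print(yarn_memory_settings(8, 128, 3, True))   # 6 containers of 13312MB, as in the output above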

How to set up a Cassandra cluster in the cloud

I'm new to Cassandra. I installed Cassandra on my EC2 machine, but how can I configure Cassandra in cluster mode?
Is there any link that would be helpful?
Thanks in advance
See "Step 3: Running a cluster" in the Cassandra documentation (let me know if that is not enough for you).
You should also read the last section (updated yesterday (110310) by jbellis) on this page
I just tried this one today:
https://cloud.google.com/solutions/cassandra/
Google Cloud lets you deploy a pre-configured cluster of Cassandra nodes within a few mouse clicks. While the service is not free, they still give you a 60-day trial.
Another option is to follow their script for deploying a multi-node cluster:
https://cloud.google.com/solutions/cassandra/deployment-details
