Databricks and Delta cache setting - azure-databricks

I am trying to follow the instructions on the MSFT website to use delta cache and hoping someone would help me understand it a little better:
https://learn.microsoft.com/en-us/azure/databricks/delta/optimizations/delta-cache
So In the guide it mentions that I should use Standard_E or L series of VMs. Our workload is now set to use the F series machines and when I tried to use only E or L it seemed that the job ran longer and would be using more DBUs.
I did however notice that the Dv3 series allow you to use delta caching (ex: Standard_D16s_v3 VMs). I tried to run some of our workloads using those types of machines and notices that under the storage tab it now shows a similar screen as in the MSFT docs:
Problem is that I am not sure if that is the right way to go about this. The reason I wanted to try to use Dv3 VMs was because it was relatively comparable to the F series, but also seem to allow the delta caching.
I am also wondering if the MSFT recommendation of using the following settings is correct or if they can be different:
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false
Has any one else played with this and can recommend what they did it would be much appreciated.
As background we have the databricks clusters spin up using our Databricks Linked Service (from ADF) and in that linked service we put the following settings:
This is what sends the config settings to the automated clusters that are spun up when we execute Databricks notebooks though ADF.
Thank you

Related

Replica database is not accessible in Read Scale Availability Group in SQL Server 2017 Standard edition

I'm looking into ways of replicating databases from On-Premise environments to Azure and one of the options I found was setting up a Read-Scale availability group.
The reason I'm using a Read-Scale and not an Always On availability group, is because I don't won't to use SQL Server Enterprise edition due to the cost.
I followed a tutorial from Microsoft (MS TUTORIAL) to set this all up and in the end, I think I got it working as my database appeared on the Azure environment.
However, the problem is that my replica always stays in the Synchronizing state - which is probably due to the fact I chose Asynchronous Replication by using the AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT command - but even worse is that I can't access the database itself.
Each time I try to fire a query against it, it comes back with an Object is not accessible exception.
After some reading, I found that the cause of this might be because my replica doesn't have a secondary role. Trying to set this via the ... SECONDARY_ROLE({ALLOW_CONNECTIONS = ALL})... command, clearly states that this feature is not available in the Standard version of SQL Server.
My whole confusion comes from the fact that on the Microsoft documentation (MS DOCS), it says that With availability groups, one or more secondary replicas can be configured to support read-only access to secondary databases. which is exactly what I'm not succeeding in.
Did anybody have the same issue, or knows how to configure the Read-Scale availability group on SQL Server Standard so my second replica is accessible and readable as well?
P.S. I did look at the actual SQL Replication with Transaction Replication, but there are quite a bit of moving parts there, so I'm exploring all options before making a decision.
Based on a twitter conversation I found out that you will need to create a snapshot of the database in Secondary Replica in order to read from there.
Please read this tweeter thread.
I also added a suggestion in the feedback channel to fix the documentation.

How do you use s3a with spark 2.1.0 on aws us-east-2?

Background
I have been working on getting a flexible setup for myself to use spark on aws with docker swarm mode. The docker image I have been using is configured to use the latest spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.
This is working, and I have been just going through to test out the various connectivity paths that I plan to use. The issue I came across is with the uncertainty around the correct way to interact with s3. I have followed the trail on how to provide the dependencies for spark to connect to data on aws s3 using the s3a protocol, vs s3n protocol.
I finally came across the hadoop aws guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
I ended up being too far off the standard configuration by being on us-east-2, making me uncertain if I had a problem with the jar files. To eliminate the region issue, I set things back up on the regular us-east-1 region, and I was able to finally connect with s3a. So I have narrowed down the problem to the region, but thought I was doing everything required to operate on the other region.
Question
What is the correct way to use the configuration variables for hadoop in spark to use us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the console for the notebook these download after creating the context, and adding these took me from being completely broken, to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the aws config, I tried both the below method and by just using the above conf, and doing conf.set(spark.hadoop.fs.<config_string>, <config_value>) pattern equivalent to what I do below, except doing it this was I set the values on conf before creating the spark context.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note, is that I also tried an alternative endpoint for us-east-2 of s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1, and commenting out the endpoint config, the above works for me. To me, it seems like endpoint config isn't being used for some reason.
us-east-2 is a V4 auth S3 instance so, as you attemped, the fs.s3a.endpoint value must be set.
if it's not being picked up then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes, the config, even when its lacking in auth details.
Some tactics
set the value is spark-defaults
using the config you've just created, try to explicitly load the filesystem via a call to Filesystem.get(new URI("s3a://bucket-name/parquet-data-name", myConf) will return the bucket with that config (unless it's already there). I don't know how to make that call in .py though.
set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get command
Adding more more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I can review it and get it in

Can't create multiple striped disk volume in Azure VM

I created an Azure Medium instance Windows 2012 Server and I'm having a problem striping together multiple Azure data disks into a single volume using the Server Manager tool.
In Azure I provisioned the medium instance and then created 4 data disks of 60GB each. I then rdp'ed into the server and inside Server Manager under File and Storage Services\Volumes I saw in the Disks section my 4 data disks listed under the C:\ and D:\ drives that come with this instance. I initialized my 4 data disks (later I also tried NOT initializing them) but when I clicked on "Storage Pools" in the nav bar under the Virtual Disk section I only saw 1 of my data disks.
I saw no way to add any of the other 3 data disks into my Storage Pool and then of course into the subsequent Virtual Disk. This problem limits me to just one data disk in my Virtual Disk. I have tried this many different times and the result is always the same.
Does anyone know what can be causing this or have steps to do the same thing I'm trying to do?
Thanks
If you're wondering why I'm trying to stripe these instead of using just 1 large data disk, this article explains the performance benefits of doing so:
http://blog.aditi.com/cloud/windows-azure-virtual-machines-lessons-learned/
In my blog post I explain how to do this, although perhaps the level of detail you are looking for isn't there. Still, everyone that followed this post (it was a lab) was able to create the striped volume. The blog post is a complete lab; go down half way to see the section about the striped volume. Let me know if you have any questions.
http://geekswithblogs.net/hroggero/archive/2013/03/20/windows-azure-it-roadshow-lab-i.aspx
Thanks
I hit the same problem and some Googling revealed that this is a bug in Server Manager (sorry, can't find the link). The workaround is to use PowerShell to create the pool. These commands will create a new Storage Pool called "Storage" and assign all the available disks to it:
$spaces = (Get-StorageSubSystem | where {$_.Model -eq "Storage Spaces"}[0]).UniqueID
New-StoragePool -FriendlyName "Storage" -StorageSubSystemUniqueId $spaces -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

SCM management of AppFabric Cache Cluster

I'm working on building out a standard set of configurations for our cache clusters within App Fabric. My goal is to have a repeatable cache settings configuration when we load up a new environment (so server names are different, number of hosts, and other environmental factors).
My initial pass was to utilize the XML available from Export-CacheClusterConfig and simply change server names and size attributes in the <hosts> section, but I'm not sure what else is automatically registered with those values (the hostId parameter, for example).
My next approach that I've considered is a PowerShell script to simply build up the various caches with the correct parameters passed in that would simply run as a post-deploy step.
Anyone else have experience with repeatable AppFabric cache cluster deployments?
After trying both, the more successful option seems to be a combination of two factors. Management of the Cache Cluster (host information) is primarily an operations concern and is managed best by the operations team (i.e. those guys that read Server Fault). Since this information is stored in the configuration as well (and would require an XML file obtained from Export-CacheClusterConfig for each environment) it's best left to the operations team on how they want to manage it. Importing the wrong file (with the incorrect host information) has led to a number of issues.
So, we're left with PowerShell scripts. Here's a sample that I have. It could be cleaned up (check for Cache existence first) but you get the general idea. It's also much easier to store in source control (as it's just one file).
New-Cache -CacheName CRMTickets -Eviction None -Expirable false -NotificationsEnabled true
New-Cache -CacheName ConsultantCache -Eviction Lru -Expirable true -TimeToLive 60
New-Cache -CacheName WorkitemCache -Eviction None -Expirable true -TimeToLive 60

Controlling PerfMon logging on multiple systems

I need to use PerfMon to collect data from several machines, and I need to be able to turn collection on/off at certain times. I've got all the data points configured on each machine; I just need to start/stop PerfMon, and to start/stop collection of a set of data points.
For reasons I won't go into, I can't simply configure all collection from a single PerfMon instance on a single machine - I need to start/stop PerfMon data collection on multiple machines at (about) the same time.
The systems involved are all running Windows 2003 Server, and I'm unable to install any additional software on the systems.
Is it possible to do this using e.g. PowerShell (or something else that's normally installed on Windows 2003 servers)?
Take a look at logman.exe. You can use it to create countersets (if you already have a template definition) as well as to start/stop perfmon data collection. See this Overview of Performance Monitor for some information on security requirements of the account executing logman.exe.
From .bat, MSBuild or Nant you can do something like:
Logman start [logname] -s [computername]
or
Logman stop [logname] -s [computername]
Once you've collected all those log files you can use relog.exe to import them into a sql instance so that you can more easily query/report against them.
I know you mentioned you can't install any additional software, but...depending on the setup of your lab, or other environment, you might want to consider having perfmon log to a sql data store. Even if its a Sql Express instance running on a server in the environment it might make your life easier. At least it could/would skip the importing of the data into a single store to make it easy to query/analyze.
Good luck!
Z

Resources