Azure Databricks processing files differently based on the configuration

We have an application which processes a huge Excel file and calculates data from it based on different conditions (written in a Scala notebook).
The issue we're facing is that the same file produces inconsistent results at different times and/or under different Azure Databricks compute configurations.
We've already double-checked our Scala notebook code and it doesn't have any bugs; it might be something on the configuration end (not sure).
Below is the current configuration of my dev compute


Magic committer not improving performance in a Spark3+Yarn3+S3 setup

What I am trying to achieve
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a YARN (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application is writing about 21 GB of data with 30 tasks in the corresponding Spark stage (see the image below).
My setup
I have a server which hosts the Spark client. The Spark client submits the application to the YARN cluster in client mode with PySpark.
What I tried
I am using the following config (set via the PySpark Spark conf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar into the jars/ directory of the Spark home on the NodeManagers and my Spark client servers.
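For reference, here is a minimal PySpark sketch (the app name, example DataFrame, and output path are placeholders, not from the original post) showing one way these settings can be applied when building the session:

```python
# Minimal sketch: applying the committer settings listed above via SparkSession.builder.
# The app name and s3a:// output path below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("magic-committer-demo")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter")
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .getOrCreate()
)

# Any Parquet write to an s3a:// path should then go through the configured committer.
spark.range(1000).write.mode("overwrite").parquet("s3a://my-bucket/my/write/path/")
```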
Changes that I see after applying the aforementioned configs:
I see a PRE __magic/ directory if I run aws s3 ls <write-path> while the job is running.
I don't see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe. anymore.
A _SUCCESS file gets created with (JSON) content. One of the key-value pairs that I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to show a performance boost (e.g. this article claims a 57-77% time reduction). Hence, I expect to see a significant reduction (from 39s) in the "duration" column of my "parquet" stage when I use the configs shared above.
Some other points that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3Guard, as S3 now provides strong consistency.
Correct, you don't need S3Guard.
The com.hortonworks binding was for the WIP committer work. The binding classes for wiring up Spark/Parquet are all in spark-hadoop-cloud and have org.apache.spark prefixes. You seem to be OK there.
The simple test for which committer is live is to print the JSON _SUCCESS file; if that is a 0-byte file, you are still using the old committer. It does sound like you are.
Grab the latest Spark + Hadoop build you can get; there are always ongoing improvements, with Hadoop 3.3.5 bringing a big enhancement there.
You should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). It is also correct, which the v1 algorithm doesn't offer on S3 (and which v2 doesn't offer anywhere).
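As a quick way to script the _SUCCESS check described above, here is a minimal sketch (the bucket name and key are placeholders) that uses boto3 to read the marker and report which committer wrote it:

```python
# Hypothetical check: read the _SUCCESS marker from the S3 write path and report the committer.
# Replace the bucket and key with your actual write location.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="my/write/path/_SUCCESS")
body = obj["Body"].read()

if not body:
    print("Empty _SUCCESS file: the classic FileOutputCommitter is still in use.")
else:
    summary = json.loads(body)
    # Expect "magic" here when the S3A magic committer did the commit.
    print("committer:", summary.get("committer"))
```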

Any performance difference between loading ML Modules (XQuery / JavaScript) from a physical disk location and loading from the ML Module DB inside ML?

By default, the ML HTTP server will use the Module DB inside ML.
(It seems all ML training materials refer to that type of configuration.)
Any changes to the XQuery programs need to be uploaded into the Module DB first. That can be accomplished using the mlLoadModules or mlReloadModules ml-gradle commands.
CI/CD does not access the ML cluster directly. Everything goes via ml-gradle from a machine dedicated to code deployment to the different ML environments like dev/uat/prod, etc.
However, it is also possible to configure the ML app server to load the XQuery programs from a physical disk location, as in the screenshot below.
With that configuration, it is not required to reload the programs into the ML Module DB.
The changes to the programs have to be on the ML server itself, and CI/CD will need to access the ML cluster directly. One advantage of this approach is that a developer can easily see whether the program changes have indeed been deployed, as all changes sit as readable text files on disk.
Questions:
Which way is better? Why?
Any ML query performance difference between these two approaches?
For the physical file approach, does it mean that CI/CD will need to deploy the program changes to all the ML hosts in the ML cluster? (I guess it is not a concern if the HTTP server reads XQuery programs from the Module DB inside ML; the ML cluster will auto-sync the code among the different hosts.)
In general, it's recommended to deploy modules to a database rather than the filesystem.
This makes deployment simpler: you only have to load a module once into the modules database, whereas with the filesystem you need to put those files on every host in the cluster.
With a modules database, if you were to add nodes to the cluster, you don't have to also deploy the modules. You can then also take advantage of High Availability, backup and restore, and all the other features of a database.
Once a module is read, it is loaded into caches, so the performance impact should be negligible.
If you plan to use REST extensions, then you would need a modules database so that the configurations can be installed in that database.
Some might look to use the filesystem for simple development on a single node, where changes saved to the filesystem are made available without re-deploying. However, you could use something like the ml-gradle mlWatch task to auto-deploy modules as they are modified on the filesystem and achieve effectively the same thing using a modules database.
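For example, a minimal sketch assuming a standard ml-gradle project layout: running the watch task from the project root keeps pushing modules into the modules database as they are saved on disk.

```
./gradlew mlWatch   # ml-gradle task mentioned above; redeploys modules as they change locally
```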

Couchbase/Elasticsearch connector for multiple buckets

Is there a way to replicate 2 or more Couchbase buckets to Elasticsearch using a single configuration file?
I am currently using this version of the Couchbase Elasticsearch connector:
https://docs.couchbase.com/elasticsearch-connector/4.0/index.html
I replicate my data correctly, but I need to run a command per bucket, using a different configuration file (.toml) each time.
By the way, I could not run the cbes command multiple times on the same server, as the metrics port 31415 is already in use.
Is there any way to handle many connector groups at once?
In version 4.0 a single connector process can replicate from only one bucket. This is because the indexing rules and all of the underlying network connections to Couchbase Server are scoped to the bucket level.
The current recommendation is to create multiple config files and run multiple connector processes. It's understood that this can be complicated to manage if you're replicating a large number of buckets.
If you're willing to get creative, you could use the same config file template for multiple buckets. The idea is that you'd write a config file with some placeholders in it, and then generate the actual config file by running a script that replaces the placeholders with the correct values for each connector.
The next update to the connector will add built-in support for environment variable substitution in the config file. This could make the templating approach easier.
Here are some options for avoiding the metrics port conflict:
Disable metrics reporting by setting the httpPort key in the [metrics] section to -1.
OR Use a random port by setting it to 0.
OR Use the templating idea described above, and plug a unique port number into each generated config file.
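To illustrate the templating idea together with the per-process metrics port option, here is a minimal sketch (the template file name, placeholder tokens, and bucket names are hypothetical; the launch step is only sketched in a comment):

```python
# Generate one connector config per bucket from a shared template.
# The template file and the {{BUCKET}} / {{METRICS_PORT}} placeholders are hypothetical;
# only the default metrics port itself comes from the connector docs discussed above.
from pathlib import Path

template = Path("cbes-template.toml").read_text()

buckets = ["bucket-one", "bucket-two"]   # one connector process per bucket
base_port = 31415                        # default metrics port; give each process its own

for i, bucket in enumerate(buckets):
    config = (template
              .replace("{{BUCKET}}", bucket)
              .replace("{{METRICS_PORT}}", str(base_port + i)))
    Path(f"cbes-{bucket}.toml").write_text(config)
    # Then start one cbes process per generated config file.
```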
It's worth mentioning that a future version of the connector will support something we're calling "Autonomous Operations Mode". When the connector runs in this mode, the configuration will be stored in a central location (probably a Consul server). It will be possible to reconfigure a connector group on-the-fly, and add or remove workers to the group without having to stop all the workers and edit their config files. Hopefully this will simplify the management of large deployments.

Automating H2O Flow: run flow from CLI

I've been an H2O user for a little over a year and a half now, but my work has been limited to the R API; H2O Flow is relatively new to me. If it's new to you as well, it's basically 0xdata's version of IPython; however, IPython lets you export your notebook to a script, and I can't find a similar option in Flow...
I'm at the point of moving a model (built in Flow) to production, and I'm wondering how to automate it. With the R API, after the model was built and saved, I could easily load it in R and make predictions on new data simply by running nohup Rscript <the_file> & from the CLI, but I'm not sure how I can do something similar with Flow, especially since it's running on Hadoop.
As it currently stands, every run is broken into three pieces, with Flow creating a relatively clunky process in the middle:
preprocess data, move it to hdfs
start h2o on hadoop, nslookup the IP address h2o is running on, manually run the flow cell-by-cell
run the post-prediction clean-up and final steps
This is a terribly intrusive production process, and I want to tie up all the loose ends, but Flow is making it rather difficult. To distill the question: is there a way to compress the flow into a Hadoop jar and then later just run the jar like hadoop jar <my_flow_jar.jar> ...?
Edit:
Here's the h2o R package documentation. The R API allows you to load an H2O model, so I tried loading the flow (as if it were an H2O model), and unsurprisingly it did not work (it failed with a water.api.FSIOException), as a flow is not technically an H2O model.
This is really late, but H2O Flow models now have auto-generated Java code representing the trained model (called a POJO), which can be cut and pasted (say, from your remote Hadoop session into a local Java file). See here for a quick-start tutorial on how to use the Java object (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/1/docs-website/h2o-docs/pojo-quick-start.html). You'll have to refer to the H2O Java API (https://h2o-release.s3.amazonaws.com/h2o/rel-turing/8/docs-website/h2o-genmodel/javadoc/hex/genmodel/easy/EasyPredictModelWrapper.html) to start customizing how you want to use the POJO, but you essentially use it as a black box that makes predictions on properly formatted inputs.
Assuming your Hadoop session is remote, replace "localhost" in the example with the IP address of your (remote) Flow session.

Spring-XD: Deployment of modules to certain containers

Three questions regarding deployment of modules to Spring XD container:
For certain sources and sinks it's necessary to specify which container a module should be deployed to. Let's say we have a lot of containers on different machines, and we want to establish a stream reading a log file from one machine. The source module of type tail has to be deployed to the container running on the machine with the log file. How can you do that?
You may want to restrict the execution of modules to a group of containers. Let's say we have some powerful machines for our batch processing with containers on it, and we have other machines where our container runs parallel to some other processes only for ingesting data (log files etc.). Is that possible?
If we have a custom module, is it possible to add the module xml and the jars just to certain containers, so that those modules are just executed there? Or is it necessary that we have the same module definitions on all containers?
Thanks!
You bring up excellent points. We have been doing some design work around these issues, in particular #1 and #2, and will have some functionality here in our next milestone release in about a month's time.
In terms of #3, the model for resolving the jars that are loaded in the containers requires the local file system or a shared file system to resolve the classpath. This has also come up in our prototypes of using Spring XD on the Cloud Foundry PaaS, and we want to provide a more dynamic, at-runtime ability to locate and load new modules. There is no estimate yet on when that will be addressed.
Thanks for the questions!
Cheers,
Mark
