We have a configuration-driven ETL framework built on top of Apache Spark, and we are designing a common shell script to wrap spark-submit.
We looked at various blog posts and the Spark documentation for direction, but all we found was the specification of spark-submit (much like a UNIX man page). What we are looking for is a how-to article or a set of best practices to follow when designing a common shell script for spark-submit.
Here is our plan so far.
To set the context for our shell script, assume that we have many applications in our project and each of the applications will have many jobs.
Environment Details:
Spark Version: 2.3.2,
Deployment Mode: Cluster,
Programming Language: Scala,
Scala Version: 2.11.8,
Cluster Manager: YARN
The core of our script is the --properties-file option of the spark-submit command. Every setting that we would otherwise pass through "--conf" should come from a configuration file, which we pass in through --properties-file. This way the shell script does not have to change when the configuration changes.
To allow configurations to be selected dynamically, we plan to have several configuration files, as listed below:
Based on memory - one file for each memory-related category (x-small, small, medium, large, x-large). All the sizing properties for that category, such as cores and memory for the driver and executors, go here.
Common - all the common properties (jars, convertMetastoreParquet, and any other common conf) go here.
Application-specific - properties or overrides specific to an application go here.
The shell script takes the application name and the memory sizing category as input and picks the conf file that corresponds to the memory category. The contents of the common conf file are appended to it, and if an (optional) application-specific override file exists, its contents are appended as well. Finally, the consolidated file is passed to the --properties-file option.
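The layering described above could be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions: the function names (parse_properties, merge_layers, render_properties) and the sample keys and values are hypothetical, a key that appears in a later layer is assumed to replace the earlier value, and the parser only handles simple "key value" or "key=value" lines with # comments.

```python
import re


def parse_properties(text):
    """Parse simple 'key value' or 'key=value' lines; skip blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split once, at the first '=' or whitespace run that follows the key.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props


def merge_layers(layers):
    """Merge dicts in order; a key in a later layer replaces the earlier value."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def render_properties(props):
    """Render properties as 'key value' lines for a consolidated file."""
    return "".join(f"{key} {value}\n" for key, value in props.items())


if __name__ == "__main__":
    # Hypothetical file contents, in the order described: memory, common, app.
    memory = parse_properties("spark.executor.memory 4g\nspark.executor.cores 2\n")
    common = parse_properties("spark.sql.hive.convertMetastoreParquet=false\n")
    app = parse_properties("# override for one application\nspark.executor.memory 8g\n")
    print(render_properties(merge_layers([memory, common, app])), end="")
```

The rendered text would then be written to a temporary file whose path is handed to --properties-file. Merging by key rather than concatenating raw text keeps a single entry per property, so the result does not depend on how duplicate keys in one file are resolved.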
What I wanted to know is:
Are there any blogs/videos that list best practices for designing such a shell script?
We are planning to use dynamic resource allocation, so we are not setting the executor instances parameter through the configuration files. Is there any downside to using dynamic allocation in a production cluster, other than the additional time taken for provisioning/decommissioning resources at run time?
Thanks
What I am trying to achieve?
I am trying to enable the S3A magic committer for my Spark 3.3.0 application running on a YARN (Hadoop 3.3.1) cluster, to see performance improvements in my app during S3 writes. IIUC, my Spark application is writing about 21 GB of data with 30 tasks in the corresponding Spark stage (see below image).
My setup
I have a server which has the Spark client. The Spark client submits the PySpark application to the YARN cluster in client mode.
What I tried
I am using the following config (set via the PySpark SparkConf) to enable the committer:
"spark.sql.sources.commitProtocolClass": "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol"
"spark.sql.parquet.output.committer.class": "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter"
"spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a": "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"
"spark.hadoop.fs.s3a.committer.name": "magic"
"spark.hadoop.fs.s3a.committer.magic.enabled": "true"
I also downloaded the spark-hadoop-cloud jar to the jars/ directory of the Spark home on the NodeManagers and on my Spark client servers.
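For reference, the settings listed above can be kept in one dict and applied to a session builder in a loop. This is a hedged sketch rather than a definitive setup: MAGIC_COMMITTER_CONF and apply_conf are illustrative names, and the builder argument is assumed to expose a chainable config(key, value) method, as pyspark.sql.SparkSession.builder does.

```python
# Settings copied from the list above; nothing here is new configuration.
MAGIC_COMMITTER_CONF = {
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
}


def apply_conf(builder, conf):
    """Call builder.config(key, value) for each entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, usage would look like apply_conf(SparkSession.builder.appName("my-app"), MAGIC_COMMITTER_CONF).getOrCreate(), where the application name is a placeholder.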
Changes that I see after applying the aforementioned configs:
I see PRE __magic/ directory if I run aws s3 ls <write-path> when the job is running.
I don't see the warning WARN AbstractS3ACommitterFactory: Using standard FileOutputCommitter to commit work. This is slow and potentially unsafe. anymore.
A _SUCCESS file gets created with (JSON) content. One of the key-value that I see in that file is "committer" : "magic".
Hence, I believe my configs are getting applied correctly.
What I expect
I have read in multiple articles that this committer is expected to show a performance boost (e.g. this article claims 57-77% time reduction). Hence, I expect to see a significant reduction (from 39s) in the "duration" column of my "parquet" stage when I use the configs shared above.
Some other points that might be of value
When I use "spark.sql.sources.commitProtocolClass": "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol", my app fails with the error java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol.
I have not looked into enabling S3Guard, as S3 now provides strong consistency.
Correct, you don't need S3Guard.
The com.hortonworks binding was for the WIP committer work. The binding classes for wiring up Spark/Parquet are all in spark-hadoop-cloud and have org.apache.spark prefixes. You seem to be OK there.
The simple test for which committer is live is to print the JSON _SUCCESS file. If that is a 0-byte file, you are still using the old committer. It does sound like you are.
Grab the latest Spark + Hadoop build you can get; there are always ongoing improvements, with Hadoop 3.3.5 bringing a big enhancement there.
You should see performance improvements compared to the v1 committer, with commit speed O(files) rather than O(data). It is also correct, which the v1 algorithm doesn't offer on S3 (and which v2 doesn't offer anywhere).
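The _SUCCESS check mentioned above could be sketched as below. This is a minimal illustration, assuming the file has already been copied from the output path to local disk (for example with aws s3 cp); the function name committer_from_success is hypothetical, and the only field it reads is the "committer" key mentioned in the question.

```python
import json


def committer_from_success(data):
    """Return the committer name recorded in the contents of a _SUCCESS file.

    Returns None for an empty (0-byte) file, which indicates the old
    committer, or when the JSON has no "committer" key.
    """
    if not data.strip():
        return None
    manifest = json.loads(data)
    return manifest.get("committer")
```

For example, reading a local copy with open("_SUCCESS", "rb").read() and passing the bytes to committer_from_success should yield "magic" for the file described in the question, and None for a 0-byte file.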
By default, a MarkLogic (ML) HTTP server uses the Module DB inside ML.
(It seems all ML training materials refer to that type of configuration.)
Any changes to the XQuery programs must first be uploaded into the Module DB. That can be done with the mlLoadModules or mlReloadModules ml-gradle commands.
CI/CD does not access the ML cluster directly. Everything goes through ml-gradle from a machine dedicated to code deployment to the different ML environments (dev/uat/prod etc.).
However, it is also possible to configure the ML app server to use the XQuery programs from a physical disk location, as in the screenshot below.
With that configuration, it is not necessary to reload the programs into the ML Module DB.
The program changes have to be on the ML server itself, so CI/CD will need direct access to the ML cluster. One advantage of this approach is that developers can easily see whether the program changes have indeed been deployed, as all changes sit on disk as readable text files.
Questions:
Which way is better? Why?
Is there any ML query performance difference between these two approaches?
For the physical file approach, does it mean that CI/CD will need to deploy the program changes to all the ML hosts in the ML cluster? (I guess it is not a concern if the HTTP server refers to XQuery programs from the Module DB inside ML; the ML cluster will auto-sync the code among the different hosts.)
In general, it's recommended to deploy modules to a database rather than the filesystem.
This makes deployment simpler, as you only have to load the module once into the modules database rather than putting the file on every single host. If you use the filesystem, you need to put those files on every host in the cluster.
With a modules database, if you were to add nodes to the cluster, you don't have to also deploy the modules. You can then also take advantage of High Availability, backup and restore, and all the other features of a database.
Once a module is read, it is loaded into caches, so the performance impact should be negligible.
If you plan to use REST extensions, then you would need a modules database so that the configurations can be installed in that database.
Some might look to use filesystem for simple development on a single node, in which changes saved to the filesystem are made available without re-deploying. However, you could use something like the ml-gradle mlWatch task to auto-deploy modules as they are modified on the filesystem and achieve effectively the same thing using a modules database.
I have a simple Java program that sets up an MR job. I could successfully execute it on Hadoop infrastructure (Hadoop 2.x) using 'hadoop jar'. But I want to achieve the same thing using the java command, as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
You can run it with java the same way you run your 'hadoop jar' command, with the same parameters.
First, check that this command prints the Hadoop classpath:
$ hadoop classpath
Then add your custom jar to that classpath and pass your main class:
$ java -cp "$(hadoop classpath):/my/tools/jar/tools.jar" className
I am able to get mine working this way on my Hadoop cluster.
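The same invocation can also be assembled programmatically. The sketch below is illustrative only: build_java_command and hadoop_classpath are hypothetical helpers, the jar path and class name are the placeholders from this thread, the ':' separator assumes a POSIX host, and the classpath string is assumed to be whatever hadoop classpath prints on your cluster.

```python
import subprocess


def build_java_command(hadoop_cp, extra_jars, main_class, args=()):
    """Build the argv list for 'java -cp <hadoop classpath>:<jars> <main class>'."""
    entries = [hadoop_cp.strip()] + list(extra_jars)
    # Join non-empty entries with ':' (POSIX classpath separator).
    classpath = ":".join(entry for entry in entries if entry)
    return ["java", "-cp", classpath, main_class, *args]


def hadoop_classpath():
    """Ask the hadoop script for its classpath (requires hadoop on PATH)."""
    result = subprocess.run(
        ["hadoop", "classpath"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

On a cluster node, subprocess.run(build_java_command(hadoop_classpath(), ["/my/tools/jar/tools.jar"], "className"), check=True) would then launch the program. Passing the command as a list rather than a shell string means any wildcard entries in the classpath reach java unexpanded.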
I don't think you will find documentation on this. The hadoop command is a script, and it uses a lot of classes, e.g. FsShell for accessing the filesystem and RunJar when we run a jar. Adding the Hadoop-related libraries and configuration files to the classpath is handled in the hadoop command itself.
You had better take a look at the hadoop script.
How can you do that? Executing any jar file here means executing it in a distributed environment, where all the daemons work together to complete the execution.
We are not running locally or on the local file system, so the job needs to be executed according to the norms of HDFS. I don't think we can execute it the way we would on a local file system.
Hadoop is a framework that simplifies distributed computing. Even before Hadoop, programmers knew about parallel processing and multithreading concepts. But when you deal with multiple machines, you need to know how to handle
Communication between machines
Network processing
What if one machine fails? Fault tolerance
and many more, which is a huge amount of work. That's where Hadoop simplifies your job: it takes care of all the operating-level stuff so you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer, because your program will not work just by passing parameters to it. You would need to write a lot of libraries to deal with distributed computing. If you want to explore them, I would suggest you go ahead and read the Hadoop source code.
http://hadoop.apache.org/version_control.html
I'm running an AWS EMR Pig job using script-runner.jar as described here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
Now, I want to hook up Netflix's Lipstick to monitor my scripts. I set up the server, but I can't quite figure out the last step in the wiki (https://github.com/Netflix/Lipstick/wiki/Getting-Started):
hadoop jar lipstick-console-[version].jar -Dlipstick.server.url=http://$LIPSTICK_URL
Should I substitute script-runner.jar with this?
Also, after following the build process in the wiki, I ended up with 3 different console jars:
lipstick-console-0.6-SNAPSHOT.jar
lipstick-console-0.6-SNAPSHOT-withHadoop.jar
lipstick-console-0.6-SNAPSHOT-withPig.jar
What is the purpose of the latter two jars?
UPDATE:
I think I'm making progress, but it still does not seem to work.
I set the pig.notification.listener parameter as described here, along with the Lipstick server URL. There is more than one way to do it in EMR. Since I am using the Ruby API, I had to specify a step:
hadoop_jar_step:
  jar: 's3://elasticmapreduce/libs/script-runner/script-runner.jar'
  properties:
    - pig.notification.listener.arg: com.netflix.lipstick.listeners.LipstickPPNL
    - lipstick.server.url: http://pig_server_url
Next, I added lipstick-console-0.6-SNAPSHOT.jar to the Hadoop classpath. For this, I had to create a bootstrap action as follows:
bootstrap_actions:
  - name: copy_lipstick_jar
    script_bootstrap_action:
      path: #s3 path to bootstrap_lipstick.sh
where the contents of bootstrap_lipstick.sh are:
#!/bin/bash
hadoop fs -copyToLocal s3n://wp-data-west-2/load_code/java/lipstick-console-0.6-SNAPSHOT.jar /home/hadoop/lib/
The bootstrap action copies the lipstick jar to cluster nodes, and /home/hadoop/lib/ is already in hadoop classpath (EMR takes care of that).
It still does not work, but I think I am missing something really minor ... Any ideas appreciated.
Thanks!
Currently Lipstick's Main class is a drop-in replacement to Pig's Main class. This is a hack (and far from ideal) to have access to the logical and physical plans for your script before and after optimization that are simply not accessible otherwise. As such it unfortunately won't work to just register the LipstickPPNL class as a PPNL for Pig. You've got to run Lipstick Main as though it was Pig.
I have not tried to run lipstick on EMR but it looks like you're going to need to use a custom jar step, not a script step. See the docs here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-launch-custom-jar-cli.html
The jar name would be the lipstick-console-0.6-SNAPSHOT-withHadoop.jar. It contains all the necessary dependencies to run Lipstick. Additionally the lipstick.server.url will need to be set.
Alternatively, you might take a look at https://www.mortardata.com/ which runs on EMR and has lipstick integration built-in.
I'm new to Hadoop, and I have to process an input file. I want to process each line, and the output should be one file per line.
I searched the internet and found MultipleOutputFormat and generateFileNameForKeyValue.
But most people write it with the JobConf class. As I'm using Hadoop 0.20.1, I think the Job class takes its place, and I don't know how to use the Job class to generate multiple output files by key.
Could anyone help me?
The Eclipse plugin is mainly used to submit and monitor jobs, as well as to interact with HDFS, against a real or 'pseudo' cluster.
If you're running in local mode, then I don't think the plugin gains you anything, seeing as your job will run in a single JVM. With this in mind, I would say include the most recent 1.x hadoop-core in your Eclipse project's classpath.
Either way, MultipleOutputFormat has not been ported to the new mapreduce package (in neither 1.1.2 nor 2.0.4-alpha), so you'll either need to port it yourself or find another way (maybe MultipleOutputs; the Javadoc page has some notes on using MultipleOutputs).