How do I get the working directory of a Spark executor in Java? [duplicate] - hadoop

This question already exists:
Copy files (config) from HDFS to local working directory of every spark executor
Closed 5 years ago.
I need to know the current working directory URI/URL of a Spark executor so I can copy some dependencies there before the job executes. How do I get it in Java? What API should I call?

The working directory is application specific, so you won't be able to get it before the application starts. It is best to use the standard Spark mechanisms:
--jars / spark.jars - for JAR files.
--py-files / spark.submit.pyFiles - for Python dependencies.
SparkFiles / --files / --archives - for everything else (a sketch of this option follows).
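To illustrate the last option: a file shipped with --files lands in each executor's work directory, and SparkFiles resolves its absolute path at runtime, so the directory never has to be known up front. A minimal Java sketch, assuming the job was submitted with --files /local/path/app.conf (the file name and class name are made up for the example):

import org.apache.spark.SparkFiles;
import org.apache.spark.sql.SparkSession;

public class DistributeConfig {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("distribute-config").getOrCreate();

        spark.range(0, 4).javaRDD().foreach(id -> {
            // SparkFiles.get resolves the file name to an absolute path inside
            // this executor's work directory, wherever that happens to be.
            // (The println goes to the executor's stdout log.)
            String confPath = SparkFiles.get("app.conf");
            System.out.println("Config available at: " + confPath);
        });

        spark.stop();
    }
}

SparkFiles.getRootDirectory() likewise returns the directory itself, if you need the path rather than a single file.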

Using spark with s3 fails on EMR, despite hadoop access working [duplicate]

This question already has answers here:
Spark read file from S3 using sc.textFile ("s3n://...)
(14 answers)
Closed 5 years ago.
I am trying to access an s3:// path with
spark.read.parquet("s3://<path>")
And I get this error
Py4JJavaError: An error occurred while calling o31.parquet. :
java.io.IOException: No FileSystem for scheme: s3
However, running the following line
hadoop fs -ls <path>
does work...
So I guess this might be a configuration issue between Hadoop and Spark.
How can this be solved?
EDIT
After reading the suggested answer, I've tried adding the jars, hard-coded, to the Spark config, with no success:
spark = SparkSession\
.builder.master("spark://" + master + ":7077")\
.appName("myname")\
.config("spark.jars", "/usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.221.jar,/usr/share/aws/aws-java-sdk/hadoop-aws.jar")\
.config("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2")\
.getOrCreate()
No success
The hadoop-aws dependency is missing from your project. Please add hadoop-aws to your build.
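As a hedged illustration, a minimal Java sketch of a job that reads from S3 once the connector is on the classpath, assuming it is supplied at submit time with something like spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.2 (the bucket, prefix and class name are placeholders, and the connector version has to match the Hadoop version Spark was built against):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadFromS3 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("read-from-s3").getOrCreate();

        // With hadoop-aws on the classpath, the s3a:// scheme is served by
        // S3AFileSystem; a bare s3:// scheme stays unregistered on a plain
        // Hadoop/Spark install, which is what produces "No FileSystem for scheme: s3".
        Dataset<Row> df = spark.read().parquet("s3a://my-bucket/my-prefix/");
        df.show();

        spark.stop();
    }
}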

Rename files created in hadoop - Spark [duplicate]

This question already has an answer here:
HDFS: move multiple files using Java / Scala API
(1 answer)
Closed 5 years ago.
Files created in HDFS via a Spark write have their own naming convention. To change it to a custom name there is an option via script, using hadoop fs -mv oldname newname.
Is there any other option available in Spark/Hadoop to provide a custom name to the created file?
Apache Spark does not provide any API for file system operations in HDFS, but you can always use the Hadoop FileSystem API to rename the file in HDFS. See the Hadoop FileSystem API documentation for the full set of operations available. For renaming, the following will work:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fileSystem = FileSystem.get(conf)
// FileSystem exposes mkdirs (not mkdir); it creates the target directory if missing.
fileSystem.mkdirs(new Path(newhdfs_dirPath))
fileSystem.rename(new Path(existinghdfs_dirpath + oldname), new Path(newhdfs_dirPath + newname))

Storm Topology deployment using pre deployed jar

We currently have a jar that contains 6 topologies. To deploy these topologies we currently do 6 separate calls using
/bin/storm jar $LOCAL_JAR $TOPOLOGY_CLASS $TOPOLOGY_NAME $PS_ENV $ZK_QUORUM -c nimbus.host=$NIMBUS_HOST $STORM_CONFIG_ARGS
Looking at the log output, each time a topology is submitted the jar is also uploaded to nimbus, i.e. there are 6 lines like this:
9937 [main] INFO o.a.s.StormSubmitter - Successfully uploaded topology jar to assigned location:...
I want to avoid uploading the jar multiple times. I have tried uploading the jar via scp and placing it at "uploadedJarLocation" on the nimbus node (I do this once).
Then I changed my deployment code to use the following for each of the topologies:
nimbusClient = NimbusClient.getConfiguredClient(storm_conf);
client = nimbusClient.getClient();
...
client.submitTopology(topologyName, uploadedJarLocation, jsonConf, topology.buildTopology());
This has sped things up and seems to work fine, but I want to ask:
Is this a safe approach? Can I safely reference the uploadedJarLocation I pre-uploaded to nimbus via scp?
Are there any alternative methods to avoid the multiple jar upload?
I know about StormSubmitter.submitJar as an alternative but have found it to be slow.

Create hdfs when using integrated spark build

I'm working with Windows and trying to set up Spark.
Previously I installed Hadoop in addition to Spark, edited the config files, ran hadoop namenode -format and away we went.
I'm now trying to achieve the same by using the bundled version of Spark that is pre-built with Hadoop - spark-1.6.1-bin-hadoop2.6.tgz
So far it's been a much cleaner, simpler process; however, I no longer have access to the command that creates the HDFS, the config files for HDFS are no longer present, and I have no 'hadoop' in any of the bin folders.
There wasn't a Hadoop folder in the Spark install; I created one for the purpose of winutils.exe.
It feels like I've missed something. Do the pre-built versions of Spark not include Hadoop? Is this functionality missing from this variant, or is there something else that I'm overlooking?
Thanks for any help.
Saying that Spark is built with Hadoop means that Spark is built with the Hadoop dependencies, i.e. with the clients for accessing Hadoop (or HDFS, to be more precise).
Thus, if you use a version of Spark which is built for Hadoop 2.6, you will be able to access the HDFS filesystem of a cluster running Hadoop 2.6 via Spark.
It doesn't mean that Hadoop is part of the package, or that downloading it installs Hadoop as well. You have to install Hadoop separately.
If you download a Spark release without Hadoop support, you'll need to include the Hadoop client libraries in all the applications you write which are supposed to access HDFS (via textFile, for instance).
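For illustration, a minimal Java sketch of an application that uses those bundled client libraries to read from an already-running HDFS cluster (the NameNode host, port and path are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadFromHdfs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("read-from-hdfs");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The pre-built Spark package ships the HDFS *client*; the NameNode
            // below has to belong to a separately installed Hadoop cluster.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode-host:9000/data/input.txt");
            System.out.println("Line count: " + lines.count());
        }
    }
}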
I am also using the same Spark on my Windows 10 machine. What I did was create a C:\winutils\bin directory and put winutils.exe there, then create a HADOOP_HOME=C:\winutils variable. If you have set all the
env variables and PATH entries, like SPARK_HOME and HADOOP_HOME, then it should work.
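For completeness, the same hint can also be given programmatically through the hadoop.home.dir system property before Spark starts, as in this minimal sketch (the C:\winutils path mirrors the answer above and is an assumption about your layout):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class WindowsSparkBootstrap {
    public static void main(String[] args) {
        // Equivalent of HADOOP_HOME=C:\winutils, set from code before any
        // Spark/Hadoop classes are used; bin\winutils.exe must exist in there.
        System.setProperty("hadoop.home.dir", "C:\\winutils");

        SparkConf conf = new SparkConf().setAppName("windows-bootstrap").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Spark version: " + sc.version());
        sc.stop();
    }
}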

hadoop - Where are input/output files stored in hadoop and how to execute java file in hadoop?

Suppose I write a Java program and I want to run it in Hadoop, then
where should the file be saved?
how to access it from hadoop?
should I be calling it by the following command? hadoop classname
what is the command in hadoop to execute the java file?
The simplest answers I can think of to your questions are:
1) Anywhere
2, 3, 4) $HADOOP_HOME/bin/hadoop jar [path_to_your_jar_file]
A similar question was asked here: Executing helloworld.java in apache hadoop
It may seem complicated, but it's simpler than you might think!
Compile your map/reduce classes, and your main class into a jar. Let's call this jar myjob.jar.
This jar does not need to include the Hadoop libraries, but it should include any other dependencies you have.
Your main method should set up and run your map/reduce job; a sketch is given after these steps.
Put this jar on any machine with the hadoop command line utility installed.
Run your main method using the hadoop command line utility:
hadoop jar myjob.jar
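For reference, a minimal Java sketch of such a driver with a trivial stand-in mapper and reducer (all class names here are made up for the example; substitute your own job logic):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {

    // Trivial mapper: emits each input line with a count of 1.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, ONE);
        }
    }

    // Trivial reducer: sums the counts for each line.
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "my-job");
        job.setJarByClass(MyJobDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a driver like this, the command in the last step also passes the input and output paths, e.g. hadoop jar myjob.jar MyJobDriver /input/path /output/path (the class name can be dropped if it is set as the jar's main class).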
Hope that helps.
where should the file be saved?
The data should be saved in HDFS. You will probably want to load it into the cluster from your data source using something like Apache Flume. The file can be placed anywhere, but the usual home directory is /user/hadoop/.
how to access it from hadoop?
SSH into the Hadoop cluster headnode like a standard Linux server.
To list the root of HDFS:
hadoop fs -ls /
should I be calling it by the following command? hadoop classname
You should be using the hadoop command to access your data and run your programs, try hadoop help
what is the command in hadoop to execute the java file?
hadoop jar MyJar.jar com.mycompany.MainDriver arg[0] arg[1] ...
