Using Spark with S3 fails on EMR, despite Hadoop access working [duplicate]

This question already has answers here:
Spark read file from S3 using sc.textFile ("s3n://...)
(14 answers)
Closed 5 years ago.
I am trying to access an s3:// path with
spark.read.parquet("s3://<path>")
and I get this error:
Py4JJavaError: An error occurred while calling o31.parquet. :
java.io.IOException: No FileSystem for scheme: s3
However, running the following line
hadoop fs -ls <path>
does work.
So I guess this might be a configuration issue between Hadoop and Spark.
How can this be solved?
EDIT
After reading the suggested answer, I tried adding the JARs hard-coded to the Spark config:
spark = SparkSession \
    .builder.master("spark://" + master + ":7077") \
    .appName("myname") \
    .config("spark.jars", "/usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.221.jar,/usr/share/aws/aws-java-sdk/hadoop-aws.jar") \
    .config("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2") \
    .getOrCreate()
Still no success.

The hadoop-aws dependency is missing from your project. Please add hadoop-aws to your build.
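For example, here is a minimal sketch that pulls hadoop-aws (and the aws-java-sdk it pairs with) in through spark.jars.packages and reads through the s3a:// connector that hadoop-aws registers. The versions, app name, and bucket path are assumptions; match the hadoop-aws version to the Hadoop version on your cluster.

from pyspark.sql import SparkSession

# The versions below are assumptions -- use the hadoop-aws release that matches
# the Hadoop version on your cluster and the aws-java-sdk it was built against.
packages = "org.apache.hadoop:hadoop-aws:2.7.2,com.amazonaws:aws-java-sdk:1.7.4"

spark = SparkSession.builder \
    .appName("s3-read") \
    .config("spark.jars.packages", packages) \
    .getOrCreate()

# hadoop-aws registers the s3a:// filesystem, so read through that scheme.
df = spark.read.parquet("s3a://<bucket>/<path>")

On EMR itself the plain s3:// scheme is normally served by EMRFS, which is why hadoop fs (running with the EMR classpath) works while the standalone spark:// cluster in the question cannot resolve the scheme.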

Related

Sqoop failing when importing as avro in AWS EMR

I'm trying to perform a Sqoop import on Amazon EMR (Hadoop 2.8.5, Sqoop 1.4.7). The import works fine when no Avro option (--as-avrodatafile) is specified, but once it is set the job fails with
19/10/29 21:31:35 INFO mapreduce.Job: Task Id : attempt_1572305702067_0017_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
Using the option -D mapreduce.job.user.classpath.first=true doesn't work.
Running locally (on my machine) I found that copying the avro-1.8.1.jar shipped with Sqoop into the Hadoop lib folder works, but on the EMR cluster I only have access to the master node, so doing the same doesn't work because it isn't the master node that runs the jobs.
Did anyone face this problem?
The solution I found was to connect to every node in the cluster (I thought I only had access to the master node, but I was wrong; in EMR we have access to all nodes) and replace the Avro JAR that ships with Hadoop with the Avro JAR that comes with Sqoop. It's not an elegant solution, but it works.
[UPDATE]
It turned out that the option -D mapreduce.job.user.classpath.first=true wasn't working because I was using s3a as the target dir, when Amazon says we should use s3. As soon as I started using s3, Sqoop performed the import correctly, so there is no need to replace any files on the nodes. Using s3a can lead to strange errors on EMR because of Amazon's own configuration, so don't use it there. Even in terms of performance, s3 is better than s3a on EMR, since the s3 implementation is Amazon's own.

Rename files created in Hadoop - Spark [duplicate]

This question already has an answer here:
HDFS: move multiple files using Java / Scala API
(1 answer)
Closed 5 years ago.
Files created in HDFS via write have their own naming convention. One way to change a file to a custom name is a script that uses hadoop fs -mv oldname newname.
Is there any other option available in Spark/Hadoop to give the created file a custom name?
Apache Spark does not provide any API for file system operations in HDFS, but you can always use the Hadoop FileSystem API to rename the file in HDFS. Check this for more details on the available Hadoop file system APIs. For renaming, the following will work:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fileSystem = FileSystem.get(conf)
fileSystem.mkdirs(new Path(newhdfs_dirPath))  // create the target directory if it does not exist yet
fileSystem.rename(new Path(existinghdfs_dirpath + oldname), new Path(newhdfs_dirPath + newname))
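If you are calling this from PySpark instead of Scala, the same Hadoop FileSystem API is reachable through Spark's internal Py4J gateway (spark._jvm / spark._jsc). A minimal sketch, assuming an active SparkSession named spark and with placeholder paths:

# The paths below are placeholders; substitute your own.
hadoop_fs = spark._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(spark._jsc.hadoopConfiguration())
src = hadoop_fs.Path("/existing/dir/part-00000")
dst = hadoop_fs.Path("/new/dir/custom-name")
renamed = fs.rename(src, dst)  # returns False if the rename did not take place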

How do I get a working directory of Spark executor in Java? [duplicate]

This question already exists:
Copy files (config) from HDFS to local working directory of every spark executor
Closed 5 years ago.
I need to know the current working directory URI/URL of a Spark executor so I can copy some dependencies there before the job executes. How do I get it in Java? Which API should I call?
The working directory is application specific, so you won't be able to get it before the application starts. It is best to use the standard Spark mechanisms (see the sketch after this list):
--jars / spark.jars - for JAR files.
pyFiles / --py-files - for Python dependencies.
SparkFiles / --files / --archives - for everything else
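For instance, a minimal sketch of the --files route (the file name app.conf and the script name my_job.py are hypothetical): ship the file at submit time and resolve it on the executors with SparkFiles.get, so the code never needs to know the executor's working directory.

# Submit with (hypothetical names):
#   spark-submit --files /local/path/app.conf my_job.py
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-demo").getOrCreate()

def first_line_of_conf(rows):
    # SparkFiles.get resolves app.conf inside this executor's working directory.
    with open(SparkFiles.get("app.conf")) as f:
        first = f.readline().strip()
    return [(first, sum(1 for _ in rows))]

print(spark.sparkContext.parallelize(range(100), 4).mapPartitions(first_line_of_conf).collect())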

Locating the yarn logs on the cluster [duplicate]

This question already has answers here:
Where does Hadoop store the logs of YARN applications?
(2 answers)
Closed 6 years ago.
I use
yarn logs -applicationId "id"
to show the logs on the command line, but I need to locate the files on the cluster. Where are the logs saved on the cluster?
The yarn logs command pulls the logs from HDFS, where they are aggregated after the MapReduce job completes (assuming log aggregation is enabled). The location they are stored in is controlled by:
yarn.nodemanager.remote-app-log-dir
Inside that directory on HDFS you should find a sub-directory for each user, and the logs inside further sub-directories below that.
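As an illustration, a small PySpark sketch that lists the per-user sub-directories, assuming an active SparkSession named spark and the common default location /tmp/logs; check yarn.nodemanager.remote-app-log-dir in yarn-site.xml for your cluster's actual value.

# /tmp/logs is only the common default for yarn.nodemanager.remote-app-log-dir;
# substitute your cluster's configured value if it differs.
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
for user_dir in fs.listStatus(Path("/tmp/logs")):
    print(user_dir.getPath())  # one sub-directory per user; application logs sit below it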

Error: -copyFromLocal: java.net.UnknownHostException

I am new to Java, Hadoop, etc.
I am having a problem when trying to copy a file to HDFS.
It says: "-copyFromLocal: java.net.UnknownHostException: quickstart.cloudera (...)"
How can I solve this? It is an exercise. You can see the problem in the images below.
Image with the problem
Image 2 with the error
Thank you very much.
As the error says, you need to supply the HDFS folder path as the destination, so the command should look like:
hadoop fs -copyFromLocal words.txt /HDFS/Folder/Path
Almost all errors that you get while working in Hadoop are Java errors, as MapReduce was mostly written in Java. But that doesn't mean there is some Java error in it.
