Locating the yarn logs on the cluster [duplicate] - hadoop

This question already has answers here:
Where does Hadoop store the logs of YARN applications?
(2 answers)
Closed 6 years ago.
I use
yarn logs -applicationId "id"
to show the logs on the command line, but I need to locate the files on the cluster. Where are the logs saved on the cluster?

The yarn logs command pulls the logs from HDFS, where they are aggregated after the MapReduce job completes (assuming log aggregation is enabled). The location they are stored in is controlled by:
yarn.nodemanager.remote-app-log-dir
Inside that directory on HDFS you should find a subdirectory for the user, and then the logs inside another subdirectory.
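For example, with the common default of /tmp/logs for that property (check yarn-site.xml on your cluster; the exact value here is an assumption), the aggregated logs for an application can be listed directly on HDFS:
hdfs dfs -ls /tmp/logs/<user>/logs/<application-id>   # one aggregated file per NodeManager
The files themselves are stored in an aggregated (TFile) format, which is why yarn logs remains the most convenient way to actually read them.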

Related

NiFi ListHDFS cannot find directory, FileNotFoundException

I have a pipeline in NiFi of the form ListHDFS -> MoveHDFS. When attempting to run the pipeline we see the following error log:
13:29:21 HSTDEBUG01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Returning CLUSTER State: StandardStateMap[version=43, values={emitted.timestamp=1525468790000, listing.timestamp=1525468790000}]
13:29:21 HSTDEBUG01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Found new-style state stored, latesting timestamp emitted = 1525468790000, latest listed = 1525468790000
13:29:21 HSTDEBUG01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Fetching listing for /hdfs/path/to/dir
13:29:21 HSTERROR01631000-d439-1c41-9715-e0601d3b971c
ListHDFS[id=01631000-d439-1c41-9715-e0601d3b971c] Failed to perform listing of HDFS due to File /hdfs/path/to/dir does not exist: java.io.FileNotFoundException: File /hdfs/path/to/dir does not exist
Changing the ListHDFS path to /tmp seems to run OK, which makes me think that the problem is with my permissions on the directory I'm trying to list. However, changing the NiFi user to a user that can access that directory (e.g. via hadoop fs -ls /hdfs/path/to/dir) by setting the bootstrap.properties value run.as=myuser and restarting (see https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#bootstrap_properties) still produces the same problem for the directory. The literal directory string being used that is not working is:
"/etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest"
Does anyone know what is happening here? Thanks.
** Note: The Hadoop cluster I am accessing does not have Kerberos enabled (it is a secured MapR Hadoop cluster).
Update: It appears that the MapR Hadoop implementation is different enough that it requires special steps for NiFi to work properly on it (see https://community.mapr.com/thread/10484 and http://hariology.com/integrating-mapr-fs-and-apache-nifi/). I may not get a chance to work on this problem for some time to confirm whether this works (as certain requirements have changed), so I am leaving the links here for others who may hit this problem in the meantime.
Could you make sure you have entered the correct path and that the directory actually exists in HDFS?
It looks like the ListHDFS processor is not able to find the directory you configured in its Directory property, and the logs are not showing any permission-denied issues.
If the logs did show permission denied, then you could either change the NiFi run-as user in bootstrap.conf (NiFi needs to be restarted for the change to apply) or change the permissions on the directory so that NiFi has access.
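A quick way to double-check both existence and permissions from the node running NiFi (using the path from the question, and assuming the hadoop client on that node points at the same cluster) is:
hadoop fs -ls /etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1/ingest
hadoop fs -ls /etl/ucera_internal/datagov_example/raw-ingest-tracking/version-1   # confirm the parent directory's read/execute bits
If the first command works for your shell user but the processor still reports FileNotFoundException, the NiFi run-as user or its cluster configuration is the more likely culprit than a missing directory.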

Using spark with s3 fails on EMR, despite hadoop access working [duplicate]

This question already has answers here:
Spark read file from S3 using sc.textFile ("s3n://...)
(14 answers)
Closed 5 years ago.
I am trying to access an s3:// path with
spark.read.parquet("s3://<path>")
And I get this error
Py4JJavaError: An error occurred while calling o31.parquet. :
java.io.IOException: No FileSystem for scheme: s3
However, running the following line
hadoop fs -ls <path>
does work...
So I guess this might be a configuration issue between Hadoop and Spark.
How can this be solved?
EDIT
After reading the suggested answer, I've tried adding the jars hard-coded to the Spark config, with no success:
spark = SparkSession\
    .builder.master("spark://" + master + ":7077")\
    .appName("myname")\
    .config("spark.jars", "/usr/share/aws/aws-java-sdk/aws-java-sdk-1.11.221.jar,/usr/share/aws/aws-java-sdk/hadoop-aws.jar")\
    .config("spark.jars.packages", "com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2")\
    .getOrCreate()
No success
The hadoop-aws dependency is missing from your project. Please add hadoop-aws to your build.
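A minimal sketch of one way to pull the dependency in at submit time (the package version and the job file name are assumptions; match the hadoop-aws version to the Hadoop version on your cluster):
spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.2 \
  my_job.py
# inside the job, read with the s3a scheme that hadoop-aws registers:
#   spark.read.parquet("s3a://<path>")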

How do I get a working directory of Spark executor in Java? [duplicate]

This question already exists:
Copy files (config) from HDFS to local working directory of every spark executor
Closed 5 years ago.
I need to know the current working directory URI/URL of a Spark executor so I can copy some dependencies there before the job executes. How do I get it in Java? What API should I call?
The working directory is application specific, so you won't be able to get it before the application starts. It is best to use the standard Spark mechanisms (a submit example follows below):
--jars / spark.jars - for JAR files.
--py-files / spark.submit.pyFiles - for Python dependencies.
--files / --archives (retrieved on the executor via SparkFiles) - for everything else.
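For instance, a sketch of distributing a config file alongside the job this way (the class, jar, and file names are placeholders):
spark-submit \
  --class com.example.MyJob \
  --jars /local/path/dep1.jar,/local/path/dep2.jar \
  --files /local/path/app.conf \
  my-job.jar
# inside the job, the distributed file can be located on each executor with
#   org.apache.spark.SparkFiles.get("app.conf")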

spark history server does not show jobs or stages

We are trying to use the Spark history server to further improve our Spark jobs. The Spark job correctly writes the event log into HDFS, and the Spark history server can also access this event log: we do see the job in the Spark history server job listing, but aside from the environment variables and executors everything is empty...
Any ideas on how we can make the Spark history server show everything (we really want to see the DAG, for instance)?
We are using Spark 1.4.1.
Thanks.
I had a similar issue. I was browsing the history server through SSH port forwarding. After granting read permission to all the files in the log directory, they appeared in my history server!
cd {SPARK_EVENT_LOG_DIR}
chmod +r * # grant the read permission to all users for all files
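If you are not sure which directory that is, both the job and the history server read it from spark-defaults.conf; a quick check (the property names are the standard Spark ones, the HDFS path below is only an example):
grep -E 'spark.eventLog|spark.history.fs.logDirectory' $SPARK_HOME/conf/spark-defaults.conf
# typical values:
#   spark.eventLog.enabled          true
#   spark.eventLog.dir              hdfs:///spark-history
#   spark.history.fs.logDirectory   hdfs:///spark-history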

Spark: how to set worker-specific SPARK_HOME in standalone mode [duplicate]

This question already has answers here:
How to use start-all.sh to start standalone Worker that uses different SPARK_HOME (than Master)?
(3 answers)
Closed 4 months ago.
I'm setting up a [somewhat ad-hoc] cluster of Spark workers: namely, a couple of lab machines that I have sitting around. However, I've run into a problem when I attempt to start the cluster with start-all.sh: namely, Spark is installed in different directories on the various workers. But the master invokes $SPARK_HOME/sbin/start-all.sh on each one using the master's definition of $SPARK_HOME, even though the path is different for each worker.
Assuming I can't install Spark on identical paths on each worker to the master, how can I get the master to recognize the different worker paths?
EDIT #1 Hmm, found this thread in the Spark mailing list, strongly suggesting that this is the current implementation--assuming $SPARK_HOME is the same for all workers.
I'm playing around with Spark on Windows (my laptop) and have two worker nodes running by starting them manually using a script that contains the following
set SPARK_HOME=C:\dev\programs\spark-1.2.0-worker1
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker spark://master.brad.com:7077
I then create a copy of this script with a different SPARK_HOME defined to run my second worker from. When I kick off a spark-submit I see this on Worker_1
15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker1\bin...
and this on Worker_2
15/02/13 16:42:10 INFO ExecutorRunner: Launch command: ...C:\dev\programs\spark-1.2.0-worker2\bin...
So it works, and in my case I duplicated the Spark installation directory, but you may be able to get around this.
You might want to consider setting this by changing the SPARK_WORKER_DIR line in the spark-env.sh file, for example as sketched below.
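A one-line sketch of that change (the path is a placeholder; note that SPARK_WORKER_DIR controls the worker's work/scratch directory, which defaults to SPARK_HOME/work):
# in $SPARK_HOME/conf/spark-env.sh on each worker
export SPARK_WORKER_DIR=/data/spark/work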
A similar question was asked here
The solution I used was to create a symbolic link mimicking the master node's installation path on each worker node, so that when start-all.sh runs on the master node and SSHes into a worker node, it sees identical paths for running the worker scripts.
For example, in my case I had 2 Macs and 1 Linux machine. Both Macs had Spark installed under /Users/<user>/spark, whereas the Linux machine had it under /home/<user>/spark. One of the Macs was the master node, so running start-all.sh would error each time on the Linux machine due to the path mismatch (error: /Users/<user>/spark does not exist).
The simple solution was to mimic the Mac's pathing on the Linux machine using a symbolic link:
Open a terminal, then:
cd /                   # go to the root of the drive
sudo ln -s home Users  # create a symlink "Users" pointing to the actual "home" directory
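After that, the master's path resolves on the Linux worker as well; a quick check (assuming the layout above):
ls -ld /Users            # should show: Users -> home
ls /Users/<user>/spark   # same contents as /home/<user>/spark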
