How to get a spark job's metrics?

We have a cluster of about 20 nodes that is shared among many users and jobs. This makes it very difficult to observe my own job and collect metrics such as CPU usage, I/O, network, and memory.
How can I get metrics at the job level?
PS: The cluster already has Ganglia installed, but I am not sure how to make it work at the job level. What I would like to do is monitor only the resources the cluster uses to execute my job.

You can get Spark job metrics from the Spark History Server, which displays:
- A list of scheduler stages and tasks
- A summary of RDD sizes and memory usage
- Environmental information
- Information about the running executors
To set it up:
1. Set spark.eventLog.enabled to true before starting the Spark application. This configures Spark to log Spark events to persisted storage.
2. Set spark.history.fs.logDirectory to the directory that contains the application event logs to be loaded by the history server. The application writes its event logs to spark.eventLog.dir, so both settings should point to the same location.
3. Start the history server by executing: ./sbin/start-history-server.sh
Refer to the link below for more information:
http://spark.apache.org/docs/latest/monitoring.html
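Once the history server is running, the same information is also exposed through its REST API under /api/v1, which is described on the monitoring page linked above. Below is a minimal Python sketch that sums a few per-executor counters for one application. The host, the default port 18080, and the helper names are assumptions for illustration; for applications run in YARN cluster mode the URL may also need an attempt ID segment after the application ID.

```python
import json
from urllib.request import urlopen

# Counters reported per executor by the /executors endpoint.
COUNTERS = (
    "totalTasks",
    "totalDuration",
    "totalGCTime",
    "totalInputBytes",
    "totalShuffleRead",
    "totalShuffleWrite",
)


def executors_url(history_server, app_id):
    """Build the executors endpoint URL (history_server is e.g. http://host:18080)."""
    return f"{history_server.rstrip('/')}/api/v1/applications/{app_id}/executors"


def summarize_executors(executors):
    """Sum each counter over all executors; missing counters count as 0."""
    return {key: sum(e.get(key, 0) for e in executors) for key in COUNTERS}


def fetch_executor_summary(history_server, app_id):
    """Fetch the executor list for one application and aggregate it."""
    with urlopen(executors_url(history_server, app_id)) as resp:
        return summarize_executors(json.load(resp))
```

Calling fetch_executor_summary("http://localhost:18080", "<your application id>") would return one dictionary of totals for that job only, which is closer to job-level metrics than the node-level graphs Ganglia provides.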

Related

override hadoop user logs | queue specific

I have a Hadoop job running on a 300-node cluster, and it executes in a queue dedicated to it.
The job runs fine in production, but it generates too many logs under the userlogs folder for a given application ID. When I merged them with the hadoop merge command, the resulting file was 290 GB.
I can see that Hadoop is logging too much to syslog.
I have a few questions, and any guidance would be a great help:
1) Does the amount of logging in syslog depend on the input data?
2) Does the amount of logging in syslog depend on the Hive query? (All the entries I can see relate to Hadoop processing, so I don't think the Hive query contributes to the excess logging.)
3) Is there any way to reduce the info written to syslog for a specific job running on a huge cluster without interfering with the cluster configuration (for other jobs)?

Hadoop logs cover everything from container allocation by YARN, through the map and reduce phases, to the writing of the final result.
Logging during Hive execution on a Hadoop cluster is controlled by the Hadoop configuration. Usually Hadoop produces one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.
Refer: Hive Logging
To configure Hadoop logs, refer: How To Configure-Log4j_Configuration
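For question 3, one option that is not mentioned in the answer above is to override the task log level for a single job instead of editing the cluster-wide Log4j configuration. Hadoop 2 has the per-job properties mapreduce.map.log.level, mapreduce.reduce.log.level, and yarn.app.mapreduce.am.log.level, and Hive accepts SET statements for them. The Python helper below is a hypothetical sketch that only generates those statements; whether this reduces the log volume enough depends on what is actually being logged at INFO level.

```python
# Log levels accepted by the Hadoop per-job log level properties.
VALID_LEVELS = {"OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE", "ALL"}

# Per-job properties for map tasks, reduce tasks, and the MapReduce ApplicationMaster.
LOG_LEVEL_PROPERTIES = (
    "mapreduce.map.log.level",
    "mapreduce.reduce.log.level",
    "yarn.app.mapreduce.am.log.level",
)


def hive_log_level_statements(level="WARN"):
    """Return Hive SET statements that apply `level` to this session's jobs only."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level}")
    return [f"SET {prop}={level};" for prop in LOG_LEVEL_PROPERTIES]


if __name__ == "__main__":
    # Paste the output at the top of the Hive script that runs the job.
    print("\n".join(hive_log_level_statements("WARN")))
```

Because the settings travel with the job, other jobs on the cluster keep their existing log levels.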

Find and set Hadoop logs to verbose level

I need to track what happens when I run a job or upload a file to HDFS. In SQL Server I do this with SQL Profiler. I miss such a tool for Hadoop, so I assume I can get some of this information from logs. I think all logs are stored in /var/logs/hadoop/, but I am not sure which file to look at or how to configure it to capture detailed information.
I am using HDP2.2.
Thanks,
Sree

'Hadoop' represents an entire ecosystem of different products, and each one has its own logging.
HDFS consists of NameNode and DataNode services, each with its own log. The location of the logs depends on the distribution. See File Locations for Hortonworks, or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about the ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN application yet has its own log.
Jobs consist of tasks, and the tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from executed jobs.
Higher-level components (e.g. Hive, Pig, Kafka) have their own logs, aside from the logs produced by the jobs they submit (which log as any job does).
The good news is that vendor-specific distributions (Cloudera, Hortonworks, etc.) provide a UI that exposes the most common logs for easy access. Usually they expose the logs collected by the JobHistory service from the UI that shows job status and job history.
I cannot point you to anything equivalent to SQL Profiler, because the problem space is orders of magnitude more complex, with many different products, versions, and vendor-specific distributions involved. I recommend starting by learning how the Job History server runs and how it can be accessed.
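As a starting point for exploring the Job History server, it also exposes a REST API next to its web UI. The Python sketch below lists finished MapReduce jobs with their state and run time. The host and port are assumptions (19888 is the usual default for the JobHistory web address), and the function names are made up for illustration.

```python
import json
from urllib.request import Request, urlopen


def jobs_url(history_server):
    """Build the job list URL (history_server is e.g. http://host:19888)."""
    return f"{history_server.rstrip('/')}/ws/v1/history/mapreduce/jobs"


def parse_jobs(payload):
    """Turn the JSON payload into (job id, state, elapsed milliseconds) tuples."""
    # The server returns {"jobs": null} when there is no job history yet.
    jobs = (payload.get("jobs") or {}).get("job") or []
    return [(j["id"], j["state"], j["finishTime"] - j["startTime"]) for j in jobs]


def list_jobs(history_server):
    """Fetch and parse the list of jobs known to the JobHistory server."""
    req = Request(jobs_url(history_server), headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return parse_jobs(json.load(resp))
```

From a job ID returned here you can drill down to tasks and task attempts through the same API, which is the closest thing to a per-job trace that the platform offers out of the box.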

How to get CPU usage profile of a yarn based Spark application

I am running an Apache Spark application on YARN, on a Hadoop cluster. After the program has finished, is there a way to check its CPU usage profile? Basically, I want a profiling log at intervals of, say, 1 or 2 seconds.

You can use the ResourceManager REST APIs:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
Basically, you need to implement a REST client that queries the ResourceManager every 1 or 2 seconds and writes your own logs at runtime.
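A minimal Python sketch of such a client is shown below. It polls the Cluster Application endpoint for one application until it reaches a final state and keeps one row per sample. The ResourceManager address and the function names are assumptions for illustration. Note that the fields it records (allocatedVCores, allocatedMB, vcoreSeconds, memorySeconds) describe what YARN has allocated to the application, not the CPU time the processes actually consumed, so this is only an approximation of a CPU profile.

```python
import json
import time
from urllib.request import Request, urlopen

# Fields of the "app" object that are recorded in every sample.
FIELDS = (
    "state",
    "allocatedVCores",
    "allocatedMB",
    "runningContainers",
    "vcoreSeconds",
    "memorySeconds",
)
FINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def fetch_app(rm_url, app_id):
    """Fetch the application report (rm_url is e.g. http://host:8088)."""
    url = f"{rm_url.rstrip('/')}/ws/v1/cluster/apps/{app_id}"
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["app"]


def sample(app, now):
    """Reduce one application report to a timestamped row."""
    row = {"timestamp": now}
    row.update({key: app.get(key) for key in FIELDS})
    return row


def poll(rm_url, app_id, interval=2.0, fetch=fetch_app, sleep=time.sleep, clock=time.time):
    """Sample the application every `interval` seconds until it finishes."""
    rows = []
    while True:
        app = fetch(rm_url, app_id)
        rows.append(sample(app, clock()))
        if app.get("state") in FINAL_STATES:
            return rows
        sleep(interval)
```

The poller has to be started while the application is still running; the fetch, sleep, and clock parameters exist only so the loop can be exercised without a live cluster.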

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my Spark cluster in standalone mode. I read data from flat files or Cassandra (depending on the job) and write the processed data back to Cassandra.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, will it give me additional advantages such as shorter execution time and better resource management?
Currently, when I process a huge chunk of data, stages sometimes fail during shuffling. If I migrate to YARN, can the resource manager address this issue?

The Spark standalone cluster manager can also give you cluster mode capabilities.
A Spark standalone cluster provides almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all your job-related files are copied to one of the machines in the cluster, which then submits the job on your behalf. If you submit the application in client mode, the machine from which the job is submitted takes care of the driver-related activities. This means that in client mode the submitting machine cannot go offline, whereas in cluster mode it can.
Having a Cassandra cluster does not change any of this behavior either, except that it can save network traffic if the Spark executor can use the nearest contact point (similar to data locality).
Failed stages are rescheduled whichever cluster manager you use.

"I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?"
In the standalone cluster model, each application by default uses all the available nodes in the cluster.
From the spark-standalone documentation page:
"The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time."
In other cases (when you are running multiple applications in the cluster), you may prefer YARN.
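If you stay on standalone mode, the cap mentioned in the documentation quote is set with spark.cores.max. The Python sketch below only assembles a spark-submit command line with that cap; the master URL, application file, and the helper name are placeholders, and the right values depend on your cluster.

```python
def spark_submit_args(master, app, max_cores, executor_memory="2g"):
    """Build a spark-submit command that caps the cores one application may take."""
    if max_cores < 1:
        raise ValueError("max_cores must be at least 1")
    return [
        "spark-submit",
        "--master", master,
        # Upper bound on the total cores this application acquires in the cluster.
        "--conf", f"spark.cores.max={max_cores}",
        "--executor-memory", executor_memory,
        app,
    ]


if __name__ == "__main__":
    # Hypothetical master URL and application file.
    print(" ".join(spark_submit_args("spark://master-host:7077", "my_job.py", 8)))
```

With a cap in place, several applications can run side by side on a standalone cluster, although scheduling across them is still FIFO.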
"Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?"
I am not sure, since your application logic is not known, but you can give YARN a try.
Have a look at this related SE question on the benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

Hadoop removes MapReduce history when it is restarted

I am running several Hadoop tests using the TestDFSIO and TeraSort benchmark tools. I am testing with different numbers of datanodes in order to assess how linearly processing capacity scales with the number of datanodes.
During this process I have had to restart the whole Hadoop environment several times. Every time I restart Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison purposes, it is very important for me to keep all the MapReduce jobs I have previously launched. So, my questions are:
How can I prevent Hadoop from removing all MapReduce job history after a restart?
Is there a property that controls job removal when the Hadoop environment restarts?
Thanks!

The MR job history logs are not deleted right away after you restart Hadoop. However, job numbering restarts from *_0001, and only jobs started after the restart are displayed on the ResourceManager web portal. In fact, there are two log-related settings in the YARN defaults:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained (10800 seconds = 3 hours)
yarn.nodemanager.log.retain-seconds = 10800
The default ${yarn.log.dir} is defined in $HADOOP_HOME/etc/hadoop/yarn-env.sh:
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings can also be found in mapred-env.sh if you are using Hadoop 1.X.
