override hadoop user logs | queue specific - hadoop

I have one hadoop job which is running in cluster of 300 nodes, for my job I have one specific queue in which job will get executed.
Job is running fine over production but it's generating too much log under userlogs folder for particular application id , I have executed hadoop merge command and get file of size of 290 GB.
I can see hadoop logging too much in syslog.
I have some queries over it , if anyone can guide me that would be great help for me -
1)- Logs in syslog is based on input data
2)- Logs in syslog based on hive query (As I can see all the entries are related to Hadoop processing, I don't think hive query have any impact in over creation of log)
3)- is there any way to reduce info in syslog for any specfic job running in huge cluser with interfering cluster configuration (for other jobs)

Logs in hadoop shows data from container allocation by YARN, Mapping, Reducing to the final result written.
Logging during Hive execution on a Hadoop cluster is controlled by
Hadoop configuration. Usually Hadoop will produce one log file per map
and reduce task stored on the cluster machine(s) where the task was
executed. The log files can be obtained by clicking through to the
Task Details page from the Hadoop JobTracker Web UI.
Refer: Hive Logging
To configure Hadoop logs, refer: How To Configure-Log4j_Configuration

Related

spark Yarn mode how to get applicationId from spark-submit

When I submit spark job using spark-submit with master yarn and deploy-mode cluster, it doesn't print/return any applicationId and once job is completed I have to manually check MapReduce jobHistory or spark HistoryServer to get the job details.
My cluster is used by many users and it takes lot of time to spot my job in jobHistory/HistoryServer.
is there any way to configure spark-submit to return the applicationId?
Note: I found many similar questions but their solutions retrieve applicationId within the driver code using sparkcontext.applicationId and in case of master yarn and deploy-mode cluster the driver also run as a part of mapreduce job, any logs or sysout printed to remote host log.
Here are the approaches that I used to achieve this:
Save the application Id to HDFS file. (Suggested by #zhangtong in comment).
Send an email alert with applictionId from driver.

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. I do this using sql profiler in sql server. However, I miss such a tool for hadoop and so I am assuming that I can get some information from logs. I thing all logs are stored at /var/logs/hadoop/ but I am confused with what file I need to look at and how to set that file to capture detailed level information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. Location of logs is distribution dependent. See File Locations for Hortonworks or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN applicaiton yet with its own log.
Jobs consists of taks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher level components (eg. Hive, Pig, Kafka) have their own logs, asside from the logs resulted from the jobs they submit (which are logging as any job does).
The good news is that vendor specific distribution (Cloudera, Hortonworks etc) will provide some specific UI to expose the most common logs for ease access. Usually they expose the JobHistory service collected logs from the UI that shows job status and job history.
I cannot point you to anything SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions and vendor specific distributions being involved. I recommend to start by reading about and learning how the Job History server runs and how it can be accessed.

Hive / Tez job won't start

I am trying to create an ORC table in Hive by importing from a text file in HDFS. I have tried multiple different ways, searched online for help, and regardless the insert job won't start.
I can get the text file to HDFS, I can read the text file to Hive, but I cannot convert from that to ORC.
I tried many different variations, including this one that can be used as a reference to this question:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
I have a single-node HDP cluster (being used for development) - version:
HDP-2.3.2.0
(2.3.2.0-2950)
And here are the relevant service versions:
Service Version Status Description
HDFS 2.7.1.2.3 Installed Apache Hadoop Distributed File System
MapReduce2 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
YARN 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
Tez 0.7.0.2.3 Installed Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive 1.2.1.2.3 Installed Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
What happens when I run a SQL like this (again, I've tried many variations including directly from online tutorials):
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
My job stays like this:
Total number of applications (application-types: [] and states:
[SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1455989658079_0002 HIVE-3f41161c-b806-4e7d-974e-c18e028d683f TEZ hive root.hive ACCEPTED UNDEFINED 0% N/A
And it just hangs there. (Literally, I've tried a 20 row sample table and let it run for hours before killing it).
I am by no means an Hadoop expert (yet) and am sure it's probably a config issue, but I have been unable to figure it out.
All other Hive operations I've tried, such as creating dropping tables, loading a file to a text table, selects, all work fine. It's just when I create an ORC table that it does this. And I need an ORC table for my requirement.
Any advice would be helpful.
Most of the time it has to do with increasing your Yarn Scheduling capacity, but if your resources are already capped you can also reduce the amount of memory requested by individual TEZ tasks, through adjusting the following property in TEZ configuration :
task.resource.memory.mb
In order to increase the Cluster's capacity you can do it in the configuration settings of YARN or directly through Ambari or Cloudera Manager
In order to monitor what is happening behind the hoods you can run Yarn Resource Manager UI and check the diagnostics tab of the specific Application there are useful explicit messages about resource allocation especially when the job is accepted and keeps pending.

How to get a spark job's metrics?

we have a cluster which has about 20 nodes. This cluster is shared among many users and jobs. Therefore, it is very difficult for me to observe my job so that I can get some metrics such as CPU usage, I/O, Network, Memory etc...
How can I get a metrics on job level.
PS: The cluster already have Ganglia installed but not sure how I could get it to work on the job level. What I would like to do is monitor the resource used by the cluster to execute my job only.
You can get the spark job metrics from Spark History Server, which displays information about:
- A list of scheduler stages and tasks
- A summary of RDD sizes and memory usage
- A Environmental information
- A Information about the running executors
1, Set spark.eventLog.enabled to true before starting the spark application. This configures Spark to log Spark events to persisted storage.
2, Set spark.history.fs.logDirectory, this is the directory that contains application event logs to be loaded by the history server;
3, Start the history server by executing: ./sbin/start-history-server.sh
please refer to below link for more information:
http://spark.apache.org/docs/latest/monitoring.html

Hadoop removes MapReduce history when it is restarted

I am carrying out several Hadoop tests using TestDFSIO and TeraSort benchmark tools. I am basically testing with different amount of datanodes in order to assess the linearity of the processing capacity and datanode scalability.
During the above mentioned process, I have obviously had to restart several times all Hadoop environment. Every time I restarted Hadoop, all MapReduce jobs are removed and the job counter starts again from "job_2013*_0001". For comparison reasons, it is very important for me to keep all the MapReduce jobs up that I have previously launched. So, my question is:
¿How can I avoid Hadoop removes all MapReduce-job history after it is restarted?
¿Is there some property to control job removing after Hadoop environment restarting?
Thanks!
the MR job history logs are not deleted right way after you restart hadoop, the new job will be counted from *_0001 and only new jobs which are started after hadoop restart will be displayed on resource manager web portal though. In fact, there are 2 log related settings from yarn default:
# this is where you can find the MR job history logs
yarn.nodemanager.log-dirs = ${yarn.log.dir}/userlogs
# this is how long the history logs will be retained
yarn.nodemanager.log.retain-seconds = 10800
and the default ${yarn.log.dir} is defined in $HADOOP_HONE/etc/hadoop/yarn-env.sh.
YARN_LOG_DIR="$HADOOP_YARN_HOME/logs"
BTW, similar settings could also be found in mapred-env.sh if you are use Hadoop 1.X

Resources