Find and set Hadoop logs to verbose level - hadoop

I need to track what is happening when I run a job or upload a file to HDFS. I do this using sql profiler in sql server. However, I miss such a tool for hadoop and so I am assuming that I can get some information from logs. I thing all logs are stored at /var/logs/hadoop/ but I am confused with what file I need to look at and how to set that file to capture detailed level information.
I am using HDP2.2.
Thanks,
Sree

'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. Location of logs is distribution dependent. See File Locations for Hortonworks or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN applicaiton yet with its own log.
Jobs consists of taks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher level components (eg. Hive, Pig, Kafka) have their own logs, asside from the logs resulted from the jobs they submit (which are logging as any job does).
The good news is that vendor specific distribution (Cloudera, Hortonworks etc) will provide some specific UI to expose the most common logs for ease access. Usually they expose the JobHistory service collected logs from the UI that shows job status and job history.
I cannot point you to anything SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions and vendor specific distributions being involved. I recommend to start by reading about and learning how the Job History server runs and how it can be accessed.

Related

override hadoop user logs | queue specific

I have one hadoop job which is running in cluster of 300 nodes, for my job I have one specific queue in which job will get executed.
Job is running fine over production but it's generating too much log under userlogs folder for particular application id , I have executed hadoop merge command and get file of size of 290 GB.
I can see hadoop logging too much in syslog.
I have some queries over it , if anyone can guide me that would be great help for me -
1)- Logs in syslog is based on input data
2)- Logs in syslog based on hive query (As I can see all the entries are related to Hadoop processing, I don't think hive query have any impact in over creation of log)
3)- is there any way to reduce info in syslog for any specfic job running in huge cluser with interfering cluster configuration (for other jobs)
Logs in hadoop shows data from container allocation by YARN, Mapping, Reducing to the final result written.
Logging during Hive execution on a Hadoop cluster is controlled by
Hadoop configuration. Usually Hadoop will produce one log file per map
and reduce task stored on the cluster machine(s) where the task was
executed. The log files can be obtained by clicking through to the
Task Details page from the Hadoop JobTracker Web UI.
Refer: Hive Logging
To configure Hadoop logs, refer: How To Configure-Log4j_Configuration

Get status when running job without hadoop

When I run a hadoop job with the hadoop application it prints a lot of stuff. Among them, It show the relative progress of the job ("map: 30%, reduce: 0%" and stuff like that). But, when running a job without the application it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....
You can get this information from Application Master (assuming you use YARN and not MR1 where you would get it from Job Tracker). There is usually web UI where you can find this information. Details will depend on your Hadoop installation / distribution.
In case of Hadoop v1 check Job tracker web URL and in case of Hadoop v2 check Application Master web UI

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my spark cluster as standalone mode. I am reading data from flat files or Cassandra(depending upon the job) and writing back the processed data to the Cassandra itself.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode all you job related files would be copied on to one of the machines on the cluster which would then submit the job on your behalf, if you submit the application in client mode the machine from which the job is being submitted would be taking care of driver related activities. This means that the machine from which the job has been submitted cannot go offline, whereas in cluster mode the machine from which the job has been submitted can go offline.
Having a Cassandra cluster would also not change any of these behaviors except it can save you network traffic if you can get the nearest contact point for the spark executor(Just like Data locality).
The failed stages gets rescheduled if you use either of the cluster managers.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
In Standalone cluster model, each application uses all the available nodes in the cluster.
From spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster) , you can prefer YARN.
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Not sure since your application logic is not known. But you can give a try with YARN.
Have a look at related SE question for benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

Hive / Tez job won't start

I am trying to create an ORC table in Hive by importing from a text file in HDFS. I have tried multiple different ways, searched online for help, and regardless the insert job won't start.
I can get the text file to HDFS, I can read the text file to Hive, but I cannot convert from that to ORC.
I tried many different variations, including this one that can be used as a reference to this question:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
I have a single-node HDP cluster (being used for development) - version:
HDP-2.3.2.0
(2.3.2.0-2950)
And here are the relevant service versions:
Service Version Status Description
HDFS 2.7.1.2.3 Installed Apache Hadoop Distributed File System
MapReduce2 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
YARN 2.7.1.2.3 Installed Apache Hadoop NextGen MapReduce (YARN)
Tez 0.7.0.2.3 Installed Tez is the next generation Hadoop Query Processing framework written on top of YARN.
Hive 1.2.1.2.3 Installed Data warehouse system for ad-hoc queries & analysis of large datasets and table & storage management service
What happens when I run a SQL like this (again, I've tried many variations including directly from online tutorials):
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
My job stays like this:
Total number of applications (application-types: [] and states:
[SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1455989658079_0002 HIVE-3f41161c-b806-4e7d-974e-c18e028d683f TEZ hive root.hive ACCEPTED UNDEFINED 0% N/A
And it just hangs there. (Literally, I've tried a 20 row sample table and let it run for hours before killing it).
I am by no means an Hadoop expert (yet) and am sure it's probably a config issue, but I have been unable to figure it out.
All other Hive operations I've tried, such as creating dropping tables, loading a file to a text table, selects, all work fine. It's just when I create an ORC table that it does this. And I need an ORC table for my requirement.
Any advice would be helpful.
Most of the time it has to do with increasing your Yarn Scheduling capacity, but if your resources are already capped you can also reduce the amount of memory requested by individual TEZ tasks, through adjusting the following property in TEZ configuration :
task.resource.memory.mb
In order to increase the Cluster's capacity you can do it in the configuration settings of YARN or directly through Ambari or Cloudera Manager
In order to monitor what is happening behind the hoods you can run Yarn Resource Manager UI and check the diagnostics tab of the specific Application there are useful explicit messages about resource allocation especially when the job is accepted and keeps pending.

can the same code be used for both hadoop and yarn

I have been thinking about this question for a while now. I have been trying to compare the performance of hadoop 1 vs yarn by running the basic word count example. I am still unsure about how the same .jar file can be used to execute on both the frameworks. As far as I understand yarn has a different set of api's which it uses to set connection with resource manager, create an application master etc.
So if I develop an application(.jar), can it be run on both the frameworks without any change in code?
Also what could be meaningful parameters to differentiate hadoop vs yarn for a particular application?
Ok, let's clear up some terms here.
Hadoop is the umbrella system that contains the various components needed for distributed storage and processing. I believe the term you're looking for when you say hadoop 1 is MapReduce v1 (MRv1)
MRv1 is a component of Hadoop that includes the job tracker and task trackers. It only relies on HDFS.
YARN is a component of Hadoop that abstracts out the resource management part of MRv1.
MRv2 is the mapreduce application rewritten to run on top of YARN.
So when you're asking if hadoop 1 is interchangeable with YARN, you're probably actually asking if MRv1 is interchangeable with MRv2. And the answer is generally, yes. The Hadoop system knows how to run the same mapreduce application on both mapreduce platforms.
Adding to climbage's answer:
HADOOP Version 1
The JobTracker is responsible for resource management---managing the slave nodes--- major functions involve
tracking resource consumption/availability
job life-cycle management---scheduling individual tasks of the job, tracking progress, providing fault tolerance for tasks.
Issues with Hadoop v1
JobTracker is responsible for all spawned MR applications, it is a single point of failure---If JobTracker goes down, all applications in the cluster are killed. Moreover, if the cluster has a large number of applications, JobTracker becomes the performance bottleneck, to address the issues of scalability and job management Hadoop v2 was released.
Hadoop v2
The fundamental idea of YARN is to split the two major responsibilities of the Job-Tracker—that is, resource management and job scheduling/monitoring—into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and per-node slave, the NodeManager (NM), form the new, and generic, operating system for managing applications in a distributed manner.
To interact with the new resourceManagement and Scheduling, A Hadoop YARN mapReduce Application is developed---MRv2 has nothing to do with the mapReduce programming API
Application programmers will see no difference between MRv1 and MRv2, MRv2 is fully backward compatible---Yes an application(.jar), can be run on both the frameworks without any change in code.
MapReduce was previously integrated in Hadoop Core---the only API to interact with data in HDFS. Now In Hadoop v2 it runs as a separate Application, Hadoop v2 allows other application programming frameworks---e.g MPI---to process HDFS data.

Resources